dvc: dvc status: 'up-to-date' but cache is corrupted
Bug Report
dvc status: ‘up-to-date’ but cache is corrupted
Description
Context: on a Windows machine (git-bash), a `dvc pull` from an AWS S3 bucket occasionally fails due to various network connection problems. This seems to be the origin of the following problem:
Problem:
- `dvc pull` (possibly only a partial pull that fails, with subsequent retries until 'everything is up to date')
- `dvc status` outputs: 'data and pipelines are up to date'
- one file within a dvc-added directory is obviously corrupted (in this case a .wav file with missing channel data)
- tracking the associated .dir file through the dvc cache, we open the JSON listing the md5sums of the associated files
- we find the md5sum of the corrupted file
- we recompute the md5sum manually on the corrupted file within the cache and, as expected given the corruption, it differs from the filename under which the file is stored (which should itself be that md5sum); see the sketch after this list
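For reference, a minimal Python sketch of that manual check, assuming the default DVC 2.x local cache layout (an object with hash `abcd…` stored at `.dvc/cache/ab/cd…`, and a `.dir` entry being a JSON list of `{"md5": ..., "relpath": ...}` records). The placeholder hash is illustrative only:

```python
import hashlib
import json
import os

CACHE_DIR = os.path.join(".dvc", "cache")  # default local cache (adjust if configured differently)

def cache_path(md5):
    # DVC 2.x layout: first two hex chars form a subdirectory, the rest is the filename
    return os.path.join(CACHE_DIR, md5[:2], md5[2:])

def file_md5(path, chunk_size=1024 * 1024):
    # Plain MD5 of the raw bytes; this matches DVC's hash for binary files such as .wav
    # (text files are dos2unix-normalized by DVC 2.x before hashing, see the maintainer comment below)
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dir_entry(dir_hash):
    # dir_hash is the value recorded in the .dvc file for the directory, e.g. "d41d8cd9….dir"
    with open(cache_path(dir_hash)) as f:
        entries = json.load(f)  # list of {"md5": ..., "relpath": ...}
    for entry in entries:
        expected = entry["md5"]
        actual = file_md5(cache_path(expected))
        if actual != expected:
            print(f"corrupted: {entry['relpath']} expected {expected} got {actual}")

verify_dir_entry("<md5-from-the-.dvc-file>.dir")  # placeholder
```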
Reproduce
- `dvc pull`
- (possibly) a second `dvc pull` if the previous pull only partially succeeded
- `dvc status` (everything is up to date)
- workflow identifies a corrupted file
- a manual re-hash of the file within the dvc cache does not match its associated filename
Expected
- `dvc status` warns the user that the dvc-tracked .dir contents have changed and that another `dvc pull` is required
- and/or the md5sums of files within the dvc cache match their filenames (at least checked once on the initial pull of the data)
- feature request: a flag to force a recheck of hashes for the `dvc status` command
Environment information
- (mini)conda version: 4.11.0
- python version: 3.10.9
- dvc version: 2.43.4
- Windows 10
Output of `dvc doctor`:
$ dvc doctor
DVC version: 2.43.4 (pip)
-------------------------
Platform: Python 3.10.9 on Windows-10-10.0.19044-SP0
Subprojects:
dvc_data = 0.37.8
dvc_objects = 0.19.3
dvc_render = 0.1.0
dvc_task = 0.1.11
dvclive = 2.0.2
scmrepo = 0.1.7
Supports:
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
s3 (s3fs = 2023.6.0, boto3 = 1.26.76)
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: s3
Workspace directory: NTFS on D:\
Repo: dvc, git
Additional information
Unfortunately I cannot create a situation that reliably reproduces this corruption issue; it seems to arise from connectivity/network issues within our organization. I also have not encountered it on Mac/Linux (I'm submitting this on behalf of a colleague using Windows), so perhaps it is OS-specific. This may not be a true DVC bug, but we are seeking some guidance on how to avoid/detect such corruption and have not found a good resource. We have found the setting for verification of the remote data, i.e.:
dvc remote modify myrepo verify true
which, as I understand it, will force the re-hash locally and likely detect our problem, but it does seem that a manual alternative such as `dvc status --verify` should be available?
Also, just note that `md5sum` output will not always match the DVC md5, depending on the file you are tracking. In DVC 2.x, DVC tries to determine whether or not the file contains text content, and in the event DVC thinks it is a text file it does a dos2unix conversion on the data before computing the MD5 hash (see https://github.com/iterative/dvc/issues/4658 for more details).

Enabling the `verify` option for the remote is the intended solution here. There is currently no way to force this kind of check for the local cache, but we can consider adding something like this in the future.
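For illustration, a toy sketch (not DVC's actual code path) of why the two hashes can differ for a text file with CRLF line endings:

```python
import hashlib

# For files DVC 2.x classifies as text, CRLF is normalized to LF before hashing,
# so the result can differ from what `md5sum` reports on the raw bytes.
raw = b"line one\r\nline two\r\n"

plain_md5 = hashlib.md5(raw).hexdigest()                              # what `md5sum` prints
dos2unix_md5 = hashlib.md5(raw.replace(b"\r\n", b"\n")).hexdigest()   # DVC-2.x-style hash for text

print(plain_md5, dos2unix_md5, plain_md5 == dos2unix_md5)             # the hashes differ
```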
For the record: had a meeting with @bardsleypt today. I got a pretty good idea of the layout/size of data that we are dealing with here and I’ll try to reproduce with minio + windows, just to see if it is feasible.
@efiop Sure, we can have a call. I’m a bit tied up early on this week, perhaps Thursday 6/29 or Friday 6/30? I’m free in the middle of the day 10am - 2pm (MDT) both days.
My colleague did look into the corruption further and found the following:
I have some screenshots I can share with you when we have a call to clarify, but maybe this information is helpful in the meantime. Let me know what time works for you for a call and we can go from there.
@efiop no problem.
Looking at the URL for our remote, it is actually S3-compatible, not plain AWS S3. It is https://***minio.ad.***.com, so `minio`; perhaps that is part of the issue?

The corruption (at least the only one we've encountered so far) manifests as the .wav file seemingly missing a sample(s), causing the channels to permute. The content may be correct but is placed in the wrong position upon opening it within any waveform analyzer. In any case, it definitely results in a .wav file with incorrect channel information and incorrect content. I can drill into the specifics if it is helpful.
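A toy Python sketch of that symptom as I read it (an illustration of interleaved stereo samples, not an analysis of the actual file):

```python
# A stereo WAV stores samples interleaved as L0 R0 L1 R1 ...; dropping a single
# sample shifts every later sample by one slot, so left-channel data is read back
# on the right channel (and vice versa) from that point on.
interleaved = []
for i in range(4):
    interleaved += [("L", i), ("R", i)]

corrupted = interleaved[:3] + interleaved[4:]  # hypothetical: one sample lost in transfer

def deinterleave(samples):
    # split back into (left, right) the way a WAV reader would
    return samples[0::2], samples[1::2]

print(deinterleave(interleaved))  # channels line up
print(deinterleave(corrupted))    # channels permute after the dropped sample
```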
Ahh, sorry. I see that we are using `as_atomic` there (I thought we had moved it back to the respective filesystems). I thought it was DVC that was corrupting the cache due to connection issues. But yeah, reading closely, this is a feature request. Sorry for the noise. 😃
@skshetry I'm not following, s3fs uploads should still be atomic since an incomplete S3 `multipart_upload` does not actually replace the final object with a partial upload. The final object in the bucket is not created until the `complete_multipart_upload` request is sent. Is the concern here that you could end up with a corrupted or partial chunk in the multipart request?

Fetch atomicity should not depend on the underlying fs at all, since we handle downloading to the temporary path and then moving the resulting file on successful download ourselves in dvc-objects.
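For context, a minimal boto3 sketch of the multipart flow being described (hypothetical bucket/key names; this is the underlying S3 semantics, not how s3fs structures its calls internally):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "path/to/object"  # placeholders

# Start a multipart upload and push the parts (all but the last must be >= 5 MiB).
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for number, chunk in enumerate([b"a" * 5 * 1024 * 1024, b"tail"], start=1):
    resp = s3.upload_part(
        Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
        PartNumber=number, Body=chunk,
    )
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})

# Until this call succeeds, a GET on the key returns the previous object (or 404),
# never a partially uploaded one; an interrupted upload leaves the key untouched.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```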