dvc: dvc status: 'up-to-date' but cache is corrupted

Bug Report

dvc status: ‘up-to-date’ but cache is corrupted

Description

Context: on a Windows machine (git-bash), ‘dvc pull’ from an AWS S3 bucket occasionally fails due to various network connection problems. This appears to be the origin of the following problem:

Problem:

  • dvc pull (possibly only a partial pull that fails, with subsequent retries until ‘everything is up to date’)
  • dvc status outputs: ‘data and pipelines are up to date’
  • one file within a dvc-added directory is obviously corrupted (in this case a wav file with missing channel data)
  • tracking the associated .dir file through the dvc cache, we open the .json listing the md5sums of the associated files
  • we find the entry for the corrupted file and its recorded md5sum
  • we recompute the md5sum manually on the corrupted file within the cache and, as expected given the corruption, it differs from the filename it is stored under (which should itself be the md5sum); see the sketch after this list
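
A minimal sketch of that manual check, assuming the default DVC 2.x local cache layout (.dvc/cache/<first two hex chars>/<remaining 30 chars>, with the directory listing stored under its hash plus a .dir suffix). This is not a DVC API, just the standard library, and the hash passed in at the end is a placeholder:

# Re-hash every file listed in a tracked directory's .dir object and compare
# each result with the cache filename it is stored under.
# Assumption: default DVC 2.x cache layout under .dvc/cache; a plain MD5 is
# only valid for binary content (DVC 2.x hashes text files after a dos2unix pass).
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".dvc/cache")

def md5_of_file(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dir_listing(dir_hash):
    # dir_hash is the value recorded in the .dvc file, e.g. "ab12....dir"
    listing = CACHE_DIR / dir_hash[:2] / dir_hash[2:]
    for entry in json.loads(listing.read_text()):
        expected = entry["md5"]
        cached = CACHE_DIR / expected[:2] / expected[2:]
        if not cached.exists():
            print(f"missing:   {entry['relpath']}")
        elif md5_of_file(cached) != expected:
            print(f"corrupted: {entry['relpath']} (cache file {cached})")

verify_dir_listing("0123456789abcdef0123456789abcdef.dir")  # placeholder hash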

Reproduce

  1. dvc pull
  2. (possibly) dvc pull again if the previous pull only partially succeeded
  3. dvc status (everything is up to date)
  4. Workflow identifies a corrupted file
  5. Manual re-hash of file within dvc-cache does not match its associated filename

Expected

  1. dvc status warns the user that the dvc-tracked .dir has changed and that another dvc pull is required
  2. and/or the md5sums of files within the dvc cache match their filenames (at least checked once on the initial pull of the data)
  3. feature request: a flag to force a recheck of hashes for the ‘dvc status’ command

Environment information

  • (mini)conda version: 4.11.0
  • python version: 3.10.9
  • dvc version: 2.43.4
  • Windows 10

Output of dvc doctor:

$ dvc doctor

DVC version: 2.43.4 (pip)
-------------------------
Platform: Python 3.10.9 on Windows-10-10.0.19044-SP0
Subprojects:
        dvc_data = 0.37.8
        dvc_objects = 0.19.3
        dvc_render = 0.1.0
        dvc_task = 0.1.11
        dvclive = 2.0.2
        scmrepo = 0.1.7
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.26.76)
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: s3
Workspace directory: NTFS on D:\
Repo: dvc, git

Additional information

Unfortunately I cannot construct a situation that reliably reproduces this corruption issue; it seems to arise from connectivity/network issues within our organization. I also have not encountered it on Mac/Linux (I’m submitting this on behalf of a colleague using Windows), so perhaps it is OS-specific. This may not be a true DVC bug, but we are seeking guidance on how to avoid/detect such corruption and have not found a good resource. We have found the setting for verification of the remote data, i.e.:

dvc remote modify myrepo verify true

which, as I understand it, will force a local re-hash and would likely detect our problem. Still, it seems that a manual alternative such as dvc status --verify should be available?
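
For completeness, here is a rough sketch of what such a manual verification could look like outside of DVC: walk the local cache and re-hash every blob against its filename. This is not a DVC command or API, just an illustration, and it assumes the cache lives at .dvc/cache with plain-MD5 filenames (a raw MD5 will not match for files DVC 2.x treated as text, as a maintainer notes below):

# Manual audit of the local cache: recompute the MD5 of every cached blob and
# compare it with the two-level path it is stored under. Illustration only.
import hashlib
from pathlib import Path

def audit_cache(cache_dir=Path(".dvc/cache")):
    for path in cache_dir.rglob("*"):
        # skip non-files, .dir listings, and anything that is not a <2>/<30> hex path
        if not path.is_file() or path.suffix == ".dir":
            continue
        if len(path.parent.name) != 2 or len(path.name) != 30:
            continue
        expected = path.parent.name + path.name
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected:
            print(f"mismatch: {path} (recomputed {h.hexdigest()})")

audit_cache()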

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

Also, just note that md5sum output will not always match the DVC md5, depending on the file you are tracking. In DVC 2.x, DVC tries to determine whether the file contains text content, and if DVC thinks it is a text file, it does a dos2unix conversion on the data before computing the MD5 hash (see https://github.com/iterative/dvc/issues/4658 for more details).
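
For illustration only (DVC’s actual text/binary detection heuristic is more involved and is not reproduced here), this is the kind of difference that conversion produces:

# Why a plain md5sum can disagree with the hash DVC 2.x stores for text files:
# DVC normalizes CRLF to LF before hashing content it classifies as text.
import hashlib

data = b"line one\r\nline two\r\n"  # CRLF line endings, as commonly written on Windows

print(hashlib.md5(data).hexdigest())                          # what md5sum reports on the raw bytes
print(hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest())  # after a dos2unix-style conversion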

Enabling the verify option for the remote is the intended solution here.

a manual alternative such as dvc status --verify should be available?

There is currently no way to force this kind of check for the local cache, but we can consider adding something like this in the future.

For the record: had a meeting with @bardsleypt today. I got a pretty good idea of the layout/size of data that we are dealing with here and I’ll try to reproduce with minio + windows, just to see if it is feasible.

@efiop Sure, we can have a call. I’m a bit tied up early on this week, perhaps Thursday 6/29 or Friday 6/30? I’m free in the middle of the day 10am - 2pm (MDT) both days.

My colleague did look into the corruption further and found the following:

  • 4 channel .wav file
  • Partway through the file (in time, call it t = t*) the corruption occurs
  • After t* the channels permute by 1 and restart from t=0
    • Channels (1, 2, 3, 4) -> (2, 3, 4, 1)
    • E.g., info in channel 2 in the interval [0, t*] becomes the info in channel 1 in the interval [t*, end of file]
  • File size itself is larger but the length of the ‘audio data’ is the same (the wave header seems to come through fine)
    • Likely all channel information is intact within the file, but the wav-file ordering/compression is corrupted (see the quick header-vs-size check after this list)
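
For reference, a quick check along these lines can flag the header-vs-size symptom; this is a hypothetical helper using Python’s wave module, not something DVC provides, and the filename is a placeholder:

# Compare the audio length declared by the WAV header with the size on disk.
# A healthy PCM file is roughly the header plus the data chunk, so a large
# surplus hints at extra or misplaced bytes like the case described above.
import os
import wave

def check_wav(path):
    with wave.open(path, "rb") as w:
        declared = w.getnframes() * w.getnchannels() * w.getsampwidth()
    on_disk = os.path.getsize(path)
    print(f"{path}: declared audio bytes={declared}, file size={on_disk}, surplus={on_disk - declared}")

check_wav("example.wav")  # placeholder filename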

I have some screenshots I can share with you when we have a call to clarify, but maybe this information is helpful in the meantime. Let me know what time works for you for a call and we can go from there.

@efiop no problem.

Looking at the URL for our remote, it is actually an S3-compatible service, not plain AWS S3. It is https://***minio.ad.***.com, i.e. MinIO, so perhaps that is part of the issue?

The corruption (at least the only one we’ve encountered so far) manifests as the .wav file seemingly missing a sample or samples, causing the channels to permute. The content may be correct but ends up in the wrong place when the file is opened in any waveform analyzer. In any case, it definitely results in a .wav file with incorrect channel information and incorrect content. I can drill into the specifics if it is helpful.

Ahh, sorry. I see that we are using as_atomic there (I thought we had moved it back to respective filesystems).

the dvc-pull fails occasionally due to various network connection problems

I thought it was DVC that was corrupting cache due to connection issues. But yeah, reading closely, this is a feature request. Sorry for the noise. 😃

@skshetry I’m not following, s3fs uploads should still be atomic since incomplete s3 multipart_upload does not actually replace the final object with a partial upload. The final object in the bucket is not created until the complete_multipart_upload request is sent. Is the concern here that you could end up with a corrupted or partial chunk in the multipart request?

fetch atomicity should not depend on the underlying fs at all, since we handle downloading to a temporary path and then moving the resulting file into place on successful download ourselves in dvc-objects.
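
For context, a sketch of the pattern being described (download to a temporary path, then move the file into place only after the transfer completes); this is illustrative only, not the dvc-objects implementation:

# "Download to a temp path, then atomically move into place" in miniature.
import os
import tempfile
import urllib.request

def atomic_download(url, dest):
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as resp:
            while True:
                chunk = resp.read(1 << 20)
                if not chunk:
                    break
                tmp.write(chunk)
        # The final path only appears after a complete download; a failed
        # transfer leaves at worst a stray *.tmp file, never a partial dest.
        os.replace(tmp_path, dest)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise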