dvc: DVC SSH cache of folder has a bug

Hi, I have just tried out the new SSH cache-of-a-folder feature. However, I get an error saying that the cache does not exist for the folder.

First, I reproduce the stage while the output folder does not yet exist, so the cache is created, as seen in the verbose log below:

DEBUG: Computed stage 'extract_sathub_dataset.dvc' md5: '2eb5b77a4f820a5e53ea0b5ddca6d529'
DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir' expected 'c656521fb574b12bdde4cfc12b6d3bf5.dir' actual 'c656521fb574b12bdde4cfc12b6d3bf5'
DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual 'None'
Saving 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_users/fogh/DNT/sathub_datasets/' to 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir'.
DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual 'None'
DEBUG: Created 'copy': ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa -> ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_users/fogh/DNT/sathub_datasets/3501306.log
DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a' expected '947fd0c5d28f7e3d9f7d0f349092944a' actual 'None'
DEBUG: Created 'copy': ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a -> ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_users/fogh/DNT/sathub_datasets/3501306.nc
DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c' expected '222da66f14e694e6b14e86e771771c1c' actual 'None'
--------[the same log for all of the other files in the folder]-----------

However, when reproducing the stage again, the folder is reported as not being in the cache, as seen in the verbose log below.

DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/c6/56521fb574b12bdde4cfc12b6d3bf5.dir' expected 'c656521fb574b12bdde4cfc12b6d3bf5.dir' actual 'c656521fb574b12bdde4cfc12b6d3bf5'
DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/38/b2609acfc2b2f24a998a3f9e290aaa' expected '38b2609acfc2b2f24a998a3f9e290aaa' actual '38b2609acfc2b2f24a998a3f9e290aaa'
DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/94/7fd0c5d28f7e3d9f7d0f349092944a' expected '947fd0c5d28f7e3d9f7d0f349092944a' actual '947fd0c5d28f7e3d9f7d0f349092944a'
DEBUG: cache 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c' expected '222da66f14e694e6b14e86e771771c1c' actual '0aed37f9195162d51ed798b9ac3eb082'
WARNING: corrupted cache file 'ssh://fogh@ahsoka.vfltest.dk/scratch/dvc_project_cache/DNT/22/2da66f14e694e6b14e86e771771c1c'.
WARNING: Output 'remote://ahsoka_project_data/sathub_datasets/' of 'extract_sathub_dataset.dvc' changed because it is 'not in cache'
WARNING: Stage 'extract_sathub_dataset.dvc' changed.

I suspect that the “WARNING: corrupted cache file” message is the source of the bug here, but I do not know why the cache becomes corrupted.

My dvc version is

(py37_v7) fogh@ubuntuVM:~/work/seges/the_digital_national_trials/dvc_pipeline$ dvc version
DVC version: 0.60.0
Python version: 3.7.4
Platform: Linux-5.0.0-29-generic-x86_64-with-debian-buster-sid
Binary: False
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/sda1')

As seen in these Discord messages https://discordapp.com/channels/485586884165107732/563406153334128681/627079206760874005, I have also tried version 0.54.1, but I get the same error.

I know this bug report is difficult to follow, so I welcome any questions on this issue, and I am also open to a call on Discord.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 20 (17 by maintainers)

Most upvoted comments

Copying won’t help in the general case anyway, as several processes might change the output concurrently, or someone might just get in between. Documenting this, and maybe providing a HINT as @mroutis suggested, is as much as we can do.

Ultimately we rely on users and their code to behave 😉

We can add an entry to the documentation about the dangers of having two processes work on the same file. It could also be a HINT under the cache-corruption warning.

@jorgeorpinel But it would also bring copying overhead in terms of the time it takes. Currently we do a move, which takes no time if we are moving within the same FS, whereas copying will always be slow. But even if we sacrificed that, your workers might keep writing to the workspace file, so the next time you check out you might get an incomplete file from the cache. So you still need to ensure that your program behaves well.
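
For illustration only, a minimal sketch of the trade-off just described (save_to_cache and its arguments are hypothetical, not DVC's implementation):

import os
import shutil

def save_to_cache(src, cache_dst, use_copy=False):
    # Hypothetical helper contrasting the two strategies discussed above.
    os.makedirs(os.path.dirname(cache_dst) or ".", exist_ok=True)
    if not use_copy:
        # Current behaviour: a move. Within one filesystem this is just a
        # rename of the directory entry, O(1) regardless of file size, but
        # the inode stays the same, so a process that still has the file
        # open keeps writing into what is now the cache entry.
        os.replace(src, cache_dst)
    else:
        # Copying always rewrites every byte, which is slow for large
        # outputs, and the workspace file left behind can still be modified
        # afterwards, so it may drift from the cached checksum.
        shutil.copy2(src, cache_dst)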

We probably need a concept of a well-behaved command:

  • only reads from specified deps
  • only writes to specified outs
  • completely rewrites outs (no appending, editing, etc)
  • stops reading/writing on command exit

Explain why this is important and then link to it from the commands (a rough sketch of such a command is below).
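
For illustration, a minimal sketch of a stage command that follows these rules (the script, file names, and filtering logic are all made up):

import json
import os
import tempfile

def run_stage(dep_path="data/input.json", out_path="data/output.json"):
    # Only read from the declared dependency.
    with open(dep_path) as f:
        records = json.load(f)

    result = [r for r in records if r.get("valid")]

    # Completely rewrite the declared output: write to a temp file in the
    # same directory, then atomically replace it (no appending or in-place
    # editing), and make sure the handle is closed before the command exits.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(out_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(result, f)
    os.replace(tmp_path, out_path)

if __name__ == "__main__":
    run_stage()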

As @mroutis and @pared noticed, the copying approach would slow DVC down a lot, so we might want to consider a different approach. Maybe we could come up with another way? This needs research.

@PeterFogh Yeah, let’s not close it for now. This issue has brought up a very good point, so we will consider a better approach to moving files to the cache, one that will not be susceptible to these kinds of issues and a bunch of related ones 😃 Thank you!

Here is an update on how I got my DVC stage (i.e. Python script) to work. The problem was that the dask workers still had an open file handler for the log files, created as part of the dask.delayed jobs, even after the job was finished. I do not think the file content changed after the job was finished; however, I assume an open file and a closed file do not necessarily produce the same checksum?

My solution was simply to close the logging file handler manually in the dask.delayed job:

import logging

import dask

@dask.delayed
def task(...):  # the real arguments (including path_log) are omitted here
    # Attach a file handler to the root logger for this task.
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    fh = get_log_filehandler(path_log)  # project helper that returns a logging handler for path_log
    logger.addHandler(fh)

    # ... the actual code for the task ...

    # Close and detach every handler before the task returns, so no worker
    # keeps an open file descriptor to the log file.
    for handler in logger.handlers[:]:
        handler.close()
        logger.removeHandler(handler)
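
A variant of the same cleanup (just a sketch, using a plain logging.FileHandler instead of my get_log_filehandler helper) is to wrap the handler in a context manager, so it is closed even if the task raises:

import logging
from contextlib import contextmanager

import dask

@contextmanager
def log_to_file(path_log, level=logging.DEBUG):
    # Attach a file handler for the duration of the block and always close
    # and detach it on exit, so no worker keeps an open fd to the log file.
    logger = logging.getLogger()
    logger.setLevel(level)
    fh = logging.FileHandler(path_log)
    logger.addHandler(fh)
    try:
        yield logger
    finally:
        fh.close()
        logger.removeHandler(fh)

@dask.delayed
def task(path_log):
    with log_to_file(path_log):
        ...  # the actual code for the task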

@efiop, will you decide if the issue can be closed? 😃

Also, currently we move things to the cache as we go, which might be fragile if the destination doesn’t have enough storage space left. What we should really do is copy everything we need into the cache and only then delete the sources, so that if we break at some point, at least the original files stay intact. That approach would temporarily use additional storage space, but I think it is totally worth it.
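
A rough sketch of that order of operations (move_to_cache is a hypothetical helper, not DVC's code):

import os
import shutil

def move_to_cache(sources, cache_paths):
    # First copy every file into the cache; the originals stay intact, so a
    # failure half-way (e.g. the cache running out of space) loses nothing.
    copied = []
    try:
        for src, dst in zip(sources, cache_paths):
            os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(dst)
    except OSError:
        # Roll back the partial transfer and keep the sources untouched.
        for dst in copied:
            os.remove(dst)
        raise
    # Only delete the sources once everything has been cached successfully.
    for src in sources:
        os.remove(src)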

For the record: https://discordapp.com/channels/485586884165107732/563406153334128681/628509561644646400

So what happened here is that files in the output directory were still being written to even after the script finished running, probably due to an incorrectly implemented shutdown procedure, or maybe that is just how dask works and it is dask’s issue. So when dvc went on to save them, it sftp.rename'd them, and the workers still had valid fds pointing to the renamed files that were now in the cache. When dvc status was subsequently called, it detected that one log file had an incorrect checksum and deleted it as corrupted, which triggered the other workers to fail and corrupt the cache for the other log files, since they were also still attached to those.
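
The underlying mechanism is easy to reproduce outside of DVC on any POSIX filesystem (a self-contained demo with made-up file names, not DVC code):

import os

# A worker opens a log file and keeps the handle around.
fh = open("worker.log", "w")
fh.write("first line\n")
fh.flush()

# The file gets renamed into a "cache" location; the open fd follows the inode.
os.rename("worker.log", "cache_entry")

# The worker keeps writing through the old fd, so the bytes land in the
# renamed file, i.e. in the cache entry, invalidating its checksum.
fh.write("late write after the rename\n")
fh.close()

print(open("cache_entry").read())  # contains both lines
os.remove("cache_entry")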

This issue might be relevant not only for ssh but for local caches too, as we use shutil.move there, which might rename the file, causing the same type of issue. What we need to do is change the inode when we move files to the cache, so that no other process can keep access to our cache files.
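
For illustration, one possible shape of that fix (a hypothetical sketch, not the actual change):

import os
import shutil

def save_with_new_inode(src, cache_dst):
    # Write the data into a brand-new cache file (a new inode) instead of
    # renaming the workspace file into the cache. A process that still holds
    # an fd for the workspace file keeps writing to the old, now unlinked,
    # inode and can no longer touch the cache entry.
    os.makedirs(os.path.dirname(cache_dst) or ".", exist_ok=True)
    shutil.copy2(src, cache_dst)
    os.remove(src)
    # The workspace path would then be checked out from the cache again,
    # like the "Created 'copy'" lines in the log above.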