dvc: Pull extremely slow on ~400GB of data with hot DVC cache
OS: Docker image based on tensorflow/tensorflow:latest-gpu-py3, with Python 3.7.5 as the system Python.
Initial setup:

```
# python --version
Python 3.7.5
# python -m venv .profiling_venv
# source .profiling_venv/bin/activate
# pip install dvc yappi
...
# dvc --version
0.82.6
```
DVC cache configuration:

```
# dvc config cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --local cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --global cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --system cache.dir
/ssd/.../dvc/cache
# cat /etc/xdg/dvc/config
[cache]
dir = /ssd/.../dvc/cache
protected = true
type = "hardlink,symlink"
```
Please note that the DVC cache is hot; in other words, most if not all files for `dvc fetch` are already present and up to date at `/ssd/...`.
Make a fresh clone and profile `dvc pull`:

```
# git clone ssh://... repo
...
# cd repo
# yappi -f callgrind -o dvc_pull.prof -s $(which dvc) pull
...
```
This `dvc pull`, uninstrumented, usually takes 40+ minutes with a hot DVC cache.
Count the number of DVC-tracked files (symlinks, per the config above) and the total size of the repo:

```
# find . -type l | wc -l
29003
# du -shL .
403G .
```
Looking at `dvc_pull.prof` in KCachegrind suggests that the bottleneck is the checksumming process. The `file_md5` and `dos2unix` functions in `utils/__init__.py` appear particularly costly.
Is this a known issue? Would the primary authors of DVC entertain a more performant version of `file_md5` (perhaps written in C/C++ and without TQDM integration)?
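For illustration, a dos2unix-free chunked checksum could look like the sketch below. This is not DVC's actual implementation; the function name and the 1 MB chunk size are assumptions, and it deliberately skips the text/binary detection and line-ending normalization that the profile flags as costly:

```python
import hashlib


def file_md5_plain(path, chunk_size=2**20):
    """Compute a file's MD5 by streaming fixed-size chunks.

    Sketch only: unlike DVC's file_md5, this does no text/binary
    detection and no line-ending normalization, so every byte of
    the file is read and hashed exactly once.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        # iter() with a b"" sentinel stops when read() hits EOF
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Usage: `file_md5_plain("data.bin")` returns the hex digest, matching `md5sum` output for the same file.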
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 37 (22 by maintainers)
Commits related to this issue
- dvc: use protected mode by default Fixes #3261 Fixes #2041 — committed to efiop/dvc by efiop 4 years ago
- dvc: use protected mode by default (#3472) * dvc: use protected mode by default Fixes #3261 Fixes #2041 * unprotect: adjust help message https://github.com/iterative/dvc.org/pull/1058/files... — committed to iterative/dvc by efiop 4 years ago
@efiop This test replicates @kevlar1818's situation:
So to me it seems that with each new instance of the Docker image, MD5s are unnecessarily recalculated. The "trust" flag seems like the way to go.
There is no way for us to tell if the cache is shared or not.
I've also mentioned a `protected`-based approach, where we set read-only permissions on every cache file we add; if a file is still read-only later, we assume it is not corrupted and trust its hash. That way we won't need special options.

@pared mentioned that what is happening here could be reproduced with a simple `dvc checkout`, because the cache is already there and there is no need to pull it. The issue with that is that `.dvc/state` is empty, so DVC needs to verify all the new cache files coming from an external cache dir. We do trust things we download during pull, but we don't trust local caches, as they might be corrupted much more easily. Need to verify that this is what is going on here and, if so, consider adding a config option to trust the cache dir (but there might be a better solution too; need to think about it).

@dvaldivia The MinIO instance is not running in distributed mode.
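The read-only trust heuristic described in the comment above could be sketched roughly as follows. The helper names `protect` and `is_trusted` are hypothetical, not DVC's API; the idea is simply that a cache file whose write bits are all still cleared has presumably not been modified since it was added:

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def protect(path):
    """Strip all write bits after adding a file to the cache."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~WRITE_BITS)


def is_trusted(path):
    """Trust the stored hash only if the cache file is still read-only.

    If any write bit has reappeared, someone may have modified the
    file, so its checksum must be re-verified before reuse.
    """
    return not (os.stat(path).st_mode & WRITE_BITS)
```

The appeal of this design is that it needs no extra config: the permission bits themselves carry the "untouched since add" signal across fresh clones and containers sharing the same cache.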
@efiop I’ll try to reproduce the issue with a minimal example first, and then we can revisit having a debugging session. Should I open a separate Github issue concerning the misclassification of binary files as text files? I still think that’s another facet to this that could/should be addressed regardless of this bug.
@efiop As for a video call, let me get back to you on that. I’d have to carve out some time at work.
Looking into this further, it appears DVC is wrongly treating the binary files as text files (via `istextfile.py`). This leads to the huge number of `dos2unix` calls mentioned above. In my case, the files have large text headers followed by binary data.

@casperdcl You're right. At first I was worried about Python's overhead given the sheer number of function calls apparent in the profiling. I didn't do a great job summarizing my subsequent findings, so I'll share them here for the benefit of others (e.g. @efiop, @mroutis, @pared).
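For context, a git-style text/binary heuristic samples only the first bytes of a file, which is presumably why a long text header fools it. A simplified sketch (the sample size, character set, and threshold here are assumptions, not DVC's exact `istextfile.py` values) makes the failure mode concrete:

```python
# Printable ASCII plus common whitespace/control characters
TEXT_CHARS = bytes(range(32, 127)) + b"\n\r\t\f\b"


def looks_like_text(path, sample_size=512):
    """Heuristic text check in the style of git's binary detection.

    Only the first `sample_size` bytes are inspected, so a file whose
    text header is longer than the sample is classified as text even
    if binary data follows it.
    """
    with open(path, "rb") as fobj:
        sample = fobj.read(sample_size)
    if not sample:
        return True  # treat empty files as text
    if b"\x00" in sample:
        return False  # a NUL byte strongly suggests binary
    # Mostly printable characters => call it text
    nontext = sample.translate(None, TEXT_CHARS)
    return len(nontext) / len(sample) <= 0.30
```

A file beginning with, say, 700 bytes of ASCII header followed by raw binary data passes this check, sending the whole file down the `dos2unix` path.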
I did some basic benchmarking of `md5sum` and DVC's `file_md5`, and the two implementations were actually quite comparable in runtime. On a 5GB binary file, `file_md5` was only perhaps 10% slower than `md5sum`. The story changes when comparing `md5sum` and `file_md5` on a text file; `file_md5` unfailingly takes twice as long to complete. The reason is fairly obvious: the `dos2unix` function in `dvc/utils/__init__.py` calls `.replace` on every chunk of the text file, so `file_md5` is effectively reading every byte in the file twice. This is why I opened #3264, which I'm still hoping will be merged (either with a 1MB or 8KB chunk size) regardless of whether this apparent bug gets fixed.

@efiop I'll have to re-run with profiling enabled on Monday.
@pared Rerunning pull on the same repo (not a new clone) seems to produce the same runtime as the first pull.
The hope is that the repo can be cloned in a new Docker container whenever a developer wants to work on the project, so I'd bin our use case in your first description above. That being said, all developers share the `/ssd` cache.

@efiop Seems like a bug to me: `dos2unix` should not be called at all, as it's invoked only by `file_md5`, which supposedly should not happen after https://github.com/iterative/dvc/pull/3200. Looking into that.