dvc: Pull extremely slow on ~400GB of data with hot DVC cache
OS: Docker image based on tensorflow/tensorflow:latest-gpu-py3, with Python 3.7.5 as the system Python.
Initial setup:

```
# python --version
Python 3.7.5
# python -m venv .profiling_venv
# source .profiling_venv/bin/activate
# pip install dvc yappi
...
# dvc --version
0.82.6
```
DVC cache configuration:

```
# dvc config cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --local cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --global cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --system cache.dir
/ssd/.../dvc/cache
# cat /etc/xdg/dvc/config
[cache]
dir = /ssd/.../dvc/cache
protected = true
type = "hardlink,symlink"
```
Please note that the DVC cache is hot; in other words, most if not all files for `dvc fetch` are already present and up to date at `/ssd/...`.
Make a fresh clone and profile `dvc pull`:

```
# git clone ssh://... repo
...
# cd repo
# yappi -f callgrind -o dvc_pull.prof -s $(which dvc) pull
...
```
This `dvc pull`, uninstrumented, usually takes 40+ minutes with a hot DVC cache.
Count the number of DVC-tracked files (symlinks, per the config above) and the total size of the repo:

```
# find . -type l | wc -l
29003
# du -shL .
403G .
```
Looking at `dvc_pull.prof` in KCachegrind suggests that the bottleneck is the checksumming process. The `file_md5` and `dos2unix` functions in `utils/__init__.py` appear particularly costly.
Is this a known issue? Would the primary authors of DVC entertain a more performant version of `file_md5` (perhaps written in C/C++ and without TQDM integration)?
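For illustration, a dos2unix-free chunked checksum could look like the sketch below. This is not DVC's actual implementation; the function name and the 1 MB chunk size are assumptions, and it deliberately skips the text/binary detection and line-ending normalization that the profile flags as costly:

```python
import hashlib


def file_md5_plain(path, chunk_size=2**20):
    """Compute a file's MD5 by streaming fixed-size chunks.

    Sketch only: unlike DVC's file_md5, this does no text/binary
    detection and no line-ending normalization, so every byte of
    the file is read and hashed exactly once.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        # iter() with a b"" sentinel stops when read() hits EOF
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Usage: `file_md5_plain("data.bin")` returns the hex digest, matching `md5sum` output for the same file.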
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 37 (22 by maintainers)
Commits related to this issue
- dvc: use protected mode by default Fixes #3261 Fixes #2041 — committed to efiop/dvc by efiop 4 years ago
- dvc: use protected mode by default (#3472) * dvc: use protected mode by default Fixes #3261 Fixes #2041 * unprotect: adjust help message https://github.com/iterative/dvc.org/pull/1058/files... — committed to iterative/dvc by efiop 4 years ago
@efiop This test replicates @kevlar1818's situation:
So to me it seems that with each new instance of the Docker image, MD5s are unnecessarily recalculated. The "trust" flag seems like the way to go.
There is no way for us to tell if the cache is shared or not.
I've also mentioned a `protected`-based approach, where we set read-only permissions on every cache file we add; if a file is still read-only later, we assume it is not corrupted and trust its hash. That way we won't need special options.

@pared mentioned that what is happening here could be reproduced with a simple `dvc checkout`, because the cache is already there and there is no need to pull it. The issue with that is that `.dvc/state` is empty, so DVC needs to verify all the new cache files coming from an external cache dir. We do trust things we download during pull, but we don't trust local caches, as they might be corrupted much more easily. Need to verify that this is what is going on here and, if so, consider adding a config option to trust the cache dir (but there might be a better solution too; need to think about it).

@dvaldivia The MinIO instance is not running in distributed mode.
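The read-only trust heuristic described in the comment above could be sketched roughly as follows. The helper names `protect` and `is_trusted` are hypothetical, not DVC's API; the idea is simply that a cache file whose write bits are all still cleared has presumably not been modified since it was added:

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def protect(path):
    """Strip all write bits after adding a file to the cache."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~WRITE_BITS)


def is_trusted(path):
    """Trust the stored hash only if the cache file is still read-only.

    If any write bit has reappeared, someone may have modified the
    file, so its checksum must be re-verified before reuse.
    """
    return not (os.stat(path).st_mode & WRITE_BITS)
```

The appeal of this design is that it needs no extra config: the permission bits themselves carry the "untouched since add" signal across fresh clones and containers sharing the same cache.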
@efiop I’ll try to reproduce the issue with a minimal example first, and then we can revisit having a debugging session. Should I open a separate Github issue concerning the misclassification of binary files as text files? I still think that’s another facet to this that could/should be addressed regardless of this bug.
@efiop As for a video call, let me get back to you on that. I’d have to carve out some time at work.
Looking into this further, it appears DVC is wrongly treating the binary files as text files (via `istextfile.py`). This leads to the huge number of `dos2unix` calls mentioned above. In my case, the files have large text headers followed by binary data.

@casperdcl You're right. At first I was worried about Python's overhead given the sheer number of function calls apparent in the profiling. I didn't do a great job summarizing my subsequent findings, so I'll share them here for the benefit of others (e.g. @efiop, @mroutis, @pared).
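For context, a git-style text/binary heuristic samples only the first bytes of a file, which is presumably why a long text header fools it. A simplified sketch (the sample size, character set, and threshold here are assumptions, not DVC's exact `istextfile.py` values) makes the failure mode concrete:

```python
# Printable ASCII plus common whitespace/control characters
TEXT_CHARS = bytes(range(32, 127)) + b"\n\r\t\f\b"


def looks_like_text(path, sample_size=512):
    """Heuristic text check in the style of git's binary detection.

    Only the first `sample_size` bytes are inspected, so a file whose
    text header is longer than the sample is classified as text even
    if binary data follows it.
    """
    with open(path, "rb") as fobj:
        sample = fobj.read(sample_size)
    if not sample:
        return True  # treat empty files as text
    if b"\x00" in sample:
        return False  # a NUL byte strongly suggests binary
    # Mostly printable characters => call it text
    nontext = sample.translate(None, TEXT_CHARS)
    return len(nontext) / len(sample) <= 0.30
```

A file beginning with, say, 700 bytes of ASCII header followed by raw binary data passes this check, sending the whole file down the `dos2unix` path.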
I did some basic benchmarking of `md5sum` and DVC's `file_md5`, and the two implementations were actually quite comparable in runtime. On a 5GB binary file, `file_md5` was only perhaps 10% slower than `md5sum`. The story changes when comparing `md5sum` and `file_md5` on a text file; `file_md5` unfailingly takes twice as long to complete. The reason is fairly obvious: the `dos2unix` function in `dvc/utils/__init__.py` calls `.replace` on every chunk of the text file, so `file_md5` is effectively reading every byte in the file twice. This is why I opened #3264, which I'm still hoping will be merged (either with a 1MB or 8KB chunk size) regardless of whether this apparent bug gets fixed.

@efiop I'll have to re-run with profiling enabled on Monday.
@pared Rerunning pull on the same repo (not a new clone) seems to produce the same runtime as the first pull.
The hope is that the repo can be cloned in a new Docker container whenever a developer wants to work on the project, so I'd bin our use case in your first description above. That being said, all developers share the `/ssd` cache.

@efiop Seems like a bug to me: `dos2unix` should not be called at all, as it's invoked only by `file_md5`, which supposedly should not happen after https://github.com/iterative/dvc/pull/3200. Looking into that.