dvc: status: it's slow
Bug Report
Description
`status` seems slow.
Reproduce
$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc pull
$ dvc status
Cloning
It lingers there for a few seconds before changing to `Data and pipelines are up to date`.
Before 2.0 this was pretty much instantaneous. A few users have reported this on Discord BTW (2 in #q-and-a earlier today, one of them mentioned `add` being slow too). @efiop mentioned it could be related to the new Dulwich implementation.
Expected
Instantaneous report, esp. for such a simple project as in the example above.
Environment information
Output of `dvc doctor`:
$ dvc version
DVC version: 2.0.5
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-Ubuntu-18.04-bionic
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git
Additional Information (if any):
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 21 (13 by maintainers)
👋
I’m working with a huge dataset (5 TB), and running the `status` or `commit` commands is becoming a bit annoying. I would expect them to take a lot of time, but they have taken ages (I mean days to process), and before computing hashes they don’t give any output whatsoever, so I have no way of knowing whether the process is dead or not. So I decided to check if there is any way of improving the code, and I found this thread.
Profiling
I’ve profiled the `status` command as done above, but only for two different stages of my pipeline:
- The first one processes ~106 GB of data (~292370 files). It uses the LibriSpeech ASR dataset.
- The second processes ~50 GB of data (~268270 files). It uses the TED-LIUM dataset.
So I noticed two things:
Comparing times with Bash
I’ve done the comparison only for the TED-LIUM dataset (second process).
Compared to a plain `os.walk`, the traversal takes significantly less time than DVC’s: DVC --> 538 s / 60 ≈ 8.97 min.
Hash computation is more or less the same, depending on which number you take, but still above what can be accomplished: DVC --> 660 s / 60 = 11 min.
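For reference, here is a minimal sketch of how such a walk-and-hash comparison could be timed in pure Python. This is an illustration, not the exact commands used above: the directory path is a placeholder, and MD5 is used because that is the hash DVC computes for files.

```python
import hashlib
import os
import time


def walk_files(root):
    """Yield every file path under root, like a bare os.walk traversal."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def md5_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file by streaming it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    root = "data/ted-lium"  # placeholder path

    start = time.time()
    files = list(walk_files(root))
    print(f"walk: {len(files)} files in {time.time() - start:.1f}s")

    start = time.time()
    hashes = {path: md5_file(path) for path in files}
    print(f"md5:  {len(hashes)} hashes in {time.time() - start:.1f}s")
```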
Proposition
Hi there! After almost one month the process finished.
dump.prof.gz
From past experience, `async` is usually a little slower than multi-processing. So this week I worked on this issue again and tried to explore the DVC code in more depth.

I saw that there have been some improvements to the code and that you are now using fsspec. I also see that there is still a ThreadPoolExecutor; AFAIK this is to allow concurrent reads on multiple files and avoid being blocked by the read operation.

I was wondering if this could be improved by using cooperative multitasking instead. I’ve read that fsspec has an async option for some filesystems, mainly LocalFS, MemoryFS, and HTTPFS.

Furthermore, this could also scale with the available cores on the machine: one producer running `os.walk` and at least one consumer computing the hash values (more on this in this interesting talk); see the sketch below.

PS: For the local file system there is this interesting asyncio library. Also this one, but it doesn’t seem to be as popular.
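A minimal sketch of the producer/consumer layout described above, using plain `asyncio` and an `asyncio.Queue`. This is an illustration, not DVC’s actual implementation: the blocking file reads still go through the default thread pool via `run_in_executor`, and the path and worker count are placeholders.

```python
import asyncio
import hashlib
import os


def md5_file(path, chunk_size=1024 * 1024):
    """Blocking MD5 of one file; offloaded to the default thread pool below."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


async def produce(root, queue):
    """Producer: walk the tree and feed file paths into the queue."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            await queue.put(os.path.join(dirpath, name))
    await queue.put(None)  # sentinel: no more work


async def consume(queue, results):
    """Consumer: pull paths from the queue and hash them without blocking the loop."""
    loop = asyncio.get_running_loop()
    while True:
        path = await queue.get()
        if path is None:
            await queue.put(None)  # pass the sentinel on to the other consumers
            break
        results[path] = await loop.run_in_executor(None, md5_file, path)


async def main(root, workers=4):
    queue = asyncio.Queue(maxsize=1000)
    results = {}
    consumers = [consume(queue, results) for _ in range(workers)]
    await asyncio.gather(produce(root, queue), *consumers)
    return results


if __name__ == "__main__":
    hashes = asyncio.run(main("data/ted-lium"))  # placeholder path
    print(f"hashed {len(hashes)} files")
```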
Indeed
Yes, the cache is located on an NFS.
Actually, it is 12,601,130
Cool, that’s great.
@alealv If you could add `--cprofile-dump dump.prof` to any of those commands and share that file here, that would be enough for us to take a closer look.

I haven’t run the same tests, but I can say that I ran `dvc status` (for the whole 5 TB dataset) for more than 3 hours and it didn’t finish. Also, `dvc commit` ran for more than 5 days and I stopped it because I didn’t know whether it was stalled or not. It would be best if we coordinate on which tests to perform, or create a specific repo to reproduce this.
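For the profiling request above, attaching the flag to the `status` command from the reproduce steps would look like this (the `dump.prof` filename comes from the request itself):
$ dvc status --cprofile-dump dump.prof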
Ok, so it looks like the `status` issue is actually separate. It’s still a dulwich/git issue, but related to gitignore performance. `dvc status` in the DVC git repo:

The `add`/`repro` performance issue should maybe go into its own GH issue, as it’s related to the git status/git-track reminder (although gitignore performance probably also affects those commands as well).

It’s not related to cloning. It’s pretty much done whenever we load stages, and yes, it’s also what is affecting the `add` performance. Or at least, the issue that should be resolved by the linked PR is definitely the cause of the `add` and `repro` performance issues. There may be an unrelated `status` issue as well, though?

Based on the Discord discussion it’s probably because of https://github.com/iterative/dvc/pull/5544

We can revert the status check change for now; I’ll look into whether using pygit2’s status implementation is faster.
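As a rough illustration of the comparison being considered (not DVC’s code), a micro-benchmark of gitignore checks through dulwich’s `IgnoreFilterManager` versus pygit2’s `Repository.path_is_ignored` might look like the sketch below; the repo path is a placeholder and the helper names are made up for the example.

```python
import os
import time

import pygit2
from dulwich.ignore import IgnoreFilterManager
from dulwich.repo import Repo

REPO_PATH = "."  # placeholder: a git repo with many files and .gitignore entries


def collect_paths(root):
    """Working-tree paths relative to the repo root, in posix form, skipping .git."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            paths.append(rel.replace(os.sep, "/"))
    return paths


def bench(label, is_ignored, paths):
    """Time how long it takes to run the ignore check over every path."""
    start = time.time()
    ignored = sum(1 for path in paths if is_ignored(path))
    print(f"{label}: {ignored}/{len(paths)} ignored in {time.time() - start:.2f}s")


if __name__ == "__main__":
    paths = collect_paths(REPO_PATH)

    dulwich_ignores = IgnoreFilterManager.from_repo(Repo(REPO_PATH))
    bench("dulwich", dulwich_ignores.is_ignored, paths)

    pygit2_repo = pygit2.Repository(REPO_PATH)
    bench("pygit2", pygit2_repo.path_is_ignored, paths)
```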