dvc: status: it's slow

Bug Report

Description

status seems slow.

Reproduce

$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc pull
$ dvc status
Cloning

It lingers there for a few seconds before changing to "Data and pipelines are up to date".

Before 2.0 this was pretty much instantaneous. A few users have reported this on Discord BTW (2 in #q-and-a earlier today, one of them mentioned add being slow too). @efiop mentioned it could be related to the new Dulwich implementation.

Expected

Instantaneous report, esp. for such a simple project as in the example above.

Environment information

Output of dvc doctor:

$ dvc version
DVC version: 2.0.5
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-Ubuntu-18.04-bionic
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git

Additional Information (if any):

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 21 (13 by maintainers)

Most upvoted comments

👋

I'm working with a huge dataset (5 TB), and running the status or commit command is becoming a bit annoying. I would expect them to take a long time, but they have taken ages (days, really), and before computing hashes they give no output whatsoever, so I have no way of knowing whether the process is dead or still working.

So I decided to check if there is any way of improving the code and I found this thread.

Profiling

I’ve profiled the status command as done above but only for two different stages of my pipeline:

  1. The first one processes ~106 GB of data (~292,370 files). It uses the LibriSpeech ASR dataset. (Profiling screenshots attached: profiling_librispeech_1, profiling_librispeech_2.)

  2. The second one processes ~50 GB of data (~268,270 files). It uses the TED-LIUM dataset. (Profiling screenshots attached: profiling_1, profiling_2.)

So I noticed two things:

  1. Most of the time is spent waiting on a lock (code section)
  2. The other big chunk is due to os.walk (code section)

Comparing times with Bash

I’ve done the comparison only for the TED-LIUM dataset (second process)

  • List files:
aalvarez@ml1:/mnt/data/dvc/tmp/data/tedlium_waves$ time ls . > /dev/null

real    0m2,656s
user    0m0,607s
sys     0m0,350s

aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f > /dev/null

real    0m26,133s
user    0m0,222s
sys     0m3,540s

So, compared to DVC's os.walk-based traversal (538 s / 60 ≈ 8.97 min), this takes significantly less time.

  • Computing hashes
aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f -exec md5sum {} \; > /dev/null

real    12m4,800s
user    5m2,927s
sys     3m47,566s

aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f | parallel --progress md5sum > /dev/null

Computers / CPU cores / Max jobs to run
1:local / 36 / 36

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/268263/100%/0.0s

real    7m38,746s
user    10m26,780s
sys     9m43,082s

And hash computation is more or less comparable, depending on which number you take, but DVC (660 s / 60 = 11 min) is still above what could be accomplished.

Propositions

  1. True parallelism would come from using a multiprocessing.Pool instead of a ThreadPoolExecutor; I also believe it would be best to use imap_unordered.
  2. Maybe this is going too deep, but looking at the os.walk code, it first appends everything to a list and only then returns. Wouldn't it be better to use scandir() and yield entries as they are found? (A rough sketch combining both ideas follows below.)
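
As a rough illustration of both propositions (a sketch only, not DVC's actual code; iter_files, md5_file, and the "data" directory are made up for the example), a scandir-based generator can feed a multiprocessing.Pool through imap_unordered:

import hashlib
import os
from multiprocessing import Pool


def iter_files(root):
    # Yield file paths lazily with os.scandir instead of building a full list first.
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


def md5_file(path, chunk_size=1024 * 1024):
    # Hash one file in chunks so large files are not loaded into memory at once.
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


if __name__ == "__main__":
    # imap_unordered hands back results as soon as any worker finishes,
    # so progress can be reported without waiting for the slowest file.
    with Pool() as pool:
        for path, md5 in pool.imap_unordered(md5_file, iter_files("data"), chunksize=64):
            print(md5, path)

With imap_unordered the result order is not preserved, which is fine for hashing since each result carries its own path.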

Hi there! After almost one month the process finished.

dump.prof.gz

So, this week I worked again with this issue and I tried to explore in more depth DVC code.

I saw that there have been some improvements to the code and that you are now using fsspec. I also see that there is still a ThreadPoolExecutor; AFAIK this is to allow concurrent reads of multiple files and to avoid being blocked by a single read operation.

I was wondering if this could be improved by using cooperative multitasking instead. I've read that fsspec has an async option for some filesystems, mainly LocalFS, MemoryFS, and HTTPFS.

Furthermore, this could also scale with the available cores on the machine: one producer running os.walk and at least one consumer computing the hash values (more on this in this interesting talk).

PS: For the local file system there is this interesting asyncio library. There is also this one, but it doesn't seem to be as popular.
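
To make the producer/consumer idea concrete, here is a rough, hypothetical sketch (not DVC code; produce, consume, and the "data" directory are invented for the example): one task walks the tree and feeds an asyncio.Queue while several consumers compute hashes, with the blocking filesystem calls pushed to a thread pool via run_in_executor:

import asyncio
import hashlib
import os


def md5_file(path, chunk_size=1024 * 1024):
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


async def produce(root, queue):
    # Walk the tree directory by directory; the blocking scandir call is pushed
    # to a worker thread so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    dirs = [root]
    while dirs:
        current = dirs.pop()
        entries = await loop.run_in_executor(None, lambda d=current: list(os.scandir(d)))
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                await queue.put(entry.path)
    await queue.put(None)  # sentinel: no more files


async def consume(queue, results):
    loop = asyncio.get_running_loop()
    while True:
        path = await queue.get()
        if path is None:
            await queue.put(None)  # pass the sentinel on so the other consumers stop too
            return
        results[path] = await loop.run_in_executor(None, md5_file, path)


async def main(root, workers=8):
    queue = asyncio.Queue(maxsize=1000)
    results = {}
    await asyncio.gather(produce(root, queue), *(consume(queue, results) for _ in range(workers)))
    return results


if __name__ == "__main__":
    print(len(asyncio.run(main("data"))))

Whether this actually beats the existing ThreadPoolExecutor would need measuring; the main benefit of the event loop is keeping many reads in flight without one blocked call stalling the rest.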

In my past experience, async is usually a bit slower than multiprocessing.

It finished successfully, right? And you were using 2.5.4? Just clarifying

Indeed

What I can see from the dump right away is that stat is around 100 times slower than on my machine, which is usually because of NFS, and in your case it probably is. But we need to take a closer look.

Yes, the cache is located on an NFS.

(From the logs, it looks like your dataset is ~1.7 million files, right?)

Actually, it is 12,601,130
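
To put the "stat is ~100 times slower" observation above into numbers, a rough micro-benchmark like the following (hypothetical, not part of DVC; stat_latency is a made-up helper) can be run once against a local directory and once against the NFS-mounted cache:

import os
import time


def stat_latency(directory, limit=10000):
    # Average os.stat() time over up to `limit` entries; comparing a local
    # directory against the NFS-mounted cache makes the per-call overhead visible.
    paths = []
    with os.scandir(directory) as entries:
        for entry in entries:
            paths.append(entry.path)
            if len(paths) >= limit:
                break
    start = time.perf_counter()
    for path in paths:
        os.stat(path)
    elapsed = time.perf_counter() - start
    return elapsed / max(len(paths), 1)


print("%.1f us per stat" % (stat_latency(".") * 1e6))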

Good news here is that since the last time…

Cool, that’s great.

@alealv If you could add --cprofile-dump dump.prof to any of those commands and share that file here - that would be enough for us to take a closer look.

I haven't run the same tests, but I can say that I ran dvc status (for the whole 5 TB dataset) for more than 3 hours and it didn't finish. Also, dvc commit ran for more than 5 days and I stopped it because I didn't know whether it was stalled or not.

It would be best if we coordinate on which tests to perform or create a specific repo to reproduce this.

Ok, so it looks like the status issue is actually separate. It’s still a dulwich/git issue, but related to gitignore performance.

dvc status in the DVC git repo:

(Screenshot attached: Screen Shot 2021-03-11 at 2 36 46 PM)

The add/repro performance issue should maybe go into its own GH issue, as it's related to the git status/git-track reminder (although gitignore performance probably affects those commands as well).

It's not related to cloning. It's pretty much done whenever we load stages, and yes, it's also what is affecting the add performance. Or at least, the issue that should be resolved by the linked PR is definitely the cause of the add and repro performance issues. There may be an unrelated status issue as well, though?

Based on the discord discussion it’s probably because of https://github.com/iterative/dvc/pull/5544

We can revert the status check change for now; I'll look into whether using pygit2's status implementation is faster.
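
For reference, a minimal dirty-check built on pygit2's status API might look like the sketch below (is_dirty is a made-up helper; whether this is actually faster than the dulwich path would have to be benchmarked on a real repo):

import pygit2


def is_dirty(path="."):
    # Repository.status() maps file paths to status flags; anything other than
    # "ignored" means the working tree differs from HEAD or the index.
    repo = pygit2.Repository(pygit2.discover_repository(path))
    for flags in repo.status().values():
        if flags != pygit2.GIT_STATUS_IGNORED:
            return True
    return False


print(is_dirty())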