dvc: status: it's slow

Bug Report

Description

status seems slow.

Reproduce

$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc pull
$ dvc status
Cloning

It lingers there for a few seconds before changing to "Data and pipelines are up to date".

Before 2.0 this was pretty much instantaneous. A few users have reported this on Discord BTW (2 in #q-and-a earlier today, one of them mentioned add being slow too). @efiop mentioned it could be related to the new Dulwich implementation.

Expected

Instantaneous report, esp. for such a simple project as in the example above.

Environment information

Output of dvc doctor:

$ dvc version
DVC version: 2.0.5
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-Ubuntu-18.04-bionic
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git

Additional Information (if any):

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 21 (13 by maintainers)

Most upvoted comments

👋

I'm working with a huge dataset (5 TB), and running the status or commit command is becoming a bit annoying. I would expect them to take a long time, but they have taken ages (days, really), and before computing hashes they give no output whatsoever, so I have no way of knowing whether the process is dead or still working.

So I decided to check if there is any way of improving the code and I found this thread.

Profiling

I’ve profiled the status command as done above but only for two different stages of my pipeline:

  1. The first one processes ~106 GB of data (~292,370 files). It uses the LibriSpeech ASR dataset. (Profiling screenshots attached: profiling_librispeech_1, profiling_librispeech_2.)

  2. The second one processes ~50 GB of data (~268,270 files). It uses the TED-LIUM dataset. (Profiling screenshots attached: profiling_1, profiling_2.)

So I noticed two things:

  1. Most of the time is spent waiting on a lock (code section)
  2. The other big chunk is due to os.walk (code section)

Comparing times with Bash

I’ve done the comparison only for the TED-LIUM dataset (second process)

  • List files:
aalvarez@ml1:/mnt/data/dvc/tmp/data/tedlium_waves$ time ls . > /dev/null

real    0m2,656s
user    0m0,607s
sys     0m0,350s

aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f > /dev/null

real    0m26,133s
user    0m0,222s
sys     0m3,540s

So, compared to DVC's os.walk-based traversal (538 s / 60 ≈ 8.97 min), this takes significantly less time.

  • Computing hashes
aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f -exec md5sum {} \; > /dev/null

real    12m4,800s
user    5m2,927s
sys     3m47,566s

aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f | parallel --progress md5sum > /dev/null

Computers / CPU cores / Max jobs to run
1:local / 36 / 36

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/268263/100%/0.0s

real    7m38,746s
user    10m26,780s
sys     9m43,082s

And hash computation is more or less comparable, depending on which number you take, but DVC (660 s / 60 = 11 min) is still above what could be accomplished.

Propositions

  1. True parallelism would come from using a multiprocessing.Pool instead of a ThreadPoolExecutor; I also believe it would be best to use imap_unordered.
  2. Maybe this is going too deep, but looking at the os.walk code, it first appends everything to a list and only then returns. Wouldn't it be better to use scandir() and yield entries as they are found? (A rough sketch combining both ideas follows below.)
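
As a rough illustration of both propositions (a sketch only, not DVC's actual code; iter_files, md5_file, and the "data" directory are made up for the example), a scandir-based generator can feed a multiprocessing.Pool through imap_unordered:

import hashlib
import os
from multiprocessing import Pool


def iter_files(root):
    # Yield file paths lazily with os.scandir instead of building a full list first.
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


def md5_file(path, chunk_size=1024 * 1024):
    # Hash one file in chunks so large files are not loaded into memory at once.
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


if __name__ == "__main__":
    # imap_unordered hands back results as soon as any worker finishes,
    # so progress can be reported without waiting for the slowest file.
    with Pool() as pool:
        for path, md5 in pool.imap_unordered(md5_file, iter_files("data"), chunksize=64):
            print(md5, path)

With imap_unordered the result order is not preserved, which is fine for hashing since each result carries its own path.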

Hi there! After almost one month the process finished.

dump.prof.gz

So, this week I worked again with this issue and I tried to explore in more depth DVC code.

I saw that there have been some improvements to the code and that you are now using fsspec. I also see that there is still a ThreadPoolExecutor; AFAIK this is to allow concurrent reads of multiple files and to avoid being blocked by a single read operation.

I was wondering if this could be improved by using cooperative multitasking instead. I've read that fsspec has an async option for some filesystems, mainly LocalFS, MemoryFS, and HTTPFS.

Furthermore, this could also scale with the available cores on the machine: one producer running os.walk and at least one consumer computing the hash values (more on this in this interesting talk).

PS: For the local file system there is this interesting asyncio library. There is also this one, but it doesn't seem to be as popular.
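
To make the producer/consumer idea concrete, here is a rough, hypothetical sketch (not DVC code; produce, consume, and the "data" directory are invented for the example): one task walks the tree and feeds an asyncio.Queue while several consumers compute hashes, with the blocking filesystem calls pushed to a thread pool via run_in_executor:

import asyncio
import hashlib
import os


def md5_file(path, chunk_size=1024 * 1024):
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


async def produce(root, queue):
    # Walk the tree directory by directory; the blocking scandir call is pushed
    # to a worker thread so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    dirs = [root]
    while dirs:
        current = dirs.pop()
        entries = await loop.run_in_executor(None, lambda d=current: list(os.scandir(d)))
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                await queue.put(entry.path)
    await queue.put(None)  # sentinel: no more files


async def consume(queue, results):
    loop = asyncio.get_running_loop()
    while True:
        path = await queue.get()
        if path is None:
            await queue.put(None)  # pass the sentinel on so the other consumers stop too
            return
        results[path] = await loop.run_in_executor(None, md5_file, path)


async def main(root, workers=8):
    queue = asyncio.Queue(maxsize=1000)
    results = {}
    await asyncio.gather(produce(root, queue), *(consume(queue, results) for _ in range(workers)))
    return results


if __name__ == "__main__":
    print(len(asyncio.run(main("data"))))

Whether this actually beats the existing ThreadPoolExecutor would need measuring; the main benefit of the event loop is keeping many reads in flight without one blocked call stalling the rest.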

In my past experience, async is usually a bit slower than multiprocessing.

It finished successfully, right? And you were using 2.5.4? Just clarifying

Indeed

What I can see from the dump right away is that stat is around 100 times slower than on my machine, which is usually because of NFS, and in your case it probably is. But we need to take a closer look.

Yes, the cache is located on an NFS.

(From the logs, it looks like your dataset is ~1.7 million files, right?)

Actually, it is 12,601,130
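
To put the "stat is ~100 times slower" observation above into numbers, a rough micro-benchmark like the following (hypothetical, not part of DVC; stat_latency is a made-up helper) can be run once against a local directory and once against the NFS-mounted cache:

import os
import time


def stat_latency(directory, limit=10000):
    # Average os.stat() time over up to `limit` entries; comparing a local
    # directory against the NFS-mounted cache makes the per-call overhead visible.
    paths = []
    with os.scandir(directory) as entries:
        for entry in entries:
            paths.append(entry.path)
            if len(paths) >= limit:
                break
    start = time.perf_counter()
    for path in paths:
        os.stat(path)
    elapsed = time.perf_counter() - start
    return elapsed / max(len(paths), 1)


print("%.1f us per stat" % (stat_latency(".") * 1e6))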

Good news here is that since the last time…

Cool, that’s great.

@alealv If you could add --cprofile-dump dump.prof to any of those commands and share that file here - that would be enough for us to take a closer look.

I haven't run the same tests, but I can say that I ran dvc status (for the whole 5 TB dataset) for more than 3 hours and it didn't finish. Also, dvc commit ran for more than 5 days and I stopped it because I didn't know whether it was stalled or not.

It would be best if we coordinate on which tests to perform or create a specific repo to reproduce this.

Ok, so it looks like the status issue is actually separate. It’s still a dulwich/git issue, but related to gitignore performance.

dvc status in the DVC git repo:

(Screenshot attached: Screen Shot 2021-03-11 at 2 36 46 PM)

The add/repro performance issue should maybe go into its own GH issue, as it's related to the git status/git-track reminder (although gitignore performance probably affects those commands as well).

It's not related to cloning. It's pretty much done whenever we load stages, and yes, it's also what is affecting the add performance. Or at least, the issue that should be resolved by the linked PR is definitely the cause of the add and repro performance issues. There may be an unrelated status issue as well, though?

Based on the discord discussion it’s probably because of https://github.com/iterative/dvc/pull/5544

We can revert the status check change for now; I'll look into whether using pygit2's status implementation is faster.
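
For reference, a minimal dirty-check built on pygit2's status API might look like the sketch below (is_dirty is a made-up helper; whether this is actually faster than the dulwich path would have to be benchmarked on a real repo):

import pygit2


def is_dirty(path="."):
    # Repository.status() maps file paths to status flags; anything other than
    # "ignored" means the working tree differs from HEAD or the index.
    repo = pygit2.Repository(pygit2.discover_repository(path))
    for flags in repo.status().values():
        if flags != pygit2.GIT_STATUS_IGNORED:
            return True
    return False


print(is_dirty())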