dvc: DVC commands hanging forever

When launching a command such as dvc status or dvc pull, the command seems to hang forever.

My setup

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic

$ which python
/home/myself/anaconda3/envs/my_env/bin/python

$ which dvc   
/home/myself/anaconda3/envs/my_env/bin/dvc

$ which pip   
/home/myself/anaconda3/envs/my_env/bin/pip

$ pip freeze | grep dvc
dvc==0.32.1

$ dvc --version
0.32.1

$ python --version
Python 3.6.7

My repo, which was initialized with DVC one month ago

$ git rev-parse HEAD # My latest git commit
b1c0935fbdb99304f99a6ee1d45560a39cd303f7

$ git status
On branch master
Your branch is up to date with 'origin/master'.


It took 2.01 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
nothing to commit, working tree clean

$ tree -a .dvc | tail -n 10
│       ├── fc33dca0a10f6cd657b9c688356dc1
│       └── fe15f39295fea2f0ca3bd4fad98ea3
├── config
├── .gitignore
├── lock
├── state
├── updater
└── updater.lock

257 directories, 53856 files

$ cat .dvc/config .dvc/lock .dvc/updater .dvc/updater.lock
['remote "upstream"']
url = s3://mybucket/mydir
[core]
remote = upstream
 26286
{"version": "0.32.1", "packages": {"linux": {"deb": "https://github.com/iterative/dvc/releases/download/0.32.1/dvc_0.32.1_amd64.deb", "rpm": "https://github.com/iterative/dvc/releases/download/0.32.1/dvc-0.32.1-1.x86_64.rpm"}, "windows": {"exe": "https://github.com/iterative/dvc/releases/download/0.32.1/dvc-0.32.1.exe"}, "osx": {"pkg": "https://github.com/iterative/dvc/releases/download/0.32.1/dvc-0.32.1.pkg"}}} 26286

$ du -chs .; find . | grep '\.dvc$' | wc; find . | grep -E '\.(JPG|jpg)$' | wc 
240G	.
240G	total
  54022   54024 3610085
  54021   54023 3393994

$ dvc status -v
Debug: PRAGMA user_version;
Debug: fetched: [(3,)]
Debug: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
Debug: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
Debug: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
Debug: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
Debug: PRAGMA user_version = 3;

# Hangs forever using one cpu

Try again from scratch

$ git clone git@github.com:my/repo.git
$ cd repo
$ git rev-parse HEAD # My latest git commit
b1c0935fbdb99304f99a6ee1d45560a39cd303f7

$ tree -a .dvc | tail -n 10
.dvc
├── config
└── .gitignore

0 directories, 2 files

$ cat .dvc/config
['remote "upstream"']
url = s3://mybucket/mydir
[core]
remote = upstream

$ du -chs .; find . | grep '\.dvc$' | wc; find . | grep -E '\.(JPG|jpg)$' | wc 
159M	.
159M	total
  54022   54024 3610085
      0       0       0

$ dvc status -v
Debug: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
Debug: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
Debug: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
Debug: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
Debug: PRAGMA user_version = 3;

# Hangs forever using one cpu

EDIT: My S3 bucket is up

$ aws s3 ls mybucket/mydir/ | tail
                           PRE f6/
                           PRE f7/
                           PRE f8/
                           PRE f9/
                           PRE fa/
                           PRE fb/
                           PRE fc/
                           PRE fd/
                           PRE fe/
                           PRE ff/

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

@efiop is there any performance testing or benchmarking for different dvc repo sizes? Is there an issue for it? It seems quite critical from my point of view.

That was indeed the problem. I fixed the repo this way:

# Copy repo, just in case
$ cp -r repo repo2 
$ cd repo2

# Purge dvc manually since `dvc destroy` is slow too
$ git rm -r .dvc/
$ rm -rf .dvc
$ find . -type f | grep -E '\.(gitignore|dvc)$' | xargs -L500 git rm

# Restore dvc
$ dvc init
$ dvc remote add -d upstream s3://mybucket/mydir
$ time dvc add */Images
dvc add */Images  616,18s user 609,28s system 15% cpu 2:15:50,26 total

$ du -chs .; find . | grep '\.dvc$' | wc; find . | grep -E '\.(JPG|jpg)$' | wc 
233G	.
233G	total
     56      56    2452
  54020   54020 3393904

$ time dvc push
dvc push  93,26s user 4,82s system 79% cpu 2:02,61 total

$ time dvc status
Pipeline is up to date. Nothing to reproduce.
dvc status  33,50s user 38,46s system 45% cpu 2:36,48 total

$ time dvc pull
dvc pull  137,80s user 42,13s system 50% cpu 5:53,27 total

$ git add/commit/push

I then updated the repo on another computer:

$ git pull
$ time dvc pull
real	247m24,327s # about 4.1 hours
user	66m34,904s
sys	23m35,167s

$ time dvc pull
real	3m24,860s
user	2m8,636s
sys	0m10,028s

$ du -chs .; find . | grep '\.dvc$' | wc; find . | grep -E '\.(JPG|jpg)$' | wc 
464G	.
464G	total
     56      56    2452
  54020   54020 3393904

$ time dvc status
Pipeline is up to date. Nothing to reproduce.

real	0m23,559s
user	0m17,585s
sys	0m6,256s
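The timings above use a comma as the decimal separator, which makes them awkward to compare by eye. A minimal sketch for converting a bash-style `real`/`user`/`sys` value to seconds; the function name `time_to_seconds` is made up for this illustration, and the zsh-style `2:15:50,26 total` format is deliberately not handled:

```python
import re


def time_to_seconds(value):
    """Convert a value such as '247m24,327s' to seconds (hypothetical helper)."""
    # Optional minutes part, then seconds with either ',' or '.' as decimal mark.
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:[.,]\d+)?)s", value.strip())
    if match is None:
        raise ValueError("unrecognized time value: %r" % value)
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2).replace(",", "."))
    return minutes * 60 + seconds


if __name__ == "__main__":
    first_pull = time_to_seconds("247m24,327s")
    second_pull = time_to_seconds("3m24,860s")
    print("first pull:  %.2f hours" % (first_pull / 3600))
    print("second pull: %.2f minutes" % (second_pull / 60))
```

With the values from the transcript, the first dvc pull comes out at a little over 4 hours and the second at under 4 minutes.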

dvc push was quick, which was a pleasant surprise. I understand why dvc add */Images took so long, since all files had to be hashed again. Some commands still take a few minutes; is that normal?

dvc pull on the second computer re-downloaded the cache without creating hard links to the existing files, so the directory size doubled in the process. It also recomputed the md5 hashes, even though the cache had just been downloaded.
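Whether a workspace file and its cache copy are actually hard-linked can be checked directly. A minimal sketch using only the Python standard library; the helper names and any paths passed to them are hypothetical:

```python
import os


def is_hard_linked(workspace_file, cache_file):
    """Return True if both paths point at the same inode on the same device."""
    # A reflink (copy-on-write clone) has its own inode, so this reports
    # False for reflinked files even though they share data blocks.
    return os.path.samefile(workspace_file, cache_file)


def link_count(path):
    """Number of directory entries pointing at this file's inode."""
    return os.stat(path).st_nlink
```

A link count of 1 on a workspace image would suggest it is an independent copy (or a reflink) rather than a hard link into the cache.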

Hi @Ngoguey42 !

Some commands still take a few minutes; is that normal?

If we are talking about dvc status, then yes, that does seem a little too long. Would you mind running it again (just dvc status) to see how long a second dvc status takes?

dvc pull on the second computer re-downloaded the cache without creating hard links to the existing files, so the directory size doubled in the process. It also recomputed the md5 hashes, even though the cache had just been downloaded.

The second checksum computation is needed to make sure that everything is in order after you’ve pulled your cache.
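To illustrate what such a verification pass costs, here is a rough sketch that hashes a file in chunks and derives the two-character-prefix path that the `tree` and `aws s3 ls` listings earlier in the thread suggest. This is an assumption drawn from those listings, not DVC's actual code, and the digest is not necessarily identical to the checksum DVC records for every file type; both function names are hypothetical:

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assumed_cache_relpath(md5_hex):
    """Assumed layout: first 2 hex chars as directory, remaining 30 as file name."""
    return os.path.join(md5_hex[:2], md5_hex[2:])
```

Every byte of every file has to be read once, so for a few hundred gigabytes of images the hashing alone is bounded by disk read throughput.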

Let me guess, are you using a Mac? 🙂 If so, then most likely reflinks are being used, which don’t register well with du, so it might show you twice the space usage, when actually, if you take a look at the free space on your drive, you will see that usage wasn’t actually doubled 😃
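One way to cross-check the du figure is to read the filesystem's own used/free counters before and after a pull. A minimal sketch using Python's `shutil.disk_usage`; the helper name `filesystem_usage_gib` is made up for this illustration:

```python
import shutil


def filesystem_usage_gib(path="."):
    """Return total/used/free space, in GiB, of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total": usage.total / gib,
        "used": usage.used / gib,
        "free": usage.free / gib,
    }


if __name__ == "__main__":
    for name, value in filesystem_usage_gib(".").items():
        print("%-5s %.1f GiB" % (name, value))
```

If `used` grows by roughly the size of the data set rather than twice that, the doubled du figure is an accounting artifact; if it grows by the full doubled amount, real copies were made.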

@Ngoguey42 just out of curiosity, if you have 5 minutes, could you also briefly describe your use case please? It’s just not a common thing for us to see 50K dvc stage files 😃
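For reference, the stage-file count can be reproduced without the `find | grep | wc` pipeline, which also matches the `.dvc` directory itself and so reports one more than the number of stage files. A minimal sketch, assuming stage files are simply regular files whose names end in `.dvc`; the function name is hypothetical:

```python
import os


def count_stage_files(root="."):
    """Count regular files ending in .dvc, ignoring .dvc/ and .git/ internals."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune internal directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in (".dvc", ".git")]
        count += sum(1 for name in filenames if name.endswith(".dvc"))
    return count


if __name__ == "__main__":
    print(count_stage_files("."))
```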

@ei-grad thanks!