dvc: DVC commands hanging forever
When launching a command such as dvc status or dvc pull, the command seems to hang forever.
My setup
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
$ which python
/home/myself/anaconda3/envs/my_env/bin/python
$ which dvc
/home/myself/anaconda3/envs/my_env/bin/dvc
$ which pip
/home/myself/anaconda3/envs/my_env/bin/pip
$ pip freeze | grep dvc
dvc==0.32.1
$ dvc --version
0.32.1
$ python --version
Python 3.6.7
My repo that was initialized with DVC 1 month ago
$ git rev-parse HEAD # My latest git commit
b1c0935fbdb99304f99a6ee1d45560a39cd303f7
$ git status
On branch master
Your branch is up to date with 'origin/master'.
It took 2.01 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
nothing to commit, working tree clean
$ tree -a .dvc | tail -n 10
│ ├── fc33dca0a10f6cd657b9c688356dc1
│ └── fe15f39295fea2f0ca3bd4fad98ea3
├── config
├── .gitignore
├── lock
├── state
├── updater
└── updater.lock
257 directories, 53856 files
$ cat .dvc/config .dvc/lock .dvc/updater .dvc/updater.lock
['remote "upstream"']
url = s3://mybucket/mydir
[core]
remote = upstream
26286
{"version": "0.32.1", "packages": {"linux": {"deb": "https://github.com/iterative/dvc/releases/download/0.32.1/dvc_0.32.1_amd64.deb", "rpm": "https://github.com/iterative/dvc/releases/download/0.32.1/dvc-0.32.1-1.x86_64.rpm"}, "windows": {"exe": "https://github.com/iterative/dvc/releases/download/0.32.1/dvc-0.32.1.exe"}, "osx": {"pkg": "https://github.com/iterative/dvc/releases/download/0.32.1/dvc-0.32.1.pkg"}}} 26286
$ du -chs .; find . | grep '\.dvc$' | wc; find . | grep -E '\.(JPG|jpg)$' | wc
240G .
240G total
54022 54024 3610085
54021 54023 3393994
$ dvc status -v
Debug: PRAGMA user_version;
Debug: fetched: [(3,)]
Debug: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
Debug: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
Debug: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
Debug: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
Debug: PRAGMA user_version = 3;
# Hangs forever using one cpu
Try again from scratch
$ git clone git@github.com:my/repo.git
$ cd repo
$ git rev-parse HEAD # My latest git commit
b1c0935fbdb99304f99a6ee1d45560a39cd303f7
$ tree -a .dvc | tail -n 10
.dvc
├── config
└── .gitignore
0 directories, 2 files
$ cat .dvc/config
['remote "upstream"']
url = s3://mybucket/mydir
[core]
remote = upstream
$ du -chs .; find . | grep '\.dvc$' | wc; find . | grep -E '\.(JPG|jpg)$' | wc
159M .
159M total
54022 54024 3610085
0 0 0
$ dvc status -v
Debug: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
Debug: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
Debug: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
Debug: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
Debug: PRAGMA user_version = 3;
# Hangs forever using one cpu
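To tell a true hang apart from a very slow run on a repository with this many stage files, one option is to run the command under a time limit and record how long it took. This is only a minimal sketch: the helper name run_with_timeout and the 600-second limit in the comment are assumptions for illustration, not part of DVC.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, elapsed_seconds).

    finished is False when the command was killed after timeout_s seconds.
    """
    start = time.monotonic()
    try:
        # Output is discarded; only completion and duration matter here.
        subprocess.run(cmd, timeout=timeout_s, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start


# Hypothetical usage (assumes dvc is on PATH and a 10-minute limit):
#   done, elapsed = run_with_timeout(["dvc", "status"], 600)
#   print("finished" if done else "timed out", f"after {elapsed:.1f}s")
```

A command that finishes just under the limit is slow rather than hung, which changes what is worth reporting.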
EDIT: My s3 bucket is up
$ aws s3 ls mybucket/mydir/ | tail
PRE f6/
PRE f7/
PRE f8/
PRE f9/
PRE fa/
PRE fb/
PRE fc/
PRE fd/
PRE fe/
PRE ff/
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (12 by maintainers)
@efiop is there any performance testing or benchmarking for different dvc repo sizes? Is there an issue for it? It seems quite critical from my point of view.
It was indeed the problem. I fixed the repo that way:
I then updated the repo on another computer:
dvc push was quick, which was a good surprise. I understand why dvc add */Images took long, since all files had to be hashed again. Some commands still take a few minutes; is that normal?
dvc pull on the second computer re-downloaded the cache without creating hard links to the existing files, and the directory size doubled in the process. It also performed the md5 hashing, even though the cache had just been downloaded.
Hi @Ngoguey42 !
If we are talking about dvc status, then yes, that seems a little too long. Mind running it again (just dvc status) to see how long a second dvc status takes?
The second round of checksum computations is needed to make sure that everything is in order after you’ve pulled your cache.
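For a sense of why re-hashing is expensive on a large cache, here is a minimal sketch of chunked MD5 hashing of a single file. It is not DVC's implementation; the name file_md5 and the 1 MiB chunk size are assumptions for illustration.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Every byte of every file has to be read, so the cost grows with the total data size (240G in the repository above), regardless of how the files arrived on disk.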
Let me guess, is it mac that you are using? 🙂 If so, then most likely there are reflinks being used, which don’t register well with du, so it might show you twice the space usage, when actually if you take a look at free space on your drive it will be like usage wasn’t actually doubled 😃
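One way to check this on your side, as a rough sketch: hard links share a device and inode number, which can be compared directly, while reflinked copies have distinct inodes and can only be noticed by comparing free space on the drive before and after the pull. The helper names same_inode and free_bytes are hypothetical, and the paths you pass in are whatever workspace and cache files you want to compare.

```python
import os
import shutil


def same_inode(path_a, path_b):
    """True if both paths are hard links to the same file (same device and inode)."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def free_bytes(path="."):
    """Free space, in bytes, on the filesystem that contains path."""
    return shutil.disk_usage(path).free
```

If same_inode returns False for a workspace file and its cache entry, and free_bytes dropped by roughly the size of the dataset after the pull, the files were most likely copied rather than linked.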
@Ngoguey42 just out of curiosity, if you have 5 minutes, could you also briefly describe your use case please? It’s just not a common thing for us to see 50K dvc stage files 😃
@ei-grad thanks!