dvc: collecting checksum for an external dir dependency takes too much time
DVC version:
DVC version: 0.71.0+d42a04
Python version: 3.7.5
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - True, hardlink - True, symlink - True
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')
Reproduce:
python -m cProfile -o s3-dep-images.prof -m dvc run -v -d s3://dvc-common/images -o out.txt "aws s3 ls s3://dvc-storage/get-started/fe/ >out.txt"
takes 3-4 hours to complete.
s3://dvc-common/images “Directory” has ~0.5M files.
Profile file:
https://www.dropbox.com/s/uqdjmm4pjkn0rj1/s3-dep-images.prof?dl=0
96% of time goes to thread lock.
Output:
Mon Dec 2 23:11:43 2019 s3-dep-images.prof
325633688 function calls (322461350 primitive calls) in 10724.500 seconds
Ordered by: internal time
List reduced from 7431 to 100 due to restriction <100>
ncalls tottime percall cumtime percall filename:lineno(function)
596400 10371.995 0.017 10371.995 0.017 {method 'acquire' of '_thread.lock' objects}
23894 152.922 0.006 152.922 0.006 {method 'read' of '_ssl._SSLSocket' objects}
5782566 12.714 0.000 21.243 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:83(get_token)
4475479 9.278 0.000 10.816 0.000 {method 'write' of '_io.TextIOWrapper' objects}
444810 5.790 0.000 52.679 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:669(_parse)
459 5.261 0.011 5.261 0.011 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
16273231 5.019 0.000 8.242 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:300(<genexpr>)
440822 4.726 0.000 5.103 0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:216(__init__)
440796 4.490 0.000 10374.041 0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:408(result)
134568 4.240 0.000 28.491 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:309(format_meter)
1352110/1352109 4.128 0.000 5.655 0.000 {method 'format' of 'str' objects}
889629 3.983 0.000 13.170 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:878(_parse_numeric_token)
297700 3.369 0.000 10376.913 0.035 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:264(wait)
269133 3.349 0.000 4.986 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:209(_is_ascii)
445259/463 3.343 0.000 83.457 0.180 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:362(_handle_structure)
16138665 3.222 0.000 3.222 0.000 {built-in method unicodedata.east_asian_width}
579377 2.974 0.000 12.928 0.000 {built-in method builtins.sum}
1334439 2.732 0.000 2.732 0.000 {method 'replace' of 'datetime.datetime' objects}
2203982 2.699 0.000 5.480 0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:183(parts)
440797 2.623 0.000 10418.901 0.024 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:1061(__iter__)
2673319/521 2.482 0.000 83.474 0.160 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:299(_parse_shape)
440797 2.433 0.000 10376.699 0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:591(result_iterator)
441675 2.391 0.000 2.621 0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:700(__str__)
1334430 2.356 0.000 3.138 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:434(append)
2807755 2.261 0.000 2.261 0.000 {method 'sub' of 're.Pattern' objects}
12454653 2.224 0.000 3.369 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:214(isnum)
881628 1.954 0.000 4.412 0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:63(parse_parts)
440797 1.850 0.000 8.336 0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py:145(submit)
2203982 1.828 0.000 7.840 0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:209(__hash__)
5782566 1.813 0.000 23.057 0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:195(__next__)
11596386 1.809 0.000 1.811 0.000 {built-in method builtins.getattr}
1 1.726 1.726 10433.537 10433.537 /Users/ivan/Projects/dvc/dvc/remote/base.py:180(_calculate_checksums)
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 16 (14 by maintainers)
@menshikh-iv right! thanks. I doubt that it’s GIL bounded (it should be IO mostly - HEAD s3://X/file). I’ll check yappi or something similar.
Also, was able to speed it up 8-10x by passing
max_pool_connections
to S3 client:in
remote/s3.py
, otherwise it is bounded by 10 simultaneous connections.I wonder what else is bounded by this limit - status collection?
Ok, more results on this stuff:
max_pool_connections
not the reason. It might affect the performance but not 7x-8x on Linux and Mac. (So, no significant changes tostatus
,pull
,push
, etc. Need to check if there are small improvements - was not focusing on that.)--jobs 100
(forpush/pull/status
) I don’t see 100 ESTABLISHED connections - might mean that we indeed hit GIL - will need to run yappi or strace.remote/base.py
:config
is a remote sectionremote_config = RemoteConfig(repo.config)
not full config, socore
section is always empty andchecksum_jobs
is always default - around 2 or 4. Which in case of collecting them from S3 ETags is way too small - it can be set to 100.Also, it’s obvious that requirements are so different here that we can’t use the same setting and the same defaults - for calculating checksums locally (CPU) vs collecting them from S3 (IO).
@shcheklein
cProfile
isn’t a good choice for multithread profiling (cProfile
doesn’t support it without “hacks”), for that case you can useAlso, you can try to use
strace
tool and get aggregated information about syscalls (esp. for locks), because this can be a GIL issue (i.e. program spend most of the time on GIL lock/release or waiting for GIL). This will be something likestrace -c -fp <MAIN_PROCESS_PID>
.Hi ! We’re having the same problem as you can see in the profiles in the attached photo, where
dvc pull
spends a lot of time on Querying remote cache, and when looking at the CProfile it looks like the majority of that time is spent on a thread.lock function. Our dataset is not that big, about a 1000 files of ~5 mb in average, but they are distributed in different folders, and also we’re tracking most of them from the yaml stages. Don’t know if this issue is still unresolved, or if I’m doing something wrong.Also note there are 596,400 calls which must be roughly the number of files to calculate sums for.
I don’t think it does - it’s captured the main thread mostly waiting for the worker results.