dvc: collecting checksum for an external dir dependency takes too much time

DVC version:

DVC version: 0.71.0+d42a04
Python version: 3.7.5
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - True, hardlink - True, symlink - True
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

Reproduce:

python -m cProfile -o s3-dep-images.prof -m dvc run -v -d s3://dvc-common/images -o out.txt "aws s3 ls s3://dvc-storage/get-started/fe/ >out.txt"

takes 3-4 hours to complete.

s3://dvc-common/images “Directory” has ~0.5M files.

Profile file:

https://www.dropbox.com/s/uqdjmm4pjkn0rj1/s3-dep-images.prof?dl=0

96% of time goes to thread lock.

Output:

Mon Dec  2 23:11:43 2019    s3-dep-images.prof

         325633688 function calls (322461350 primitive calls) in 10724.500 seconds

   Ordered by: internal time
   List reduced from 7431 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   596400 10371.995    0.017 10371.995    0.017 {method 'acquire' of '_thread.lock' objects}
    23894  152.922    0.006  152.922    0.006 {method 'read' of '_ssl._SSLSocket' objects}
  5782566   12.714    0.000   21.243    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:83(get_token)
  4475479    9.278    0.000   10.816    0.000 {method 'write' of '_io.TextIOWrapper' objects}
   444810    5.790    0.000   52.679    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:669(_parse)
      459    5.261    0.011    5.261    0.011 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
 16273231    5.019    0.000    8.242    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:300(<genexpr>)
   440822    4.726    0.000    5.103    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:216(__init__)
   440796    4.490    0.000 10374.041    0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:408(result)
   134568    4.240    0.000   28.491    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:309(format_meter)
1352110/1352109    4.128    0.000    5.655    0.000 {method 'format' of 'str' objects}
   889629    3.983    0.000   13.170    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:878(_parse_numeric_token)
   297700    3.369    0.000 10376.913    0.035 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:264(wait)
   269133    3.349    0.000    4.986    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:209(_is_ascii)
445259/463    3.343    0.000   83.457    0.180 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:362(_handle_structure)
 16138665    3.222    0.000    3.222    0.000 {built-in method unicodedata.east_asian_width}
   579377    2.974    0.000   12.928    0.000 {built-in method builtins.sum}
  1334439    2.732    0.000    2.732    0.000 {method 'replace' of 'datetime.datetime' objects}
  2203982    2.699    0.000    5.480    0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:183(parts)
   440797    2.623    0.000 10418.901    0.024 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:1061(__iter__)
2673319/521    2.482    0.000   83.474    0.160 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:299(_parse_shape)
   440797    2.433    0.000 10376.699    0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:591(result_iterator)
   441675    2.391    0.000    2.621    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:700(__str__)
  1334430    2.356    0.000    3.138    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:434(append)
  2807755    2.261    0.000    2.261    0.000 {method 'sub' of 're.Pattern' objects}
 12454653    2.224    0.000    3.369    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:214(isnum)
   881628    1.954    0.000    4.412    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:63(parse_parts)
   440797    1.850    0.000    8.336    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py:145(submit)
  2203982    1.828    0.000    7.840    0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:209(__hash__)
  5782566    1.813    0.000   23.057    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:195(__next__)
 11596386    1.809    0.000    1.811    0.000 {built-in method builtins.getattr}
        1    1.726    1.726 10433.537 10433.537 /Users/ivan/Projects/dvc/dvc/remote/base.py:180(_calculate_checksums)

Artboard 1

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 16 (14 by maintainers)

Most upvoted comments

@menshikh-iv right! thanks. I doubt that it’s GIL bounded (it should be IO mostly - HEAD s3://X/file). I’ll check yappi or something similar.

Also, was able to speed it up 8-10x by passing max_pool_connections to S3 client:

return session.client(
            "s3", endpoint_url=self.endpoint_url, use_ssl=self.use_ssl,
            config=botocore.config.Config(max_pool_connections=100)
        )

in remote/s3.py, otherwise it is bounded by 10 simultaneous connections.

I wonder what else is bounded by this limit - status collection?

Ok, more results on this stuff:

  • max_pool_connections not the reason. It might affect the performance but not 7x-8x on Linux and Mac. (So, no significant changes to status, pull, push, etc. Need to check if there are small improvements - was not focusing on that.)
  • when we do --jobs 100 (for push/pull/status) I don’t see 100 ESTABLISHED connections - might mean that we indeed hit GIL - will need to run yappi or strace.
  • the actual bug is here remote/base.py:
        core = config.get(Config.SECTION_CORE, {})
        self.checksum_jobs = core.get(
            Config.SECTION_CORE_CHECKSUM_JOBS, self.CHECKSUM_JOBS
        )

config is a remote section remote_config = RemoteConfig(repo.config) not full config, so core section is always empty and checksum_jobs is always default - around 2 or 4. Which in case of collecting them from S3 ETags is way too small - it can be set to 100.

Also, it’s obvious that requirements are so different here that we can’t use the same setting and the same defaults - for calculating checksums locally (CPU) vs collecting them from S3 (IO).

@shcheklein cProfile isn’t a good choice for multithread profiling (cProfile doesn’t support it without “hacks”), for that case you can use

Also, you can try to use strace tool and get aggregated information about syscalls (esp. for locks), because this can be a GIL issue (i.e. program spend most of the time on GIL lock/release or waiting for GIL). This will be something like strace -c -fp <MAIN_PROCESS_PID>.

Hi ! We’re having the same problem as you can see in the profiles in the attached photo, where dvc pull spends a lot of time on Querying remote cache, and when looking at the CProfile it looks like the majority of that time is spent on a thread.lock function. Our dataset is not that big, about a 1000 files of ~5 mb in average, but they are distributed in different folders, and also we’re tracking most of them from the yaml stages. Don’t know if this issue is still unresolved, or if I’m doing something wrong.

image

Also note there are 596,400 calls which must be roughly the number of files to calculate sums for.

I don’t think it does - it’s captured the main thread mostly waiting for the worker results.