dvc: directory downloads and uploads are slow
Bug Report
- This affects multiple commands, `get`/`fetch`/`pull`/`push`… so I didn't put a tag.
Description
We have added a directory containing 70,000 small images to the Dataset Registry. There is also a tar.gz version of the dataset, which downloads quickly:
time dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz
dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz 3.41s user 1.36s system 45% cpu 10.411 total
When I issue:

dvc get https://github.com/iterative/dataset-registry mnist/images

I get an ETA of ~16 hours for 70,000 downloads on my VPS. This is reduced to ~3 hours on my faster local machine. I didn't wait for these to finish, so the real times may differ, but you get the idea.
With `-j 10` it doesn't differ much. `dvc pull` is better; it takes about 20-25 minutes. (At this point, while I was writing this report, a new version was released, so the rest of the report is with 2.4.1 😄) `dvc pull -j 100` seems to reduce the ETA to 10 minutes. (I waited for `dvc pull -j 100` to finish and it took ~15 minutes.)
I also had this issue while uploading the data in iterative/dataset-registry#18 and we have a discussion there.
Reproduce
git clone https://github.com/iterative/dataset-registry
cd dataset-registry
dvc pull mnist/images.dvc
or
dvc get https://github.com/iterative/dataset-registry mnist/images
Expected
We will use this dataset (and a similar `fashion-mnist` one) in example repositories, so we would like an acceptable time (<2 minutes) for the whole directory to download.
Environment information
Output of `dvc doctor`:

Some of this report is with 2.3.0, but currently:
$ dvc doctor
DVC version: 2.4.1 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-74-generic-x86_64-with-glibc2.29
Supports: azure, gdrive, gs, hdfs, webhdfs, http, https, s3, ssh, oss
Discussion
DVC creates a new `requests.Session` object for each connection, and this requires a new HTTP(S) connection for each file. Although the files are small, establishing a new connection for each file takes time. There is a mechanism in HTTP/1.1 (pipelining over a persistent connection) to reuse the same connection, but `requests` doesn't support it.
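To make the connection cost concrete, here is a rough sketch (not DVC's actual code) contrasting a fresh `requests.Session` per file with a single shared session. A shared session lets urllib3's pool reuse keep-alive connections between sequential requests, though it still cannot pipeline several requests over one connection; the file naming and timeout below are arbitrary.

```python
import requests


def download_with_new_session_per_file(urls, dest_dir):
    # Pattern described above: every file pays full TCP/TLS connection setup.
    for i, url in enumerate(urls):
        with requests.Session() as session:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            with open(f"{dest_dir}/{i:06d}.bin", "wb") as out:
                out.write(resp.content)


def download_with_shared_session(urls, dest_dir):
    # One Session for all files: keep-alive connections are reused per host.
    with requests.Session() as session:
        for i, url in enumerate(urls):
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            with open(f"{dest_dir}/{i:06d}.bin", "wb") as out:
                out.write(resp.content)
```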
Note that increasing the number of jobs doesn't make much difference, because servers usually limit the number of connections per IP. Even if you have 100 threads/processes downloading, probably only a small number (~4-8) of them can be connected at a time. (I was banned from AWS once while testing the commands with a large `-j`.)
There may be 2 solutions for this:

- DVC can consider directories as implicit `tar` archives. Instead of a directory containing many files, it would work with a single tar file per directory in the cache and expand it in `checkout`. `tar` and `gzip` are supported in the Python standard library. This probably requires the whole `Repo` class to be updated, though. (A minimal sketch of this idea follows the list below.)
- Instead of `requests`, DVC can use a custom solution or another library like `dugong` that supports HTTP pipelining. I didn't test any HTTP pipelining solution in Python, so I can't vouch for any of them, but this may be better for all asynchronous operations using HTTP(S).
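As a rough sketch of the first option, the pack/expand steps could look like this using only the standard library (`tarfile` with gzip compression); the function names are illustrative, not DVC internals:

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir: str, archive_path: str) -> None:
    # Store one compressed archive instead of tens of thousands of small files.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=Path(src_dir).name)


def unpack_directory(archive_path: str, dest_dir: str) -> None:
    # Expand the archive back into individual files at checkout time.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)
```

The trade-off raised later in the thread applies here: treating a directory as one archive gives up file-level de-duplication between directory versions.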
About this issue
- State: open
- Created 3 years ago
- Reactions: 2
- Comments: 28 (18 by maintainers)
Removing the p1 label here because this is clearly not going to be closed in the next sprint or so, which should be the goal for a p1. However, it's still a high priority that we will continue working to improve.
TLDR: It looks like the multiple processes launched by s5cmd do improve performance, whereas the additional threads in dvc don't seem to be helping much.

`s5cmd --numworkers 100` finishes in about 5 minutes for me on multiple tries. Here's a random snapshot of CPU and network activity while files were downloading:

`dvc pull -j 100` varied more across attempts but took up to 20 minutes. Here are CPU and network activity snapshots:

We discussed this in planning, and there are a few different items to address in this issue:
- Difference between `get` and `pull` - this will hopefully be addressed soon.
- Optimization of `http` - for now, it seems like the performance of `http` is not too different from `s3`, so probably not worth pursuing this now.
- Optimization of download operations for directories.
I did a dirty performance comparison:

- `dvc pull -j 10` took 22 minutes for ~70k files.
- `s5cmd --numworkers=10 cp "s3://dvc-public/remote/dataset-registry/*" download` took 21 minutes for ~140k files.

So my rough estimate is that `dvc pull` is about 2x slower than `s5cmd` here. It's worth looking into why and hopefully optimizing further, but for now we shouldn't expect to get this under ~10 minutes at best.

My comparison between `dvc pull` and `s5cmd` is similar to Dave's, but more like a 3x difference:

- `s5cmd cp 's3://dvc-public/remote/get-started-experiments/*'` takes 11:35
- `dvc pull -j 100` takes 33:19

These are with 70,000 identical files.

The default `--numworkers` for `s5cmd` is 256. In the above experiment it's set to 10.

I think that's obsoleted by RFC 7230.
I can test HTTP pipelining with S3 with a script, if you'd like. If I can come up with a faster download using a single thread, we can talk about implementing it in the core.
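For what it's worth, a rough single-threaded test could look like the sketch below. It uses `aiohttp` with a small pool of persistent connections (connection reuse rather than strict pipelining, in line with the RFC 7230 point above); the URL list and pool size are placeholders.

```python
# Rough benchmark sketch (not DVC code): fetch many small files on a single
# thread with asyncio + aiohttp, reusing a bounded pool of keep-alive
# connections instead of opening a new connection per file.
import asyncio
import time

import aiohttp

URLS = []  # placeholder: fill with the HTTP(S) object URLs to test


async def fetch(session, url):
    async with session.get(url) as resp:
        resp.raise_for_status()
        return len(await resp.read())


async def main():
    connector = aiohttp.TCPConnector(limit=8)  # small, server-friendly pool
    async with aiohttp.ClientSession(connector=connector) as session:
        start = time.monotonic()
        sizes = await asyncio.gather(*(fetch(session, url) for url in URLS))
        print(f"{len(sizes)} files, {sum(sizes)} bytes "
              f"in {time.monotonic() - start:.1f}s")


if __name__ == "__main__":
    asyncio.run(main())
```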
DVC push/pull performance for many small files is a known limitation. But part of the issue is probably also specific to the HTTP remote type; it's possible you would get better performance pulling from the registry using S3 rather than HTTP (due to remote fs implementation differences).
For the purpose of the example projects, it may be better to just handle compressing/extracting images as an archive within the example pipelines.
This does not really work as a general solution for DVC, as we would lose the ability to de-duplicate files between directory versions.