moby: Docker does not extract images in parallel
Feature request
Description of problem: Docker does not extract images in parallel
docker version:

```
Client:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64
```
docker info:
Containers: 23
Running: 0
Paused: 0
Stopped: 23
Images: 15
Server Version: 1.10.3
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 303
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 4.2.0-30-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.945 GiB
Name: something-else
ID: 37YL:PW3M:M5GE:PSQ5:TMHR:QCFI:M64D:53RJ:2LOA:CLU2:BXUR:X2RG
WARNING: No swap limit support
uname -a:
Linux something-else 4.2.0-30-generic #36~14.04.1-Ubuntu SMP Fri Feb 26 18:49:23 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Environment details (AWS, VirtualBox, physical, etc.): physical
How reproducible: very easy
Steps to Reproduce:
- docker run some image you don't have locally
- notice that the layers are extracted one by one; this could be parallelized
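For illustration, a pull against a cold cache looks something like this (the image name, layer IDs, and sizes below are made up): only one layer is ever in the Extracting state while already-downloaded layers sit idle.

```
$ docker pull some/image:latest
latest: Pulling from some/image
4d1ab3827f6b: Extracting [=============>                     ]  24.8 MB/65.7 MB
9aa2ecc9f5f5: Download complete
78fd9d8745b3: Download complete
```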
Actual Results: image layers are extracted one at a time
Expected Results: image layers are extracted in parallel
Additional info: None
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 18 (9 by maintainers)
I don’t know about y’all, but in Docker 18.09.0, build 4d60db4, I’m seeing this on a 16 core machine:

[screenshot]
Naively, “extract” implies decompressing. While one needs to mount the layers in sequential order for the reasons @unclejack mentioned, it doesn’t immediately follow that the layers must also be decompressed sequentially. Would it be possible to decompress in parallel, and have only the mount operation sequential?
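A minimal sketch of that split in Go, with hypothetical decompress and apply helpers standing in for Docker's real distribution and storage-driver code:

```go
// Hypothetical sketch, not the moby API: decompress all layers concurrently
// into temporary tar files, then apply them in dependency order.
package main

import (
	"fmt"
	"sync"
)

// Layer stands in for a downloaded, still-compressed layer blob.
type Layer struct {
	ID string
}

// decompress and apply are hypothetical stand-ins for Docker's real
// distribution and storage-driver code.
func decompress(l Layer) (string, error) { return "/tmp/" + l.ID + ".tar", nil }

func apply(tarPath, parent string) error { return nil }

// pull fans the CPU-bound decompression out across goroutines, but applies
// the results serially, because layer i+1 is a diff against layer i.
func pull(chain []Layer) error {
	tmp := make([]string, len(chain))
	errs := make([]error, len(chain))

	var wg sync.WaitGroup
	for i, l := range chain {
		wg.Add(1)
		go func(i int, l Layer) {
			defer wg.Done()
			tmp[i], errs[i] = decompress(l)
		}(i, l)
	}
	wg.Wait()

	for _, err := range errs {
		if err != nil {
			return err
		}
	}

	parent := ""
	for i, l := range chain {
		if err := apply(tmp[i], parent); err != nil {
			return err
		}
		parent = l.ID
	}
	return nil
}

func main() {
	fmt.Println(pull([]Layer{{"a"}, {"b"}, {"c"}, {"d"}}))
}
```

The obvious cost, echoed later in this thread, is the temporary on-disk copy of every decompressed layer.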
Use case: Pulling a 5GB image on a fresh 32 core VM in the cloud [GCE / GCR] takes more than 3 minutes, most of them spent “extracting”. There is nothing else to do until the image is pulled, so we end up with 31 cores sitting idle. Note that the “download” part is actually quite snappy.
We also don’t really want a pull operation consuming tons of CPU on a host with running containers. Note that the next release will use pigz (parallel decompression) for layers; https://github.com/moby/moby/pull/35697
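For context, that change works by shelling out to the unpigz binary when it is available. A rough sketch of the approach (not the actual moby code):

```go
// Rough sketch of the idea behind that change (not the actual moby code):
// use the unpigz binary when it is on PATH, else fall back to compress/gzip.
package main

import (
	"compress/gzip"
	"io"
	"os"
	"os/exec"
)

// cmdReadCloser reaps the helper process when the stream is closed.
type cmdReadCloser struct {
	io.ReadCloser
	cmd *exec.Cmd
}

func (c *cmdReadCloser) Close() error {
	c.ReadCloser.Close()
	return c.cmd.Wait()
}

// gzDecompress returns the decompressed contents of r as a stream.
func gzDecompress(r io.Reader) (io.ReadCloser, error) {
	if path, err := exec.LookPath("unpigz"); err == nil {
		cmd := exec.Command(path, "-d", "-c") // stdin -> stdout filter
		cmd.Stdin = r
		out, err := cmd.StdoutPipe()
		if err != nil {
			return nil, err
		}
		if err := cmd.Start(); err != nil {
			return nil, err
		}
		return &cmdReadCloser{out, cmd}, nil
	}
	return gzip.NewReader(r)
}

func main() {
	// Usage: go run main.go < layer.tar.gz > layer.tar
	rc, err := gzDecompress(os.Stdin)
	if err != nil {
		panic(err)
	}
	defer rc.Close()
	io.Copy(os.Stdout, rc)
}
```

The gzip format limits how much the decompression itself can be parallelized, so unpigz mostly helps by moving reading, writing, and checksumming off the decompression thread.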
I agree wholeheartedly.

In private data centers, or even in public cloud services that provide their own container repositories (e.g. Google Cloud Platform), docker pulls of large images can definitely be decompression-limited rather than network-limited. Even after the migration to pigz, which unlike pbzip2 cannot speed up decompression much, both continuous integration and deployment flows that start with cold image caches are severely hampered by initial docker pull completion latencies.

While I understand that the general target of containerization is lightweight containers, where image sizes should naturally be on the smaller end of the spectrum, many of the benefits still carry over to applications that inherently have very large container images (i.e. not due to inclusion of “data” in containers). The image/layer decompression performance issues go quite a ways back in GitHub issue history and seem to all eventually go cold…
Right now, as things stand, layer decompression is still “hard-coded” to use the internal pigz implementation, which makes it difficult to inject an external decompression helper (e.g. pbzip2) for private/custom usage so that the pull path would be forced to use a different decompressor (other than pigz). And given that layers are decompressed serially, there’s no way to take advantage of parallelism at that level, either.
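A hypothetical sketch of the kind of hook being asked for here; no such extension point exists in Docker, and pbzip2 only decompresses in parallel when the archive was compressed by pbzip2 in the first place:

```go
// Hypothetical extension point, sketched to match the comment above; no such
// hook exists in Docker. Any filter-style helper that reads compressed bytes
// on stdin and writes tar bytes to stdout could be plugged in.
package main

import (
	"io"
	"os"
	"os/exec"
)

// Decompressor turns a compressed layer stream into a plain tar stream.
type Decompressor func(io.Reader) (io.Reader, error)

// external wraps a command line such as "pbzip2 -d -c" or "unpigz -c" as a
// Decompressor.
func external(name string, args ...string) Decompressor {
	return func(r io.Reader) (io.Reader, error) {
		cmd := exec.Command(name, args...)
		cmd.Stdin = r
		out, err := cmd.StdoutPipe()
		if err != nil {
			return nil, err
		}
		return out, cmd.Start()
	}
}

func main() {
	// Usage: go run main.go < layer.tar.bz2 > layer.tar
	d := external("pbzip2", "-d", "-c")
	tar, err := d(os.Stdin)
	if err != nil {
		panic(err)
	}
	io.Copy(os.Stdout, tar)
}
```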
What do you think of a parallel flag to specify how many layers to extract at once, one that knows whether your storage driver supports parallel extraction?

Peace,
Harsh Singh

On Thu, Apr 7, 2016 at 12:20 PM, Brian Goff notifications@github.com wrote:
This isn’t going to happen for technical reasons. There’s no room for debate here. AUFS would support this, but most of the other storage drivers wouldn’t support this. This also requires having specific code to implement at least two different code paths: one with this parallel extraction and one without it.
An image is basically a graph like A -> B -> C -> D, and most Docker storage drivers can’t extract a layer whose parent layers haven’t been extracted already.
Should you want to speed up docker pull, you most certainly want faster storage and a faster network. Go itself will contribute to performance gains once Go 1.7 is out and we start using it.

I’m going to close this right now because any gains from parallel extraction for specific drivers aren’t worth the complexity of the code, the effort needed to implement it, and the effort needed to maintain it in the future.
I doubt it would make sense in most cases. You’d have to decompress to temporary storage, then copy the decompressed files on top of the previous layer. That’s a lot of extra I/O load just to allow some extra concurrency on decompression.