moby: Docker build/push performance is inexplicably slow
Description
I am evaluating whether I could use docker as a reproducible build and runtime environment for a local cluster application. I’ve got it mostly working, except I’ve run into the unexpected issue that the build time became excruciatingly slow due to docker image build/push. I’ve isolated the problem to be related to sheer data volume.
Steps to reproduce the issue:
1. Create test folder
$ mkdir testimage
2. Generate 1GB incompressible file
$ openssl enc -aes128 -in /dev/zero -out >(dd bs=64K count=16384 iflag=fullblock of=testimage/data.img) -e -k $RANDOM -bufsize 134217728
3. Generate test Dockerfile
$ echo "FROM scratch" > testimage/Dockerfile
$ echo "ADD data.img /data.img" >> testimage/Dockerfile
4. Build the docker image
$ /usr/bin/time docker build -t testimage testimage
5. Tag the image
$ docker tag testimage localhost:5000/testimage
6. Push the image to the local registry
$ /usr/bin/time docker push localhost:5000/testimage
Describe the results you received: Step (4) output above looks like this:
Sending build context to Docker daemon 1.074 GB
Step 1 : FROM scratch
--->
Step 2 : ADD data.img /data.img
---> 9cacb55ce73d
Removing intermediate container 225630bb7d2e
Successfully built 9cacb55ce73d
1.61user 1.85system 0:39.16elapsed 8%CPU (0avgtext+0avgdata 20728maxresident)k
0inputs+0outputs (0major+1658minor)pagefaults 0swaps
Which works out to be about 25MB/s
Step (6) output above looks like this:
The push refers to a repository [localhost:5000/testimage]
fa2f5c3d71b4: Pushed
latest: digest: sha256:775f2fdf508b162f5ef26ee96292a9b8c1bbbabe4675e071ce02c1a815cb5845 size: 531
0.31user 0.14system 1:40.09elapsed 0%CPU (0avgtext+0avgdata 20172maxresident)k
0inputs+0outputs (0major+1663minor)pagefaults 0swaps
Which works out to be about 10MB/s
Describe the results you expected:
Well, it’s basically a local copy + hash. There are 64 cores with 2 threads each, 1.5 TB of free RAM and nobody else on the system. I expected… a lot faster? Even if it’s single-threaded, I would expect maybe >300MB/s for build and >1GB/s for localhost push?
As a point of comparison:
$ sha1sum data.img
Takes 2.5 seconds
and
$ sha256sum data.img
Takes 5 seconds
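For the copy half of that expectation, a quick sketch (assuming /tmp sits on the same disk and is not tmpfs):
$ time cp data.img /tmp/data.img.copy
$ rm /tmp/data.img.copy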
Is this because of loop devices?
Additional information you deem important (e.g. issue happens only occasionally):
Just for a point of comparison, I decided to try something really not designed for handling binaries and do a similar experiment. I copied the data.img to an empty git repo, with a corresponding bare remote on a different computer on the local network.
$ git add data.img
Took 36s for a pitiful ~27MB/s
$ git commit -m "Test"
Took less than a second
$ git push srvr4 master
Took 33.8s for a ~29.5MB/s
So git add/commit/push over the local network is still faster than docker build/docker push locally, even though docker is a tool nominally designed for blobs of incompressible binary data and shoving those into git is probably the worst thing you can do to it.
Output of docker version:
$ docker --version
Docker version 1.12.3, build 6b644ec
Output of docker info:
$ docker info
Containers: 6
Running: 1
Paused: 0
Stopped: 5
Images: 3
Server Version: 1.12.3
Storage Driver: devicemapper
Pool Name: docker-8:2-3671813-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 11.92 GB
Data Space Total: 107.4 GB
Data Space Available: 95.46 GB
Metadata Space Used: 7.684 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.14 GB
Thin Pool Minimum Free Space: 10.74 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host bridge null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-lowlatency
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 128
Total Memory: 1.476 TiB
Name: lfvs1
ID: PPFZ:4NG7:WN7F:HTAI:GH3M:UOYE:DFV4:THVD:LRKW:2W4Z:NSPU:TNSP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Additional environment details (AWS, VirtualBox, physical, etc.):
Physical box with plenty of ram and CPU, and reasonably fast disks:
$ head -n1 /etc/issue
Ubuntu 16.04.1 LTS \n \l
$ uname -a
Linux lfvs1 4.4.0-47-lowlatency #68-Ubuntu SMP PREEMPT Wed Oct 26 21:00:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo | grep "model name" | cut -d ':' -f 2 | sort | uniq -c
128 Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz
$ free -h
total used free shared buff/cache available
Mem: 1.5T 5.5G 1.0T 19M 459G 1.5T
Swap: 0B 0B 0B
About this issue
- State: open
- Created 8 years ago
- Reactions: 30
- Comments: 46 (14 by maintainers)
Any updates? It's been 2 years and the issue still exists.
Pushing with buildah is much faster than with docker-cli because it uses “github.com/klauspost/pgzip” instead of “compress/gzip”.
According to the pgzip benchmark, pgzip is 123x faster than gzip.
Is anyone interested in sending a PR for moby?
@D0han @Tomasu I am generally addressing this problem with the approach taken in containerd but I am not sure how quickly we can realize those gains in docker.
The quickest fix would be to allow one to completely disable layer compression.
@groman2 There are a few things going on here. To understand push performance, you really need to understand what is happening. We’ll dive into that, then we can look at what is going on with this machine, because it is very slow.
Pushing an image isn’t exactly straightforward, but we can break the process down into pushing metadata and pushing layers. Since pushes are dominated by the layers, we’ll look at what happens with a single layer. Roughly, a layer push consists of the following steps:
1. Read the layer’s files from the storage driver on disk.
2. Assemble them into a tar stream.
3. Compress the tar stream with gzip.
4. Hash the compressed stream to produce the layer digest.
5. Upload the hashed, compressed data to the registry.
For the most part, this is all done as a “cut-through” operation, meaning we tar directly into the compressor and directly hash the output, sending the hashed data to the registry. This gives the property that the bandwidth of the push is dominated by the slowest component. If everything is working correctly, the disks, tarring and hashing operate at roughly the same bandwidth, making the push process network limited. The tradeoff is that throughput in one pipeline stage can starve or back off other stages, resulting in slower performance.
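Each of the local stages can be approximated in isolation with plain shell tools, along these lines (a rough sketch, not the daemon’s exact code path; it assumes the testimage directory from the report and times the whole pipeline on each run):
$ time tar cf - testimage > /dev/null
$ time tar cf - testimage | gzip -c > /dev/null
$ time tar cf - testimage | gzip -c | sha256sum
Because the stages run concurrently, comparing the elapsed times shows which stage dominates.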
For reference, we can measure a few of these steps to get a ballpark estimate of what we should be seeing (none of this is science). On my local machine (macbook pro w/SSD), I got the following as bandwidth numbers for various steps:
$ time curl -vv -H'Content-Type: application/octet-stream' --data-binary @<(tar cf - -C ~/go/src/github.com/docker/docker .) -XPUT $UPLOAD_URL
Note that rather than using random data, I used real data from a 2 GB checkout of docker to make the tar process even slower than it would be in the example provided here. It includes a generated data.img, similar in size to the file used in the original examples.

While I agree that push should be faster, expecting >300 MB/s is very optimistic. With an unloaded and unlimited network and disk, 100 MB/s is more realistic (and very close to the original numbers I got when doing V2 protocol tests), depending on the machine.
Let’s also get more realistic push numbers with docker on my local box (Docker for Mac, with 1.12.3, notoriously slow fs performance). I have a local image of the docker-dev build, clocking in at ~3 GB, that took roughly 6:36 to push to a registry running in Docker for Mac. We get about 8 MB/s. The same push to Hub took about the same amount of time. With a different machine, also SSD, however, I get ~18 MB/s when pushing from local docker to a local registry. Clearly, the disk has an effect, but we also have unexplained performance degradation.

Given the above, we can start to really see the pipeline affecting actual bandwidth, but we can also see where the bottleneck lies. For the most part, compression and hashing are fixed on an unloaded machine. This leaves the problem at the disk or on the network. Since your push test is going to a local registry instance, we can assume the network is around 125 MB/s or more, so it is likely that disk speed is what is affecting the performance. The test data from my local machine confirms this, but also casts doubt on the theoretical limits of push performance.
TL;DR: We can make a few conclusions from the data and the analysis above: push bandwidth tracks the slowest pipeline stage, and on this machine the numbers point at the storage setup rather than the network. It would help to re-run the build/push test with the data placed under /var/lib/docker, reading directly from the same filesystem as docker. Having this data point will tell us if adjusting your disk setup can help with performance.

Either way, this warrants more investigation. We’ll have to instrument each pipeline stage to see where the actual bottleneck is.
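One way to collect that data point, sketched here with an illustrative scratch directory (needs root; dropping the page cache makes sure the read actually hits the disk):
$ sudo mkdir -p /var/lib/docker/tmp-bench
$ sudo cp testimage/data.img /var/lib/docker/tmp-bench/
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ sudo dd if=/var/lib/docker/tmp-bench/data.img of=/dev/null bs=1M
$ sudo rm -r /var/lib/docker/tmp-bench
dd reports the sustained read bandwidth for that filesystem at the end.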
@stevvooe I see. I really don’t mind that too much. Disk space is cheap on the registry for what I use it for.
I’m not familiar with the Docker codebase, but I have some familiarity with content addressable systems in general. In my experience, generating a hash of compressed data for content addressability can result in an eventual mess that is not related to performance. You are effectively locking yourself to a particular compression library version/compression settings/etc.
While we have a guarantee that hashing the uncompressed data will give the same result on any system, we don’t have a guarantee that the compressed version will have the same SHA everywhere and every time (as long as it decompresses into the same thing, it’s all fine). Obviously it’s likely that it IS the same, given that gzip is pretty stable.
Using the sha of the uncompressed data for content addressability is a much better long term approach, IMO, even if compressed layers are supported.
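To make that concrete: the digest of the uncompressed file is stable, while the digest of the compressed stream also depends on the compressor, its settings and its version (a sketch using the data.img from the report; the pigz line only applies if it happens to be installed):
$ sha256sum data.img
$ gzip -n -1 -c data.img | sha256sum
$ gzip -n -9 -c data.img | sha256sum
$ pigz -n -c data.img | sha256sum
The first digest is identical on any system; the compressed ones are only guaranteed to decompress back to the same content, not to match each other.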
@thaJeztah thank you! I gave it a try and it is a lot faster; the push finished before I even noticed. When do you think this feature will move out of beta and become generally available?
The solution might benefit #26959
We’re working on completing the containerd image-store integration, which switches docker to the image store and snapshotters provided by containerd.
With the containerd store integration, compressed (tar.gz) layers are persisted on disk (instead of constructed when pushing); this allows pushes to be reproducible (compression is not 100% reproducible), but at the cost of additional storage space required. If you have a system to test on, you can enable it in current versions of docker; see https://docs.docker.com/storage/containerd/#containerd-image-store-with-docker-engine
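For reference, a minimal sketch of enabling it on a test host, per the linked docs (assumes it is acceptable to edit /etc/docker/daemon.json and restart the daemon; images in the old store are not visible until you switch back):
$ cat /etc/docker/daemon.json
{
  "features": {
    "containerd-snapshotter": true
  }
}
$ sudo systemctl restart docker
$ docker info | grep -i driver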
@thetredev Got it.
Nevertheless, this would be a very good performance gain if it is really 123x faster with pgzip, as claimed. I hope the Moby team will raise the priority of this ticket.
@Benjamin-Tan I would imagine: figuring out how to build moby, the Docker engine, and the CLI. Not sure if the Docker engine is open source (it’s free though), so you might be able to replace the Moby binaries from the engine with those you built yourself.
I guess I’m not the right person to ask, though. It could very likely be the case that you have to build containerd instead of Moby, with the gzip patch applied to either Moby or containerd (I don’t know how they work together, if at all). But if it’s only containerd, then it’s easier, because at least on Linux the Docker CLI uses containerd, AFAIK. Again, I'm not sure what role Moby plays in all of this. It might be best to read through the containerd project first and go from there.
@thetredev How would one go about using pgzip instead of gzip while waiting for the Moby team to pick this up?
Still an issue with Docker 24.0.5. I second @umialpha.
@da2018 various reasons can be found in the docs here; https://docs.docker.com/storage/storagedriver/device-mapper-driver/#device-mapper-and-docker-performance
Unless you have a very specific reason, I’d recommend using the default (overlay2) storage driver. The device mapper storage driver was created when RHEL/CentOS kernels did not yet support overlayfs or other alternatives, but now that they do, overlay2 is used on those platforms as well. For those reasons, the device mapper storage driver is marked deprecated; there are no immediate plans to remove it, but other than some maintenance if needed, no further development is planned.

Also here: a 3 min (almost fully cached) build that takes up to 30 min to push. It’s weird.
@Tomasu I am currently working on the option to enable saving of artifacts. Give containerd a try on a pull-push round trip and see if that impacts the push time and we can adjust. It is not quite production ready yet, but it will be a good data point to ensure the approach is right.
@groman2 I see you are very familiar with this trade-off. The actual goal with this design was to make the storage layers agnostic to what content is being stored. When we introduced content addressability, the decision had already been made that layers are compressed artifacts, which was affirmed in the presence of possible, though improbable, attacks involving the processing of unverified content.
The root issue here is the expectation that a round trip of content through serialization layers provides a stable hash calculation. In your example, only data.img is considered. However, the round trip may include json generation, consistent tar creation (see tar-split) and consistent compression. Whether you have a choice of compression or not isn’t particularly important when the round tripping of artifacts can introduce hash instability. In particular, bugs within any of these components can create problems with the round trip. Generating the artifact once usually assuages these problems and is very much a viable solution in this case.

Put in your terms, I expand the warning of “generating a hash of compressed data for content addressability can result in an eventual mess” to the stronger “generating hash-stable round-trip serialization for content addressability will result in an eventual mess”. 😉
The implication here is that compression is actually a transport concern, especially since we do it on every push and pull. Going back to our bandwidth pipeline analysis, if the compression bandwidth is higher than the connection bandwidth, compression reduces the overall transfer time (there is a relationship that can be worked out mathematically). For most internet connections and CPU sizes, gzip is adequate because it can keep the pipe full. When the CPU is too small (lower compression bandwidth) or the connection is too fast (like in this case), the low compression bandwidth starves the network connection, resulting in a slow push. Because the performance of compression varies with environmental conditions, it should probably be separated from content identity.
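Back-of-the-envelope, for a cut-through pipeline the relationship is roughly:
transfer time ≈ layer size / min(disk, tar, compression, hash, network bandwidth)
Plugging in the numbers from this thread: a ~1 GB incompressible layer, gzip somewhere around 10-25 MB/s and a localhost link well above 125 MB/s gives 1 GB / (10-25 MB/s) ≈ 40-100 s, no matter how fast the disks and network are, which is in the same ballpark as the ~100 s push in the original report.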
I hope my statements above give you the feeling not that I disagree with you, but rather that I am affirming your intuition with experience we’ve had in the deployment of content addressability in Docker.
Given the above, there are two paths:
1. Decouple compression from content identity and phase out compressed layers.
2. Generate the compressed artifact once and reuse it on subsequent pushes and pulls.
Both can be done, but we can favor 2 before 1, since there will be less impact. The takeaway is that I am going to pursue these much less casually than before, since we actually have some theory backing it up.
@groman2 Perfect! This data helps out a lot with the analysis. I think we can eliminate the hypothesis that your “disks are slow”.
Leaving gzip out of the analysis above was a HUGE omission. Apologies for the oversight. I’ve confirmed the bandwidth numbers of gzip, getting anywhere from 10-40 MB/s, even with pigz, on my local machine (I only have a few cores, so this makes sense).
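Those numbers are straightforward to reproduce on a Linux box like the one in the report (a sketch; incompressible input is the worst case for deflate, and the pigz line only applies if it is installed):
$ dd if=/dev/urandom of=rand.img bs=1M count=1024
$ time gzip -c rand.img > /dev/null
$ time pigz -c rand.img > /dev/null
Dividing 1 GB by the elapsed time gives the compression bandwidth in each case.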
For docker, parallel gzip presents some serious performance issues. Most benchmarks assume that the process will have the whole machine at its disposal. However, with applications running and other layers being pushed/pulled concurrently, many of these parallel gains will be diminished. We really need to optimize for single-threaded performance, and gzip seems to be a massive bottleneck.
Looking back, gzip compression has always been a controversial topic. Check out https://github.com/docker/docker/issues/10181 and https://github.com/docker/docker/pull/17674. The approach has previously always been about optimizing gzip, usually at the cost of more resources. Unfortunately, none of those analyses approached the problem from a theoretical pipeline perspective.
The main reason we cannot just turn gzip on and off is that the compression is part of the content addressable artifact. If you disable compression, it will change the content addresses of existing content.
We’ll have to make a decision as a project and have a plan for phasing out compressed layers. This will get rid of a major bottleneck and allow operators to choose a fitting transport compression that works well in their infrastructure.
Another option is to compress each layer once, rather than on every single push and pull.