moby: Docker is incredibly slow to start many containers in parallel even on a many-core machine
Description
On a many-core machine (Ryzen 9 3950X in my case), starting 32 docker containers simultaneously results in the time needed to start any one of them increasing 10-20 fold.
Steps to reproduce the issue:
1. On a multi-core machine, start a number of docker containers running /bin/true (which does nothing but exit with status 0, so the time needed to run the command itself should be negligible) in parallel. This can be done with GNU Parallel as follows:
parallel -j32 -N0 docker run --rm alpine /bin/true ::: {1..100}
In this case I'm starting 32 at a time with -j32, as I'm on a 32-thread machine.
2. Immediately thereafter, in another terminal, run /bin/true in another docker container and measure the time it takes to complete (a combined sketch of these steps is shown below this list):
time docker run --rm alpine /bin/true
3. After all containers from step 1 have exited, run the command from step 2 again and compare the times.
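For convenience, the steps above can be combined into something like the following (a rough sketch, assuming GNU Parallel is installed; exact numbers will of course vary):
# Step 1: start 32 containers at a time in the background...
parallel -j32 -N0 docker run --rm alpine /bin/true ::: {1..100} &
sleep 1   # give the parallel batch a moment to get going
# Step 2: ...then time a single container started while the batch is running.
time docker run --rm alpine /bin/true
# Step 3: once the batch has drained, re-run the timed command for comparison.
wait
time docker run --rm alpine /bin/true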
In my case, when running /bin/true
with 32 other containers starting at the same time, I got:
real 0m7.399s
user 0m0.024s
sys 0m0.019s
When running without the other containers starting at the same time, I got:
real 0m0.603s
user 0m0.030s
sys 0m0.015s
Describe the results you received:
Time needed in step 2 was vastly greater than time needed in step 3.
Describe the results you expected:
Times in steps 2 and 3 should be comparable, given that the container creation processes can be scheduled on different cores.
Additional information you deem important (e.g. issue happens only occasionally):
Suspecting that dockerd may be serializing around a lock or something of the sort, I generated two off-CPU flame graphs of the dockerd process using tools from Brendan Gregg. They can be found zipped here: flamegraphs.zip. offcpu-dockerd-parallel.svg covers docker being used to start 32 containers in parallel, and offcpu-dockerd-sequential.svg covers docker being used to start 32 containers sequentially. As can be seen in the parallel case, dockerd spends a very large amount of time blocked in the open syscall (see the right-hand side of the graph), specifically while waiting for the other end of a FIFO to be opened. This is not present in the sequential case.
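For reference, the exact commands used to generate the graphs aren't recorded above; a rough equivalent using Brendan Gregg's off-CPU tooling (assuming bcc's offcputime and the FlameGraph scripts are available; on Debian the former may be installed as offcputime-bpfcc) would be:
# Collect folded off-CPU stack traces for dockerd for 60 seconds...
sudo offcputime -df -p "$(pgrep -nx dockerd)" 60 > dockerd.stacks
# ...and render them as an off-CPU flame graph.
./FlameGraph/flamegraph.pl --colors=io --countname=us < dockerd.stacks > offcpu-dockerd.svg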
Running lsof on dockerd during the parallel test, I consistently got output resembling the following:
$ sudo lsof -p `pgrep -nx dockerd` | grep FIFO
dockerd 10755 root 26u FIFO 0,23 0t0 77979 /run/docker/containerd/59335fc9195f9e4a23ae8ad6ac3b5539ba4f784053bc5dc2305731f405ad18b2/init-stdout
dockerd 10755 root 31u FIFO 0,23 0t0 78015 /run/docker/containerd/e4869e1ea909edf9e4fe29a148014ac0d24b437dfe8e64b0b0aac74665f64bb4/init-stdout
dockerd 10755 root 49u FIFO 0,23 0t0 77980 /run/docker/containerd/59335fc9195f9e4a23ae8ad6ac3b5539ba4f784053bc5dc2305731f405ad18b2/init-stderr
dockerd 10755 root 114u FIFO 0,23 0t0 78003 /run/docker/containerd/807fc9cef8a4834e3a6f671837e54c3bb3afad8ccbd0b491716c44ca08dba2fe/init-stdout
dockerd 10755 root 117u FIFO 0,23 0t0 78004 /run/docker/containerd/807fc9cef8a4834e3a6f671837e54c3bb3afad8ccbd0b491716c44ca08dba2fe/init-stderr
dockerd 10755 root 132u FIFO 0,23 0t0 78016 /run/docker/containerd/e4869e1ea909edf9e4fe29a148014ac0d24b437dfe8e64b0b0aac74665f64bb4/init-stderr
i.e., the FIFOs being waited on seem to be /run/docker/containerd/<container_id>/init-{stdout,stderr}.
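The blocking behaviour itself is standard FIFO semantics: open(2) on a FIFO does not return until the other end has also been opened. A minimal, Docker-free illustration (the path here is made up for the example):
mkfifo /tmp/demo-fifo
( time cat /tmp/demo-fifo ) &   # the reader blocks inside open() until a writer shows up
sleep 2                         # no writer yet, so the reader stays blocked
echo hello > /tmp/demo-fifo     # opening the write end unblocks the reader
wait                            # "real" is ~2s, nearly all of it spent blocked in open()
rm /tmp/demo-fifo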
Some googling led me to https://github.com/moby/moby/issues/29369, which seems to describe a very similar situation. This comment is of particular interest, as it revealed that containerd only allows 10 containers to be started in parallel at once, putting any further requests in a queue. That might have been the answer to this issue; however, repeating the reproduction steps above with only 9 containers started in parallel (in step 1) resulted in a similar, albeit slightly less extreme, performance hit (~1.8s vs ~0.7s). Additionally, given that the comment is 5 years old, I'm not sure it's still accurate regarding containerd's limits.
I'm not entirely sure whether this is expected behaviour or not. It would of course be expected if there were a contended resource shared between all 32 container startups that had to be locked for sequential access. However, I'm not familiar enough with the docker/containerd codebases to know whether such a resource exists. If it does, I'd be very grateful if someone could enlighten me to its existence 😛
Output of docker version:
Client: Docker Engine - Community
Version: 20.10.3
API version: 1.41
Go version: go1.13.15
Git commit: 48d30b5
Built: Fri Jan 29 14:33:25 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.3
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 46229ca
Built: Fri Jan 29 14:31:38 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Output of docker info:
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 28
Server Version: 20.10.3
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.10.0-0.bpo.3-amd64
Operating System: Debian GNU/Linux 10 (buster)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 31.33GiB
Name: desktop
ID: MJDC:OLOZ:H7N2:S5L2:LO2Z:MOSP:6OUT:NVQZ:EI4N:6ZF7:T2V3:HDDH
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 22
Goroutines: 34
System Time: 2021-02-27T20:48:48.951671583-08:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No blkio weight support
WARNING: No blkio weight_device support
Additional environment details (AWS, VirtualBox, physical, etc.):
Running on a Debian 10 workstation. Output of uname -a:
Linux desktop 5.10.0-0.bpo.3-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 (2021-02-11) x86_64 GNU/Linux
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 11
- Comments: 15 (7 by maintainers)
I believe I might be experiencing the same issue when launching many containers in parallel in a VM on Google Compute Engine (GCE). I have tried to reproduce it on multiple bare-metal and non-GCE virtual machines, but so far GCE is the only environment where it shows up, and there I have been able to reproduce it consistently on multiple different VMs.
Inspired by @GunshipPenguin, I ran a very similar experiment:
The reason I pause the program for 1 second between invocations is so that the container schedule is "sustainable". That is, since a single docker run --rm alpine /bin/true command takes only about 0.5 seconds, it should be possible for consecutive containers to run with very little contention, especially considering that the machine has many cores and more than enough memory. The GCE VM this experiment is run on is of type e2-highcpu-32 (32 vCPUs, 32 GB memory). As the "highcpu" type suggests, I am using a sufficiently powerful machine; during the experiments, CPU usage never exceeded 25%.
In the following diagram I have plotted the response times (i.e., the elapsed time between calling docker run and the container finishing) of each docker run command. At the beginning, the response times are very consistent and low. However, after the 42nd container they increase drastically. If I reduce the sleep time in the aforementioned experiment, the drastic increase in response time happens for earlier containers as well. E.g., if I set the sleep time to 0.5 sec, the 28th container is already affected by the high response time.
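The original measurement script isn't reproduced above, but the setup described can be sketched roughly as follows (one docker run launched in the background per second, with each run's wall-clock time appended to a log; the script and the responses.log filename are made up for illustration):
#!/bin/bash
# Launch one container per second in the background and record how long each
# docker run takes end to end.
for i in $(seq 1 100); do
  (
    start=$(date +%s.%N)
    docker run --rm alpine /bin/true > /dev/null
    end=$(date +%s.%N)
    awk -v i="$i" -v s="$start" -v e="$end" \
        'BEGIN { printf "%d %.3f\n", i, e - s }' >> responses.log
  ) &
  sleep 1
done
wait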
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
Running on a GCE VM of type e2-highcpu-32 (32 vCPUs, 32 GB memory). Output of uname -a:
Linux totoro 5.8.0-1032-gcp #34~20.04.1-Ubuntu SMP Wed May 19 18:19:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux