moby: Docker is incredibly slow to start many containers in parallel even on a many-core machine

Description

On a many-core machine (Ryzen 9 3950X in my case), starting 32 docker containers simultaneously results in the time needed to start any one of them increasing 10-20 fold.

Steps to reproduce the issue:

1. On a multi-core machine, start a bunch of docker containers running /bin/true (which does nothing but exit with status 0, so the time needed to run the command should be negligible) in parallel. This can be done with GNU Parallel as follows:

parallel -j32 -N0 docker run --rm alpine /bin/true ::: {1..100}

In this case I’m starting 32 at a time with -j32, as I’m on a 32-thread machine. (A variant that adapts the job count to the machine’s thread count is sketched after the steps.)

2. Immediately thereafter, in another terminal, run /bin/true in another docker container and measure the time it takes to complete:

time docker run --rm alpine /bin/true

3. After all containers from step 1 have exited, run the command from step 2 again and compare the times.
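
As mentioned in step 1, a variant of the parallel start that adapts the job count to the machine’s thread count (a sketch, assuming coreutils’ nproc is available) would look like this:

# use one parallel job per hardware thread instead of hard-coding 32
parallel -j"$(nproc)" -N0 docker run --rm alpine /bin/true ::: {1..100}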

In my case, when running /bin/true with 32 other containers starting at the same time, I got:

real    0m7.399s
user    0m0.024s
sys     0m0.019s

When running without the other containers starting at the same time, I got:

real    0m0.603s
user    0m0.030s
sys     0m0.015s

Describe the results you received:

Time needed in step 2 was vastly greater than time needed in step 3.

Describe the results you expected:

Times in steps 2 and 3 are comparable, given that the container creation processes should be scheduled on different cores.

Additional information you deem important (e.g. issue happens only occasionally):

Thinking dockerd may be serializing around a lock or something of the sort, I generated two off-CPU flame graphs of the dockerd process using tools from Brendan Gregg. They can be found zipped here: flamegraphs.zip. offcpu-dockerd-parallel.svg covers docker being used to start 32 containers in parallel, and offcpu-dockerd-sequential.svg covers docker being used to start 32 containers sequentially. As can be seen in the parallel case, dockerd spends a very large amount of time blocked in the open syscall (see the right-hand side of the graph), specifically while waiting for the other end of a FIFO to be opened. This is not present in the sequential case.
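
Roughly, the workflow for generating such off-CPU flame graphs looks like this (a sketch assuming bcc’s offcputime and the FlameGraph scripts are installed; the exact tool path and name vary by distro, e.g. offcputime-bpfcc on Debian):

# record off-CPU (blocked) time of dockerd for 60 seconds as folded stacks
sudo /usr/share/bcc/tools/offcputime -df -p "$(pgrep -nx dockerd)" 60 > dockerd.stacks

# render the folded stacks as an off-CPU flame graph
./FlameGraph/flamegraph.pl --color=io --countname=us --title="dockerd off-CPU" < dockerd.stacks > offcpu-dockerd-parallel.svg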

Running lsof on dockerd during the parallel test, I consistently got output resembling the following:

$ sudo lsof -p `pgrep -nx dockerd` | grep FIFO
dockerd 10755 root   26u     FIFO               0,23       0t0      77979 /run/docker/containerd/59335fc9195f9e4a23ae8ad6ac3b5539ba4f784053bc5dc2305731f405ad18b2/init-stdout                                      
dockerd 10755 root   31u     FIFO               0,23       0t0      78015 /run/docker/containerd/e4869e1ea909edf9e4fe29a148014ac0d24b437dfe8e64b0b0aac74665f64bb4/init-stdout                                      
dockerd 10755 root   49u     FIFO               0,23       0t0      77980 /run/docker/containerd/59335fc9195f9e4a23ae8ad6ac3b5539ba4f784053bc5dc2305731f405ad18b2/init-stderr                                      
dockerd 10755 root  114u     FIFO               0,23       0t0      78003 /run/docker/containerd/807fc9cef8a4834e3a6f671837e54c3bb3afad8ccbd0b491716c44ca08dba2fe/init-stdout                                      
dockerd 10755 root  117u     FIFO               0,23       0t0      78004 /run/docker/containerd/807fc9cef8a4834e3a6f671837e54c3bb3afad8ccbd0b491716c44ca08dba2fe/init-stderr                                      
dockerd 10755 root  132u     FIFO               0,23       0t0      78016 /run/docker/containerd/e4869e1ea909edf9e4fe29a148014ac0d24b437dfe8e64b0b0aac74665f64bb4/init-stderr 

i.e. the FIFOs being waited on seem to be /run/docker/containerd/<container_id>/init-{stdout,stderr}.
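
While the parallel test is in flight, it is possible to check which processes hold these FIFOs open (a sketch; the paths only exist while containers are starting):

# list every process that has one of the init FIFOs open; if only dockerd
# shows up, the shim/runc side has not opened its end yet
sudo lsof /run/docker/containerd/*/init-stdout /run/docker/containerd/*/init-stderr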

Some googling led me to https://github.com/moby/moby/issues/29369, which seems to describe a very similar situation. This comment is of particular interest, as it revealed that containerd only allows 10 containers to be started in parallel at once, putting any further requests in a queue. This might have provided an answer to this issue; however, repeating the reproduction steps above with only 9 containers started in parallel in step 1 resulted in a similar (albeit slightly less extreme) performance hit (~1.8s vs ~0.7s). Additionally, given that the comment is 5 years old, I’m not sure whether it is still accurate regarding containerd’s limits.
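
For completeness, the 9-container variant of step 1 was along these lines (same GNU Parallel invocation, just with a lower job count):

parallel -j9 -N0 docker run --rm alpine /bin/true ::: {1..100}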

I’m not entirely sure if this is expected behaviour or not. It would of course be expected behaviour if there were a contended resource shared between all 32 container startups that had to be locked for sequential access. However, I’m not familiar enough with the docker/containerd codebases to know whether such a resource exists. If it does, I’d be very grateful if someone could enlighten me as to its existence 😛

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.3
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        48d30b5
 Built:             Fri Jan 29 14:33:25 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.3
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       46229ca
  Built:            Fri Jan 29 14:31:38 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 28
 Server Version: 20.10.3
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.10.0-0.bpo.3-amd64
 Operating System: Debian GNU/Linux 10 (buster)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 31.33GiB
 Name: desktop
 ID: MJDC:OLOZ:H7N2:S5L2:LO2Z:MOSP:6OUT:NVQZ:EI4N:6ZF7:T2V3:HDDH
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 22
  Goroutines: 34
  System Time: 2021-02-27T20:48:48.951671583-08:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio weight support
WARNING: No blkio weight_device support

Additional environment details (AWS, VirtualBox, physical, etc.):

Running on a Debian 10 workstation. Output of uname -a:

Linux desktop 5.10.0-0.bpo.3-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 (2021-02-11) x86_64 GNU/Linux

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 11
  • Comments: 15 (7 by maintainers)

Most upvoted comments

I believe I might be experiencing the same issue when launching many containers in parallel in a VM on Google Compute Engine (GCE). I have tried to reproduce the issue on multiple bare-metal and non-GCE virtual machines, but so far I have only been able to trigger it on GCE, where it reproduces consistently on multiple different VMs.

Inspired by @GunshipPenguin, I ran a very similar experiment:

for i in {1..100}
do
   # start each container in the background and time it individually
   time docker run --rm alpine /bin/true &
   # pause 1 second before launching the next container
   sleep 1
done

The reason I pause the program for 1 second between invocations is so that the container schedule is “sustainable”. That is, since a single docker run --rm alpine /bin/true command takes only about 0.5 seconds, it should be possible for consecutive containers to run with very little contention, especially considering that the machine has many cores and more than enough memory. The GCE VM this experiment was run on is of type e2-highcpu-32 (32 vCPUs, 32 GB memory). As the “highcpu” type suggests, it is a sufficiently powerful machine; during the experiments, CPU usage never exceeded 25%.

In the following diagram I have plotted the response time (i.e., the elapsed time between calling docker run and the container finishing) of each docker run command. At the beginning, the response times are very consistent and low. However, after the 42nd container they increase drastically.

[Plot: gce_docker_containers_sequence — response time of each docker run invocation]

If I reduce the sleep time in the aforementioned experiment, the drastic increase in response time happens for earlier containers as well. E.g., if I set the sleep time to 0.5 sec, the 28th container is already affected by the high response time.
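
For clarity, the 0.5 s variant I refer to is simply the same loop with a shorter pause:

for i in {1..100}
do
   time docker run --rm alpine /bin/true &
   sleep 0.5
done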

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 11:56:38 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:54:50 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 9
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.8.0-1032-gcp
 Operating System: Ubuntu 20.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 31.36GiB
 Name: totoro
 ID: V66F:SFDV:FFFT:KJWC:VLT7:E5H2:HFUG:QG3F:HFZA:UYGZ:NFX2:TYGG
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Running on GCE VM e2-highcpu-32 (32 vCPUs, 32 GB memory). Output of uname -a: Linux totoro 5.8.0-1032-gcp #34~20.04.1-Ubuntu SMP Wed May 19 18:19:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux