moby: Concurrently starting many containers from same checkpoint leads to error: "context deadline exceeded"

Description

I’m using checkpoint/restore to restore multiple containers from the same checkpoint. Sometimes these restores happen concurrently. In particular, as part of our bootstrap process, we restore 45 containers from the same checkpoint when a new EC2 machine is launched; these restores happen concurrently. What I’m finding is that doing many concurrent restores usually leads to a number of these restores failing. The failure mode is:

"Error response from daemon: shim error: containerd-shim: opening /var/run/docker/libcontainerd/f7bbb5a85645fb1d34b2ab4966b2893c4507c900d77e0104515451f440181d29/init-stdout failed: context deadline exceeded"

In the daemon logs, I always see several occurrences of the following error prior to any occurrences of the “context deadline exceeded” error:

level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/go/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/go/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086"

The application I’m running in my container has a large memory footprint (~1.1GB when idle), so it’s not too surprising that a few dozen concurrent restores of this application lead to some kind of issue. In fact, I’ve confirmed that adding sleep calls between the restores in our bootstrap prevents the issue from occurring, so resource overload/exhaustion is definitely at least part of this. However, I was curious whether there’s anything to be done on the Docker/criu side to better support use cases like this.
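
The staggering looks roughly like this (a minimal sketch, not our actual bootstrap code; the container names, checkpoint name/dir, and sleep duration are placeholders):

  # Kick off each restore, but pause before launching the next one.
  for name in cr1 cr2 cr3; do
    docker start --checkpoint test_cp --checkpoint-dir /var/lib/docker/checkpoints "$name" &
    sleep 2
  done
  wait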

I’ve been looking into how memory and I/O usage play into this. When I tried doing 45+ concurrent restores of a simple Ubuntu container running a bash loop, I was unable to reproduce the issue. I have, however, been able to reproduce it with a simple C program, run in a container, that just allocates several hundred MB of memory. The issue doesn’t seem to be caused by exhausting the host’s memory: measuring memory usage on the host during the bootstrap of our actual application, the “context deadline exceeded” error starts appearing while there is still 90GB of free memory on the host, and experiments with the toy container support this as well.

As for I/O, that seemed a more likely culprit, since we’ve had issues before with being throttled for doing too much I/O on our EC2 instance’s root device. However, I/O throttling doesn’t appear to be the problem here either. I tried pointing the exec-root, Docker root (“graph”), and checkpoint-dir all at locations on the instance’s ephemeral storage, which isn’t subject to AWS throttling and should perform better, and the issue neither went away nor improved. I also tried increasing the IOPS provisioned for the root device from 100 to 1000, which made no noticeable difference.
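
For reference, moving those locations just means the standard daemon and checkpoint flags, e.g. (a sketch; the ephemeral-storage paths are placeholders):

  # Daemon state and image storage moved to ephemeral storage (1.13-era flags)
  dockerd --exec-root=/mnt/ephemeral/docker-exec --graph=/mnt/ephemeral/docker
  # Checkpoint written to (and later restored from) ephemeral storage as well
  docker checkpoint create --checkpoint-dir /mnt/ephemeral/checkpoints cr test_cp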

The other thing I’ve noticed is that the number of concurrent restores is a factor in its own right, independent of, e.g., memory consumption. Playing with the size of my toy C program and the number of concurrent restores, I’ve found that the latter matters perhaps more than the former for reproducing the “context deadline exceeded” issue. With a 1.1GB application running in the container, 100 concurrent restores rarely lead to any failures (110GB of total memory usage). However, 180 concurrent restores of a 550MB application lead to at least 16 failed restores (99GB total), and 250 concurrent restores of a 300MB application lead to at least 44 failed restores (just 75GB total). It seems like concurrent restores don’t scale quite as well as perhaps they should.

Although in my experiments I’ve often done 100-200+ concurrent restores to reproduce the issue, keep in mind that with our actual application, the issue is seen when doing just 45 concurrent restores, which doesn’t seem to be an unreasonably high number.

One last thing: while running these experiments, in addition to the error above, I’ve also occasionally seen the following criu error: “Error (criu/mount.c:414): mnt: No root found for mountpoint 978 (@./sys/fs/pstore)”. The actual mountpoint flagged varies, but the message is the same.

I’m attaching the docker daemon logs from the bootstrapping of my actual application, as well as the sample C program, Dockerfile, and shell script I used to reproduce the issue more simply. docker_cr_concurrent_restores.zip

Steps to reproduce the issue:

  1. docker build -t test:latest .
  2. docker run -d --security-opt=seccomp:unconfined --name cr test
  3. docker checkpoint create --checkpoint-dir /var/lib/docker/checkpoints cr test_cp
  4. ./stress_cr.sh 250
  5. Vary the number of concurrent restores (argument to stress_cr.sh) and the application size in lots_of_memory.c
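
For reference, stress_cr.sh boils down to something like the following (a hypothetical reconstruction; the real script is in the attached zip, and the names here are only illustrative). It creates N containers from the test image and then kicks off all N restores from the shared checkpoint at once:

  #!/bin/bash
  # Number of concurrent restores (first argument, defaulting to 250)
  N=${1:-250}
  # Create N stopped containers from the test image
  for i in $(seq 1 "$N"); do
    docker create --security-opt=seccomp:unconfined --name "cr$i" test
  done
  # Restore all of them concurrently from the same checkpoint
  for i in $(seq 1 "$N"); do
    docker start --checkpoint test_cp --checkpoint-dir /var/lib/docker/checkpoints "cr$i" &
  done
  wait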

Describe the results you received: Many restores fail with “context deadline exceeded”

Describe the results you expected: Restores succeed

Output of docker version:

Client:
 Version:      1.13.0-rc3
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   4d92237
 Built:        Tue Dec  6 01:15:44 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0-rc3
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   4d92237
 Built:        Tue Dec  6 01:15:44 2016
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 35
Server Version: 1.13.0-rc3
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 51371867a01c467f08af739783b8beafc154c4d7
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.26-040426-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 120.1 GiB
Name: ip-10-97-5-247
ID: TEID:T5YB:6YAX:E4VG:4QTD:QQQ3:RWRS:6KQU:ALGA:SQTT:TPS6:GZ2I
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.): AWS EC2. Using r3.4xlarge instance type. 20GB root device with 100 IOPS. Also tried using 1000 IOPS.

Have tried using criu version 2.8 and 2.9.

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 26 (23 by maintainers)

Most upvoted comments

@tswift242 I think we figured it out. Here is our assumption:

I had forgotten, but containerd only allows starting 10 containers at a time; the others are queued. Since docker doesn’t account for this, the fifo opening on its side times out before containerd-shim even has time to start. This is the cause of the stream copy error. Later, when containerd-shim finally starts, it times out on its side because docker no longer has the other end open. This shows up here because restoring from a checkpoint slows down the whole container start quite a bit.

As a workaround, could you update your system to issue only 10 restore requests at a time? That should prevent the issue without too big of a performance cost.
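
Concretely, capping the number of in-flight restores at 10 could look something like this sketch (reusing the container/checkpoint names from the repro steps above, which are placeholders):

  # Restore 45 pre-created containers, at most 10 at a time,
  # to stay within containerd's start queue.
  seq 1 45 | xargs -P 10 -I{} \
    docker start --checkpoint test_cp --checkpoint-dir /var/lib/docker/checkpoints cr{}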