moby: Container HEALTHCHECKs can lead to hanging API calls and bad State
Description
Occasionally the Docker daemon stops responding to interactions with containers that have HEALTHCHECKs. The problem presents itself in several older versions of Docker, in the latest packaged versions on Ubuntu, and in v17.12.0-ce on Amazon Linux.
By our observations, this appears to be a race condition that results in a deadlock, preventing further calls against affected containers.
This issue may be related to https://github.com/moby/moby/issues/35933. In any case, I'm bisecting the releases (using https://github.com/docker/docker-ce) with this repro to narrow the problem down further.
Steps to reproduce the issue:
A repro case has been built and run against several versions of Docker, reliably triggering the bug after a few rounds of execution (I recommend 10-20 rounds to tickle the bug). There likely isn't anything specific about using 2 containers, but that setup has been reliably triggering the bug in this test.
- Build a container image with a HEALTHCHECK defined (`echo hello` every `1s`)
- Start 2 containers using the image
- Wait some time (`10s` in our test)
- Stop the containers
- Inspect the containers
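The steps above can be sketched roughly as follows. This is an assumed reconstruction, not the original repro runner: the Dockerfile contents are a guess from the description, and the image tag `docker-poke:healthchecks` is taken from the `docker ps` output below.

```shell
#!/bin/sh
# Assumed Dockerfile: a long-running container with a 1s HEALTHCHECK
cat > Dockerfile <<'EOF'
FROM busybox
HEALTHCHECK --interval=1s CMD echo hello
CMD ["sh", "-c", "sleep 30m"]
EOF

docker build -t docker-poke:healthchecks .

# Start 2 containers using the image
c1=$(docker run -d docker-poke:healthchecks)
c2=$(docker run -d docker-poke:healthchecks)

# Wait some time, then stop and inspect
sleep 10
docker stop "$c1" "$c2"
docker inspect "$c1" "$c2"   # hangs here when the bug triggers
```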
Describe the results you received:
Started containers appear to continue running and to be healthy despite being non-responsive.
```
ubuntu@ip-172-31-37-156:~$ docker ps
CONTAINER ID   IMAGE                      COMMAND               CREATED          STATUS                    PORTS   NAMES
0cf518c205f7   docker-poke:healthchecks   "sh -c 'sleep 30m'"   17 minutes ago   Up 17 minutes (healthy)           sad_hugle
ubuntu@ip-172-31-37-156:~$ docker inspect 0cf518c205f7
^C
```
Additionally, the output of `docker ps` will continue reporting that the container is still up and running even though the process will exit after 30m (it was started with `sleep 30m`).

```
0cf518c205f7   docker-poke:healthchecks   "sh -c 'sleep 30m'"   37 minutes ago   Up 37 minutes (healthy)           sad_hugle
```
Describe the results you expected:
I expected that I would be able to inspect this container.
```
docker inspect 0cf518c205f7
{
...
}
```
Additional information you deem important (e.g. issue happens only occasionally):
This issue is readily made apparent with a few concurrent runs, but otherwise lies dormant even with many serial runs.
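Since the bug only surfaces under concurrency, a round-based driver along these lines can be used to trigger it. This is a sketch: `repro.sh` is a hypothetical script standing in for the actual repro runner, which wraps the steps listed above.

```shell
#!/bin/sh
# Launch several rounds of the repro concurrently; 10-20 rounds
# are usually enough to surface the hang.
for i in $(seq 1 20); do
  ./repro.sh &   # hypothetical wrapper around build/run/stop/inspect
done
wait
```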
Output of `docker version`:

```
Client:
 Version:	17.12.1-ce
 API version:	1.35
 Go version:	go1.9.4
 Git commit:	7390fc6
 Built:	Tue Feb 27 22:17:40 2018
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.1-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.4
  Git commit:	7390fc6
  Built:	Tue Feb 27 22:16:13 2018
  OS/Arch:	linux/amd64
  Experimental:	false
```
Output of `docker info`:

```
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 17.12.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 4
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1052-aws
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.625GiB
Name: ip-172-31-37-156
ID: L4W4:V4WA:OHSS:QTGL:DRJG:32GX:7DKK:FFLO:WKR2:IJYV:NKDG:GWRA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
```
Additional environment details (AWS, VirtualBox, physical, etc.):
Ubuntu 16.04 on AWS EC2 using the repro runner
Thanks @samuelkarp!
Package | Result
---|---
17.09.1~ce-0~ubuntu | pass
17.10.0~ce-0~ubuntu | pass
17.11.0~ce-0~ubuntu | pass
17.12.0~ce~rc1-0~ubuntu | fail
17.12.0~ce-0~ubuntu | fail
17.12.1~ce-0~ubuntu | fail
18.01.0~ce-0~ubuntu | fail
18.02.0~ce-0~ubuntu | fail
18.03.0~ce~rc4-0~ubuntu | fail
Amazon Linux on AWS EC2 using the repro runner
Thanks @jhaynes!
Package | Result
---|---
17.12.0-ce | fail
17.09.1-ce | pass
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 4
- Comments: 15 (14 by maintainers)
@cpuguy83 The issue appears to be resolved with 18.03.1 (tested on Amazon Linux and Ubuntu) 👍
We are kicking off a release of containerd 1.0.3 in https://github.com/containerd/containerd/pull/2242 with the mitigation. I think there is a more complete fix on the docker side to prevent this condition from happening.
@spex66
No data is lost from stopping/starting containers other than in-memory state.
Fix a broken daemon? Restart dockerd.
More permanent fix: upgrade to 18.03.1.
@jahkeup Is this resolved on your end with 18.03.1?
https://github.com/containerd/containerd/pull/2229 is a possible fix for this, or it at least alleviates the problem.
I still think something is awry with the healthchecks. There may be a rarer condition that is exacerbated when the FIFOs block.