moby: Container HEALTHCHECKs can lead to hanging API calls and bad State
Description
Occasionally the Docker daemon stops responding to interactions with containers that have HEALTHCHECKs. The problem presents itself in several older versions of Docker, in the latest packaged versions on Ubuntu, and in v17.12.0-ce on Amazon Linux.
From our observations, this looks like a race condition that ends in a deadlock, which blocks any further calls against the affected containers.
This observed issue may be related to https://github.com/moby/moby/issues/35933 . I’m working on bisecting the releases (using https://github.com/docker/docker-ce) using this repro to narrow down the problem further in any case.
Steps to reproduce the issue:
A repro case has been built and run against several versions of Docker, with positive results after a few rounds of execution (I recommend 10-20 rounds to tickle the bug). There likely isn’t anything special about using 2 containers, but that number has reliably triggered the bug in this test.
- Build container image with HEALTHCHECK defined (`echo hello` every `1s`)
- Start 2 containers using image
- Wait some time (`10s` in our test)
- Stop containers
- Inspect containers
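For anyone who wants to script these steps, here is a rough sketch of one round in Python. It is not the repro runner mentioned in this issue. The `busybox` base image, the helper names, and the timeouts are assumptions for illustration; only the HEALTHCHECK (`echo hello` every `1s`), the `sleep 30m` command, the image tag, and the 10s wait come from the report.

```python
import subprocess
import tempfile
import time
from pathlib import Path

IMAGE = "docker-poke:healthchecks"  # tag as it appears in the `docker ps` output

# Assumption: busybox as the base image; the report does not name one.
DOCKERFILE = """\
FROM busybox
HEALTHCHECK --interval=1s CMD echo hello
CMD ["sh", "-c", "sleep 30m"]
"""


def docker(*args, timeout=None):
    """Run a docker CLI command and return its stdout.

    Raises CalledProcessError on a non-zero exit and TimeoutExpired if the
    command does not finish within `timeout` seconds.
    """
    result = subprocess.run(
        ["docker", *args],
        check=True,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()


def build_image():
    """Build the test image from the Dockerfile above in a temporary context."""
    with tempfile.TemporaryDirectory() as ctx:
        Path(ctx, "Dockerfile").write_text(DOCKERFILE)
        docker("build", "-t", IMAGE, ctx)


def run_round(wait_s=10, call_timeout_s=60):
    """Start 2 containers, wait, stop them, then inspect them.

    Returns True if every call came back, False if a stop or inspect call
    timed out (the symptom described in this issue).
    """
    ids = [docker("run", "-d", IMAGE) for _ in range(2)]
    time.sleep(wait_s)
    try:
        for cid in ids:
            docker("stop", cid, timeout=call_timeout_s)
        for cid in ids:
            docker("inspect", cid, timeout=call_timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return True
```

Calling `build_image()` once and then `run_round()` in a loop covers the serial case. The bug showed up for us mainly when rounds overlapped, so a serial loop alone may not trigger it.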
Describe the results you received:
Started containers appear to continue running and to be healthy despite being non-responsive.
ubuntu@ip-172-31-37-156:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0cf518c205f7 docker-poke:healthchecks "sh -c 'sleep 30m'" 17 minutes ago Up 17 minutes (healthy) sad_hugle
ubuntu@ip-172-31-37-156:~$ docker inspect 0cf518c205f7
^C
Additionally, the output of docker ps will continue reporting that the container is still up and running even though the process will exit after 30m (started with sleep 30m).
0cf518c205f7 docker-poke:healthchecks "sh -c 'sleep 30m'" 37 minutes ago Up 37 minutes (healthy) sad_hugle
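To detect the hang without sitting at a terminal and pressing ^C, the inspect call can be wrapped in a timeout. This is a minimal sketch; the helper names and the 30 second default are assumptions, not part of any Docker tooling.

```python
import subprocess


def finishes_within(argv, timeout_s):
    """Return True if the command exits within timeout_s seconds.

    On timeout, subprocess.run kills the child process and raises
    TimeoutExpired, which is reported here as False.
    """
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        return False


def inspect_hangs(container_id, timeout_s=30):
    """Return True if `docker inspect <id>` does not come back in time."""
    return not finishes_within(["docker", "inspect", container_id], timeout_s)
```

For the container above, `inspect_hangs("0cf518c205f7")` would be expected to return `True` while the daemon is in the bad state. Killing the CLI process only frees the client; the daemon-side call stays stuck.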
Describe the results you expected:
I expected that I would be able to inspect this container.
docker inspect 0cf518c205f7
{
...
}
Additional information you deem important (e.g. issue happens only occasionally):
This issue is readily made apparent with a few concurrent runs, but otherwise lies dormant even with many serial runs.
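One way to get overlapping runs is to push the rounds through a small thread pool. The sketch below is generic: `round_fn` stands for whatever performs a single repro round, and the default of 4 parallel workers is an assumption (the 20 rounds match the upper end of the 10-20 suggested above).

```python
from concurrent.futures import ThreadPoolExecutor


def run_rounds(round_fn, rounds=20, parallelism=4):
    """Call round_fn(i) for i in range(rounds), at most `parallelism` at once.

    Results are returned in round order. With parallelism=1 the rounds run
    one after another, which is the serial case.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(round_fn, range(rounds)))
```

Comparing `run_rounds(fn, parallelism=1)` with `run_rounds(fn, parallelism=4)` is a quick way to see the serial versus concurrent difference on a given host.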
Output of docker version:
Client:
Version: 17.12.1-ce
API version: 1.35
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:17:40 2018
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.1-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:16:13 2018
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 5
Server Version: 17.12.1-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 4
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-1052-aws
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.625GiB
Name: ip-172-31-37-156
ID: L4W4:V4WA:OHSS:QTGL:DRJG:32GX:7DKK:FFLO:WKR2:IJYV:NKDG:GWRA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.):
Ubuntu 16.04 on AWS EC2 using the repro runner
Thanks @samuelkarp!
| Package | Result |
|---|---|
| 17.09.1~ce-0~ubuntu | pass |
| 17.10.0~ce-0~ubuntu | pass |
| 17.11.0~ce-0~ubuntu | pass |
| 17.12.0~ce~rc1-0~ubuntu | fail |
| 17.12.0~ce-0~ubuntu | fail |
| 17.12.1~ce-0~ubuntu | fail |
| 18.01.0~ce-0~ubuntu | fail |
| 18.02.0~ce-0~ubuntu | fail |
| 18.03.0~ce~rc4-0~ubuntu | fail |
Amazon Linux on AWS EC2 using the repro runner
Thanks @jhaynes!
| Package | Result |
|---|---|
| 17.12.0-ce | fail |
| 17.09.1-ce | pass |
About this issue
- State: closed
- Created 6 years ago
- Reactions: 4
- Comments: 15 (14 by maintainers)
@cpuguy83 The issue appears to be resolved with 18.03.1 (tested on Amazon Linux and Ubuntu) 👍
We are kicking off a release of containerd 1.0.3 in https://github.com/containerd/containerd/pull/2242 with the mitigation. I think there is a more complete fix on the docker side to prevent this condition from happening.
@spex66
No data is lost from stopping/starting containers other than in-memory state
Fix a broken daemon? Restart dockerd.
More permanent fix: Upgrade to 18.03.1.
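If you need to sort hosts into "probably affected" and "probably fine", here is a rough check based only on the packages tested in this thread. The range boundaries are an assumption drawn from those results (17.11.0 passed, 17.12.0~ce~rc1 through 18.03.0~ce~rc4 failed, 18.03.1 was reported as resolved). The function names are made up for illustration, and release-candidate suffixes are ignored.

```python
import re

FIRST_FAILING = (17, 12, 0)  # earliest failing package tested: 17.12.0~ce~rc1
FIRST_FIXED = (18, 3, 1)     # 18.03.1 was reported as resolved in this thread


def parse_version(text):
    """Extract (year, month, patch) from strings like '17.12.1-ce'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized Docker version: {text!r}")
    return tuple(int(part) for part in match.groups())


def likely_affected(text):
    """True if the version falls in the range that failed in the tests above."""
    return FIRST_FAILING <= parse_version(text) < FIRST_FIXED
```

For example, `likely_affected("17.12.1-ce")` is true and `likely_affected("18.03.1-ce")` is false. Versions that nobody tested here are classified by range only, so treat the answer as a hint and not as a guarantee.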
@jahkeup Is this resolved on your end with 18.03.1?
https://github.com/containerd/containerd/pull/2229 is a possible fix for this, or it at least alleviates the problem.
I still think something is awry with the healthchecks. There may be a rarer condition that is exacerbated when the FIFOs block.