moby: Container HEALTHCHECKs can lead to hanging API calls and bad State

Description

Occasionally the Docker daemon stops responding to interactions with containers that have HEALTHCHECKs defined. The problem presents itself in several older versions of Docker, in the latest packaged versions on Ubuntu, and in v17.12.0-ce on Amazon Linux.

Based on our observations, this looks like a race condition that results in a deadlock, preventing further API calls against the affected containers.

This issue may be related to https://github.com/moby/moby/issues/35933 . In any case, I'm working on bisecting the releases (using https://github.com/docker/docker-ce) with this repro to narrow down the problem further.

Steps to reproduce the issue:

A repro case has been built and run against several versions of Docker, reliably triggering the bug after a few rounds of execution (I recommend 10-20 rounds to tickle it). There likely isn't anything specific about using 2 containers, but that count has been reliably triggering the bug in this test.

  1. Build container image with HEALTHCHECK defined (echo hello every 1s)
  2. Start 2 containers using image
  3. Wait some time (10s in our test)
  4. Stop containers
  5. Inspect containers
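The steps above can be sketched as a shell script. The image tag `docker-poke:healthchecks` matches the `docker ps` output below; the base image, healthcheck flags, and exact timings are assumptions, not the reporter's actual harness:

```shell
#!/bin/sh
# Repro sketch for the HEALTHCHECK hang; requires a running Docker daemon.
set -e

# 1. Build an image whose HEALTHCHECK echoes hello every 1s.
cat > Dockerfile <<'EOF'
FROM busybox
HEALTHCHECK --interval=1s --timeout=5s CMD echo hello
CMD ["sh", "-c", "sleep 30m"]
EOF
docker build -t docker-poke:healthchecks .

# 2. Start 2 containers using the image.
c1=$(docker run -d docker-poke:healthchecks)
c2=$(docker run -d docker-poke:healthchecks)

# 3. Wait some time (10s in our test).
sleep 10

# 4. Stop the containers.
docker stop "$c1" "$c2"

# 5. Inspect the containers; on an affected daemon this call hangs.
docker inspect "$c1" "$c2"
```

Run the script 10-20 times; a single round usually completes without hitting the race.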

Describe the results you received:

Started containers appear to continue running and report healthy despite being unresponsive.

ubuntu@ip-172-31-37-156:~$ docker ps
CONTAINER ID        IMAGE                      COMMAND               CREATED             STATUS                    PORTS               NAMES
0cf518c205f7        docker-poke:healthchecks   "sh -c 'sleep 30m'"   17 minutes ago      Up 17 minutes (healthy)                       sad_hugle
ubuntu@ip-172-31-37-156:~$ docker inspect 0cf518c205f7
^C

Additionally, the output of docker ps continues to report that the container is up and running even though the container process exits after 30m (it was started with sleep 30m).

0cf518c205f7        docker-poke:healthchecks   "sh -c 'sleep 30m'"   37 minutes ago      Up 37 minutes (healthy)                               sad_hugle
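One way to confirm that the reported state is stale (an extra verification step we are assuming here, not shown in the original report) is to compare `docker ps` against the host process table once the 30 minutes have elapsed:

```shell
# docker ps still claims the container is Up (healthy)...
docker ps --filter id=0cf518c205f7

# ...but the host no longer has a matching process. The bracketed
# pattern keeps grep from matching its own command line.
ps aux | grep '[s]leep 30m' || echo "no sleep process left on the host"
```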

Describe the results you expected:

I expected to be able to inspect this container:

docker inspect 0cf518c205f7
{
   ...
}

Additional information you deem important (e.g. issue happens only occasionally):

This issue readily appears after a few concurrent runs, but otherwise lies dormant even across many serial runs.
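Since concurrency is what shakes the bug loose, a driver loop along these lines can help (`./repro.sh` is an assumed name for a script implementing the repro steps above; the round and fan-out counts are arbitrary):

```shell
#!/bin/sh
# Launch several copies of the repro in parallel per round; serial
# execution alone rarely triggers the hang.
for round in $(seq 1 20); do
  for i in 1 2 3 4; do
    ./repro.sh &    # assumed repro script name
  done
  wait              # let all copies in this round finish before the next
done
```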

Output of docker version:

Client:
 Version:       17.12.1-ce
 API version:   1.35
 Go version:    go1.9.4
 Git commit:    7390fc6
 Built: Tue Feb 27 22:17:40 2018
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.1-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   7390fc6
  Built:        Tue Feb 27 22:16:13 2018
  OS/Arch:      linux/amd64
  Experimental: false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 17.12.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 4
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1052-aws
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.625GiB
Name: ip-172-31-37-156
ID: L4W4:V4WA:OHSS:QTGL:DRJG:32GX:7DKK:FFLO:WKR2:IJYV:NKDG:GWRA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Ubuntu 16.04 on AWS EC2 using the repro runner

Thanks @samuelkarp!

Package                  Result
17.09.1~ce-0~ubuntu      pass
17.10.0~ce-0~ubuntu      pass
17.11.0~ce-0~ubuntu      pass
17.12.0~ce~rc1-0~ubuntu  fail
17.12.0~ce-0~ubuntu      fail
17.12.1~ce-0~ubuntu      fail
18.01.0~ce-0~ubuntu      fail
18.02.0~ce-0~ubuntu      fail
18.03.0~ce~rc4-0~ubuntu  fail

Amazon Linux on AWS EC2 using the repro runner

Thanks @jhaynes!

Package     Result
17.12.0-ce  fail
17.09.1-ce  pass

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 15 (14 by maintainers)

Most upvoted comments

@cpuguy83 The issue appears to be resolved with 18.03.1 (tested on Amazon Linux and Ubuntu) 👍

We are kicking off a release of containerd 1.0.3 in https://github.com/containerd/containerd/pull/2242 with the mitigation. I think there is a more complete fix on the docker side to prevent this condition from happening.

@spex66

Is there a way to start/stop the containers without risking data loss?

No data is lost from stopping/starting containers other than in-memory state

more general, how to get control of a docker deployment which is in this state?

Fix a broken daemon? Restart dockerd.

More permanent fix: Upgrade to 18.03.1.


@jahkeup Is this resolved on your end with 18.03.1?

https://github.com/containerd/containerd/pull/2229 is a possible fix for this, or it at least alleviates the problem.

I still think something is awry with the healthchecks. There may be a rarer condition that is exacerbated when the FIFOs block.