moby: Docker daemon hangs and becomes unresponsive

Description

Intermittently, the docker daemon becomes unresponsive: docker info hangs and does not show any information at all. Even stopping the docker daemon through systemd hangs and effectively freezes the system.

Steps to reproduce the issue:

  • The issue is intermittent, and I cannot reliably reproduce the hanging behavior.
  • Issue a systemctl stop docker command to stop the daemon.
  • The daemon does not stop within the default systemd timeout.
  • System attempts to kill the daemon.
  • Systemd itself “hangs”, and some processes that belong to docker (dockerd and docker-proxy) become “Zombie” processes.
  • The host then behaves erratically: system commands do not work, you cannot CTRL-C out of tailing a log file, the box will not reboot, and so on.

This also affects rebooting a box, as there is an implicit “systemctl stop docker” in the shutdown sequence.

I have a feeling this started with the move to Docker Engine 1.12.1.

Describe the results you received: Running docker info or any docker command hangs and shows no information at all. Running a systemctl stop docker command hangs and eventually freezes the entire system.

Describe the results you expected: Docker commands, including docker info, should work normally and display the relevant information.

Additional information you deem important (e.g. issue happens only occasionally): Intermittent issue, but when it happens it is unrecoverable. The only recourse is to hard reboot the Linux (RHEL in this case) box.

[root@Docker-Demo4-wf ~]# uname -a
Linux Docker-Demo4-wf 3.10.0-327.28.2.el7.x86_64 #1 SMP Mon Jun 27 14:48:28 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Output of docker version:

1.12.1

Output of docker info:

Containers: 24
Running: 5
Paused: 0
Stopped: 19
Images: 146
Server Version: 1.12.1
Storage Driver: devicemapper
Pool Name: docker-thinpool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 26.27 GB
Data Space Total: 91.39 GB
Data Space Available: 65.12 GB
Metadata Space Used: 9.339 MB
Metadata Space Total: 46.14 MB
Metadata Space Available: 36.8 MB
Thin Pool Minimum Free Space: 9.139 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay host bridge null
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 27.48 GiB
Name: Docker-Demo4-wf
ID: NMUO:R424:WQOD:TMUU:C74X:NEHA:FU5O:6CQP:DZ4W:FSCJ:DU5Y:WJP5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Cluster Store: etcd://10.215.13.20:12379
Cluster Advertise: 10.215.13.23:12376
Insecure Registries:

Additional environment details (AWS, VirtualBox, physical, etc.): The node is on Azure, running RHEL 7.2.

[root@Docker-Demo4-wf ~]# ps ax | grep docker
6427 ?        Ssl   19:16 /usr/bin/dockerd
6433 ?        Ssl    1:02 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --shim docker-containerd-shim --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --runtime docker-runc
6523 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 27017 -container-ip 172.18.0.7 -container-port 27017
6534 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8888 -container-ip 172.18.0.24 -container-port 8080
6713 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 12376 -container-ip 172.17.0.2 -container-port 2376
6728 ?        Sl     0:00 docker-containerd-shim c696ad4ceadf912866c0d7b478105a0fe1c5db3ff9e96b6cb3affc67fd5d4302 /var/run/docker/libcontainerd/c696ad4ceadf912866c0d7b478105a0fe1c5db3ff9e96b6cb3affc67fd5d4302 docker-runc
6745 ?        Ssl    0:13 docker-proxy -l :2376 -ca /etc/docker/ssl/ca.pem -cert /etc/docker/ssl/cert.pem -key /etc/docker/ssl/key.pem
7101 ?        Sl     0:00 docker-containerd-shim cbbaa65ec5f84905d7329bb88cf3f71d689d4d8c4b52520d4bc0b91b3ceffb18 /var/run/docker/libcontainerd/cbbaa65ec5f84905d7329bb88cf3f71d689d4d8c4b52520d4bc0b91b3ceffb18 docker-runc
7116 ?        Ssl    0:02 /swarm join --discovery-opt kv.cacertfile=/etc/docker/ssl/ca.pem --discovery-opt kv.certfile=/etc/docker/ssl/cert.pem --discovery-opt kv.keyfile=/etc/docker/ssl/key.pem --discovery-opt kv.path=/docker/nodes --advertise 10.215.13.23:12376 --discovery-opt kv.cacertfile=/etc/docker/ssl/ca.pem --discovery-opt kv.certfile=/etc/docker/ssl/cert.pem --discovery-opt kv.keyfile=/etc/docker/ssl/key.pem --discovery-opt kv.path=/docker/nodes etcd://10.215.13.20:12379
7722 ?        Sl     0:00 docker-containerd-shim a39a0eed65791261d6408bec5ba3ae2861331e222e78487b3f4aa68d73d87e22 /var/run/docker/libcontainerd/a39a0eed65791261d6408bec5ba3ae2861331e222e78487b3f4aa68d73d87e22 docker-runc
7823 ?        Sl     0:00 docker-containerd-shim ecd477ff230f5f49f74cab2ec3ba403e86a3eda9c4c27d0576d0bc3961c3485e /var/run/docker/libcontainerd/ecd477ff230f5f49f74cab2ec3ba403e86a3eda9c4c27d0576d0bc3961c3485e docker-runc
7943 ?        Sl     0:00 docker-containerd-shim 0a53bdd51c2abf420cd347d7ae58cf163a182b0edf108e1c851893392aeb1b55 /var/run/docker/libcontainerd/0a53bdd51c2abf420cd347d7ae58cf163a182b0edf108e1c851893392aeb1b55 docker-runc
8062 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 32768 -container-ip 172.18.0.7 -container-port 15672
8067 ?        Sl     0:00 docker-containerd-shim 9b02e24eb3cd7f50b5806604f77bca6effb47093480a2b31f24bf529f8d9ac4b /var/run/docker/libcontainerd/9b02e24eb3cd7f50b5806604f77bca6effb47093480a2b31f24bf529f8d9ac4b docker-runc
8705 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9099 -container-ip 172.18.0.8 -container-port 9099
8714 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9098 -container-ip 172.18.0.8 -container-port 9098
8723 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9097 -container-ip 172.18.0.8 -container-port 9097
8731 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9096 -container-ip 172.18.0.8 -container-port 9096
8739 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9095 -container-ip 172.18.0.8 -container-port 9095
8747 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9094 -container-ip 172.18.0.8 -container-port 9094
8755 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9093 -container-ip 172.18.0.8 -container-port 9093
8763 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9092 -container-ip 172.18.0.8 -container-port 9092
8772 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9091 -container-ip 172.18.0.8 -container-port 9091
8780 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9090 -container-ip 172.18.0.8 -container-port 9090
8811 ?        Sl     0:08 docker-containerd-shim 979e005115f1c92de4a489e95eddb155827fcaccdd83677e0d91562c2a3db446 /var/run/docker/libcontainerd/979e005115f1c92de4a489e95eddb155827fcaccdd83677e0d91562c2a3db446 docker-runc
10777 ?        Sl     0:00 docker-containerd-shim 791bcda4cf657872108d9a383833a833f12b52740c35231f04de305a0aab706a /var/run/docker/libcontainerd/791bcda4cf657872108d9a383833a833f12b52740c35231f04de305a0aab706a docker-runc
22644 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 32770 -container-ip 172.18.0.9 -container-port 8888
22649 ?        Sl     0:00 docker-containerd-shim 8ff251a924bbe3526e581e0796b9b1644b51d022d00f0c7ac85b34d1829c485f /var/run/docker/libcontainerd/8ff251a924bbe3526e581e0796b9b1644b51d022d00f0c7ac85b34d1829c485f docker-runc
23349 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 30001 -container-ip 172.18.0.10 -container-port 8080
23381 ?        Sl     0:00 docker-containerd-shim ec214187a5573656ae14de5b58c70a565ad48f7406f3a92802436804a3dfe171 /var/run/docker/libcontainerd/ec214187a5573656ae14de5b58c70a565ad48f7406f3a92802436804a3dfe171 docker-runc
29258 ?        Sl     0:00 docker-containerd-shim 50fbbc0ef455bc1c626a836808ce7c64841703f91f1350606427b82e0d831b5b /var/run/docker/libcontainerd/50fbbc0ef455bc1c626a836808ce7c64841703f91f1350606427b82e0d831b5b docker-runc
59052 pts/6    S+     0:00 grep --color=auto docker

[root@Docker-Demo4-wf ~]# docker info (when issue occurs)
TIMEOUT
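
To tell a hang apart from an ordinary failure without losing the terminal, the command can be run under a deadline. The following is only a minimal sketch: `probe` is a hypothetical helper name, and it assumes the GNU coreutils `timeout` command, which exits with status 124 when the deadline expires.

```shell
# probe SECONDS CMD [ARGS...]: run CMD with a deadline and classify the result.
# Hypothetical helper, not part of Docker. Relies on GNU coreutils `timeout`,
# which exits with status 124 when the deadline expires.
probe() {
  deadline="$1"; shift
  rc=0
  timeout "$deadline" "$@" >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0)   echo "ok" ;;
    124) echo "hung" ;;
    *)   echo "failed" ;;
  esac
}

# Usage:
#   probe 10 docker info    # "hung" means no answer within 10 seconds
```

This only helps while the docker client itself can still be terminated by a signal. A process stuck in uninterruptible sleep would not react to the deadline either.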

Running strace on the pid of the docker daemon produces:

wait4(11127,

11127 seems to be the pid for containerd.

While the issue is happening:

  • strace -p PID of dockerd produces the file “strace.dockerd.out”
  • the file doesn’t update, dockerd is just waiting.
  • lsof | grep 11127 (PID from the strace.dockerd.out file) produces the content of lsof_grep_11127.out
  • ps ax | grep 11127 produces the output of ps_ax_grep_11127.out
  • strace -p 11127 (PID of docker-containerd) produces “strace.docker-containerd.out”
  • the file doesn’t update, docker-containerd is just waiting
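
Since both traces just sit in a wait, it may also be worth recording which processes are zombies or in uninterruptible sleep at that moment. This is a sketch under assumptions, not a diagnosis: `filter_stuck` is a hypothetical helper, and it expects the column layout of `ps -eo pid,stat,wchan:32,comm` from procps, where the second column is the state (D for uninterruptible sleep, Z for zombie).

```shell
# filter_stuck: read `ps -eo pid,stat,wchan:32,comm` output on stdin and keep
# only rows whose state starts with D (uninterruptible sleep) or Z (zombie).
# Hypothetical helper; the first input line is assumed to be the ps header.
filter_stuck() {
  awk 'NR > 1 && $2 ~ /^[DZ]/ { print }'
}

# Usage while the hang is happening:
#   ps -eo pid,stat,wchan:32,comm | filter_stuck
```

The WCHAN column shows the kernel function a sleeping process is waiting in, which could be attached next to the strace output.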
    

Attaching these files inside Archive.zip:

  • strace.docker-containerd.out
  • ps_ax_grep_11127.out
  • lsof_grep_11127.out
  • strace.dockerd.out
  • messages
  • dmesg

Archive.zip

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 5
  • Comments: 20 (11 by maintainers)

Most upvoted comments

@diegito that seems like a different issue.

Could you open a new issue with this description along with the daemon logs? If you can reproduce it, putting the daemon in debug mode would give us more useful logging information. Also, if you can send a SIGUSR1 to dockerd when it occurs, it’ll help pinpoint where the daemon is blocked.
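
A rough sketch of the SIGUSR1 step, for reference. `send_usr1` is a hypothetical helper, `pidof dockerd` assumes a single daemon process, and the `journalctl` line assumes the daemon runs as the systemd unit `docker`. Where the stack dump ends up can depend on the Docker version, so check the daemon log first. Debug mode can be enabled by starting dockerd with `-D` or by setting `"debug": true` in /etc/docker/daemon.json.

```shell
# send_usr1 PID: deliver SIGUSR1 to PID after checking that the process exists.
# Hypothetical helper, not part of Docker. Run as root for a root-owned daemon.
send_usr1() {
  pid="$1"
  case "$pid" in
    ''|*[!0-9]*) echo "usage: send_usr1 PID" >&2; return 2 ;;
  esac
  if ! kill -0 "$pid" 2>/dev/null; then
    echo "cannot signal process: $pid" >&2
    return 1
  fi
  kill -s USR1 "$pid"
}

# Usage while the daemon is hung (assumes one dockerd process):
#   send_usr1 "$(pidof dockerd)"
#   journalctl -u docker --since "5 minutes ago"
```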

@djsly 1.12.5 has the fix I mentioned regarding exec’s children. 1.12.6 was indeed only a CVE fix, which is why I’d recommend using it instead of 1.12.5.
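
For checking hosts against that recommendation, a minimal sketch follows. `version_ge` is a hypothetical helper and assumes GNU `sort` with the `-V` (version sort) option. Querying the server version requires a daemon that still responds, so this is for auditing healthy hosts, not for a box that is already hung.

```shell
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (needs a responsive daemon):
#   v="$(docker version --format '{{.Server.Version}}')"
#   version_ge "$v" 1.12.6 || echo "engine $v predates 1.12.6"
```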