moby: Docker daemon hangs intermittently after upgrading to docker 1.12.2

Description

We are encountering intermittent hangs of the docker daemon after moving to docker 1.12.2. Restarting docker appears to be the only way to recover. This appears to be similar to https://github.com/docker/docker/issues/25321.

It was also suggested that we move to docker 1.12.3. As this will require significant effort on our end, it would be great to get feedback on whether the stack trace is indicative of an issue that was fixed between 1.12.2 and 1.12.3.

Steps to reproduce the issue:

  1. Issue standard docker commands. The hang happens intermittently across a large number of nodes, so I don’t have a repeatable set of steps; an illustrative session is shown below.
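
For illustration, the kind of session that exhibits the hang (the timeout wrapper is purely illustrative; the client strace further down was captured against a hung docker info call):

# any of these may block indefinitely on an affected node
timeout 30 docker info || echo "docker info did not return within 30s"
timeout 30 docker ps   || echo "docker ps did not return within 30s"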

Describe the results you received: Docker hangs intermittently. Restarting docker fixes the issue.

Describe the results you expected: Docker shouldn’t hang.

Additional information you deem important (e.g. issue happens only occasionally): Only happening intermittently.

One potential workaround mentioned on https://github.com/docker/docker/issues/25321 is moving to overlayfs. We are using an LVM thin pool devicemapper setup backed by xfs, and unfortunately overlay is not very stable on CentOS 7.2, so that move is not currently an option. It would also be quite disruptive to existing workloads.
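
For reference, our storage configuration is roughly equivalent to the following dockerd invocation (a sketch reconstructed from the docker info output below, not our exact service file):

dockerd \
  --graph=/grid/0/docker \
  --storage-driver=devicemapper \
  --storage-opt dm.thinpooldev=/dev/mapper/vg01-docker--pool \
  --storage-opt dm.fs=xfs \
  --storage-opt dm.basesize=256G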

Output of docker version:

Client:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 105
 Running: 19
 Paused: 0
 Stopped: 86
Images: 108
Server Version: 1.12.2
Storage Driver: devicemapper
 Pool Name: vg01-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 274.9 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 853.2 GB
 Data Space Total: 5.63 TB
 Data Space Available: 4.777 TB
 Metadata Space Used: 105.9 MB
 Metadata Space Total: 16.98 GB
 Metadata Space Available: 16.87 GB
 Thin Pool Minimum Free Space: 563 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-12-01)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host null overlay bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.13.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 251.6 GiB
Name: foo.example.com
ID: OSYP:WEPA:N2LF:KFTJ:BZNP:L3PT:LDNV:4OBJ:A4AM:CWFB:HHIK:WN4M
Docker Root Dir: /grid/0/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8

Here is the stack trace from dockerd: https://gist.github.com/sakserv/6aa7c7a1a8eac147e27d8c060023a36d
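
For anyone skimming the gist, a quick way to summarize where goroutines are parked (the file name is hypothetical; the patterns match Go's standard goroutine dump format):

# total goroutines in the dump
grep -c '^goroutine ' dockerd-stack.txt
# tally goroutines by blocked state (semacquire, IO wait, chan receive, ...)
grep '^goroutine ' dockerd-stack.txt | sed 's/.*\[\(.*\)\]:$/\1/' | sort | uniq -c | sort -rn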

Docker client strace:

-snip-
getpeername(4, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc82004a908, FUTEX_WAKE, 1)      = 1
read(4, 0xc820327000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(4, "GET /v1.24/info HTTP/1.1\r\nHost: "..., 84) = 84
epoll_wait(5, {}, 128, 0)               = 0
futex(0x132cca8, FUTEX_WAIT, 0, NULL

dockerd strace:

Process 850182 attached
wait4(850188,

docker-containerd (pid 850188) strace:

Process 850188 attached
futex(0xee44c8, FUTEX_WAIT, 0, NULL
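
For completeness, traces like the above can be captured by attaching strace to the already-running processes (PIDs taken from the output above):

strace -p 850182     # dockerd, blocked in wait4 on its containerd child
strace -p 850188     # docker-containerd, blocked in a futex
strace docker info   # client side, blocked in a futex after writing the HTTP request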

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

If a daemon is stuck, you can send a SIGUSR1 to the dockerd process, which will dump a stack trace in the daemon logs (docker 1.12) or in a separate file (docker 1.13). That dump should provide more information about what is happening. It’s still possible that there’s a deadlock in the daemon; work on improving that is in progress, but the design/approach is still being discussed (see https://github.com/docker/docker/issues/30225)
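
A minimal sketch of that procedure on a systemd-managed host such as the CentOS 7 node above (assuming the service unit is named docker):

# request a goroutine dump from the daemon
kill -USR1 $(pidof dockerd)
# on docker 1.12 the dump is written to the daemon logs
journalctl -u docker --since "5 min ago" | less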