moby: Docker daemon hangs intermittently after upgrading to docker 1.12.2
Description
We are encountering intermittent hangs of the docker daemon after moving to docker 1.12.2. Restarting docker appears to be the only way to recover. This appears to be similar to https://github.com/docker/docker/issues/25321.
It was also suggested that we move to docker 1.12.3. As this will require significant effort on our end, it would be great to get feedback on whether the stack trace is indicative of an issue that was fixed between 1.12.2 and 1.12.3.
Steps to reproduce the issue:
- Issue standard docker commands; the hang happens intermittently across a large number of nodes, so I don’t have a repeatable set of steps.
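To be concrete about what "hang" means here, a probe along these lines (a hypothetical sketch, not our actual tooling) is enough to tell a hung daemon from a merely slow one:

# Hypothetical probe: flag a node whose daemon stops answering within 60 seconds.
timeout 60 docker info > /dev/null 2>&1 || echo "dockerd appears hung on $(hostname)"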
Describe the results you received: Docker hangs intermittently. Restarting docker fixes the issue.
Describe the results you expected: Docker shouldn’t hang.
Additional information you deem important (e.g. issue happens only occasionally): Only happening intermittently.
One potential workaround mentioned on https://github.com/docker/docker/issues/25321 is moving to overlayfs. We are using an LVM thin-pool devicemapper setup backed by xfs, and overlay is unfortunately not very stable on CentOS 7.2, so that move is not currently an option. It would also be quite disruptive to existing workloads.
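For reference, the storage setup reported by docker info below corresponds roughly to a daemon invocation like the following (a sketch; the exact flags on our hosts may differ, and dm.basesize is inferred from the reported base device size):

# Sketch of the dockerd storage flags implied by the docker info output below.
dockerd --graph=/grid/0/docker \
  --storage-driver=devicemapper \
  --storage-opt dm.thinpooldev=/dev/mapper/vg01-docker--pool \
  --storage-opt dm.fs=xfs \
  --storage-opt dm.basesize=256G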
Output of docker version:
Client:
Version: 1.12.2
API version: 1.24
Go version: go1.6.3
Git commit: bb80604
Built:
OS/Arch: linux/amd64
Server:
Version: 1.12.2
API version: 1.24
Go version: go1.6.3
Git commit: bb80604
Built:
OS/Arch: linux/amd64
Output of docker info:
Containers: 105
Running: 19
Paused: 0
Stopped: 86
Images: 108
Server Version: 1.12.2
Storage Driver: devicemapper
Pool Name: vg01-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 274.9 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 853.2 GB
Data Space Total: 5.63 TB
Data Space Available: 4.777 TB
Metadata Space Used: 105.9 MB
Metadata Space Total: 16.98 GB
Metadata Space Available: 16.87 GB
Thin Pool Minimum Free Space: 563 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.107-RHEL7 (2015-12-01)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host null overlay bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.13.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 251.6 GiB
Name: foo.example.com
ID: OSYP:WEPA:N2LF:KFTJ:BZNP:L3PT:LDNV:4OBJ:A4AM:CWFB:HHIK:WN4M
Docker Root Dir: /grid/0/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
127.0.0.0/8
Here is the stack trace from dockerd: https://gist.github.com/sakserv/6aa7c7a1a8eac147e27d8c060023a36d
Docker client strace:
-snip-
getpeername(4, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc82004a908, FUTEX_WAKE, 1) = 1
read(4, 0xc820327000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
write(4, "GET /v1.24/info HTTP/1.1\r\nHost: "..., 84) = 84
epoll_wait(5, {}, 128, 0) = 0
futex(0x132cca8, FUTEX_WAIT, 0, NULL
dockerd strace:
Process 850182 attached
wait4(850188,
docker-containerd (pid 850188) strace:
Process 850188 attached
futex(0xee44c8, FUTEX_WAIT, 0, NULL
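The client strace above shows the CLI writing GET /v1.24/info to the daemon socket and then blocking on a futex waiting for a reply, while dockerd and docker-containerd are parked in wait4/futex. The same request can be issued directly against the socket to confirm that the daemon itself is unresponsive (this assumes a curl built with unix-socket support, 7.40+, which is newer than the stock CentOS 7.2 curl):

# Bypass the docker CLI and hit the daemon socket directly; cap the wait so a
# hung daemon shows up as a timeout rather than an indefinite block.
curl --max-time 30 --unix-socket /var/run/docker.sock http://localhost/v1.24/info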
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 16 (8 by maintainers)
If a daemon is stuck, you can send a
SIGUSR1
to thedockerd
process, which will dump a stack-trace in the daemon logs (docker 1.12) or in a separate file (docker 1.13). That dump should provide more information what is happening. It’s still possible that there’s a deadlock in the daemon; work is in progress on improving that, but design/approach is being discussed (see https://github.com/docker/docker/issues/30225)