moby: runc does not terminate causing containerd-shim to hang in docker 18.09.8
Description
runc occasionally hangs and needs to be killed for the rest of the system to keep working properly. We run Kubernetes, so the most noticeable symptom for us is that kubelet on the host starts reporting PLEG timeouts and the k8s node status keeps flapping between NotReady and Ready. It appears the containerd-shim responsible for the runc process and its container stops responding.
We can still interact with docker for the most part, and I don't believe we see issues other than kubelet being unable to report container events. docker ps shows the container ID with status created, but docker inspect on that container hangs.
Steps to reproduce the issue: Can't reproduce it reliably, but it happens a couple of times a day on our k8s cluster of 50+ nodes.
Describe the results you received: runc does not terminate, and docker inspect <container_id> hangs. Kubelet starts getting PLEG timeouts and the node keeps switching between NotReady and Ready states.
Describe the results you expected: runc should not hang. Even if it does, containerd-shim should not hang; ideally it could kill the hung runc process.
Additional information you deem important (e.g. issue happens only occasionally): The issue happens occasionally, but killing the hung runc process restores the system
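A rough sketch of that manual workaround (not from the original report; the 10-minute threshold and the placeholder PID are illustrative): list runc state invocations that have been alive far longer than they should be, then kill them after review.
# List runc "state" invocations running for more than 10 minutes
# (etimes is elapsed seconds; these normally finish in well under a second).
ps -eo pid,etimes,args | awk '/runc/ && / state / && $2 > 600 {print}'
# After confirming a PID from the list is genuinely stuck, kill it.
kill -9 <PID>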
Output of docker version:
Client:
Version: 18.09.8
API version: 1.39
Go version: go1.13beta1
Git commit: 0dd43dd
Built: Fri Jul 26 03:04:01 2019
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.09.8
API version: 1.39 (minimum version 1.12)
Go version: go1.13beta1
Git commit: 0dd43dd
Built: Thu Jul 25 00:00:00 2019
OS/Arch: linux/amd64
Experimental: true
Output of docker info:
Containers: 58
Running: 41
Paused: 0
Stopped: 17
Images: 477
Server Version: 18.09.8
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: /usr/libexec/docker/docker-init
containerd version:
runc version: 96f6022b37cbe12b26c9ad33a24677bec72a9cc3
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
seccomp
Profile: default
selinux
Kernel Version: 5.5.5-200.fc31.x86_64
Operating System: Fedora CoreOS 31.20200223.3.0
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 157.4GiB
Name: ip-172-16-195-194
ID: ZT74:5TEV:ZEH5:RH5Z:KYNG:ZW27:6YR3:36LK:P7CF:VN3Z:UUVN:MWMH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.): Servers are EC2 instances (m4.10xlarge) running Fedora CoreOS 31.20200223.3.0 and linux kernel 5.5.5-200.fc31.x86_64
List of hung runc processes on one of the nodes:
ps -ef | grep -v containerd-shim | grep runc
root 172466 906146 0 Apr12 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/436a87adfc7ce158253dc96385fc8c5e3f8db3fcfffee30c73deba4b2437a3d5/log.json --log-format json --systemd-cgroup state 436a87adfc7ce158253dc96385fc8c5e3f8db3fcfffee30c73deba4b2437a3d5
core 177996 4160065 0 19:42 pts/1 00:00:00 grep --color=auto runc
root 417556 909515 0 Apr07 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/35f4ff809b892acbdb5e8d64449a631ce1e07426b55ab56f5035c69c24526425/log.json --log-format json --systemd-cgroup state 35f4ff809b892acbdb5e8d64449a631ce1e07426b55ab56f5035c69c24526425
root 1548610 285369 0 Apr08 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/afef68e5ee102a6f62443e88a4ad24748ea9a001b8aefbbef8024c2de44c20b7/log.json --log-format json --systemd-cgroup state afef68e5ee102a6f62443e88a4ad24748ea9a001b8aefbbef8024c2de44c20b7
root 2045855 283980 0 Apr11 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/87ee777b30bea3feca0b3f97a7b3dff3eb446b40dacc02befb5c56b696f52760/log.json --log-format json --systemd-cgroup state 87ee777b30bea3feca0b3f97a7b3dff3eb446b40dacc02befb5c56b696f52760
root 2449133 286976 0 Apr07 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/f1a47caae6655765958f2dbbb644feda55c53c37958a8bbac6babc29cbd2a5e2/log.json --log-format json --systemd-cgroup state f1a47caae6655765958f2dbbb644feda55c53c37958a8bbac6babc29cbd2a5e2
root 2522488 908218 0 Apr09 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/de231ec23653d3589e03165cbd41bfe12116445c674c715b7aad8d9156b15e2f/log.json --log-format json --systemd-cgroup state de231ec23653d3589e03165cbd41bfe12116445c674c715b7aad8d9156b15e2f
root 2969519 284372 0 Apr09 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/7df38dd9f1fd3447926bc7413bfb08a20852cf16bb71b3daeb2665c0779eef68/log.json --log-format json --systemd-cgroup state 7df38dd9f1fd3447926bc7413bfb08a20852cf16bb71b3daeb2665c0779eef68
root 3067508 286131 0 11:09 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/c6741b012b9f3fd1e1be4a7b26e50b0a202f3d1dca3037ce9fe6c6c988c593a7/log.json --log-format json --systemd-cgroup state c6741b012b9f3fd1e1be4a7b26e50b0a202f3d1dca3037ce9fe6c6c988c593a7
root 3102983 284910 0 Apr09 ? 00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/304a4f1267856fbde5dae2773f0e7aabb6912d4c5d9f53b64891e0d1484dbce3/log.json --log-format json --systemd-cgroup state 304a4f1267856fbde5dae2773f0e7aabb6912d4c5d9f53b64891e0d1484dbce3
strace on the hung process shows it is blocked in a futex wait (FUTEX_WAIT_PRIVATE):
strace -p 172466
strace: Process 172466 attached
futex(0x55db073239a0, FUTEX_WAIT_PRIVATE, 0, NULL
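The futex wait suggests the Go runtime inside runc is blocked. Before killing a stuck process it may be worth capturing a goroutine dump: Go binaries print all goroutine stacks when they receive SIGQUIT (with the default GOTRACEBACK setting) and then exit, so this doubles as the kill step. Where the dump lands depends on how stdio was wired up; the parent containerd-shim or the daemon/journal logs are the places to look. A sketch, using the PID from the strace above (unit names may differ per distro):
# Dump goroutine stacks and terminate the stuck runc process
kill -QUIT 172466
# Look for the stack trace in the docker/containerd logs
journalctl -u docker -u containerd --since "10 min ago"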
docker shows the container as running:
docker ps | grep 436a87adfc7c
436a87adfc7c busybox "sleep infinity" 7 days ago Up 7 days k8s_probe-test_probe-test-74c994498-6q8mm_default_5ab87edf-b94f-43a4-ad5a-8da7493fd42a_0
but docker inspect 436a87adfc7c hangs.
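A quick way to see which containers on a node are affected, without letting the shell hang on each one (a sketch; the 5-second timeout is arbitrary):
# Report containers whose inspect call does not return within 5 seconds
for id in $(docker ps -aq); do
  timeout 5 docker inspect "$id" > /dev/null || echo "inspect hung or failed: $id"
done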
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 4
- Comments: 23 (6 by maintainers)
@cpuguy83 @thaJeztah
[ALPM] upgraded containerd (1.3.4-2 -> 1.4.1-1)
It worked! I upgraded containerd to 1.4.1 and now everything is running stable again. Thanks! I swear ninjas come and change what I type…
probably meant v1.4.0 here; fix for that was in v1.4.1
@cobrafast I think I am in the same situation.
Server:
Containers: 16
Running: 14
Paused: 0
Stopped: 2
Images: 155
Server Version: 19.03.12-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 5.4.63-1-lts
Operating System: Arch Linux
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.579GiB
Name: ***
ID: ***
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: ***
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Trouble started for me after August 29th, when I did an upgrade which included this: [ALPM] upgraded containerd (1.3.4-2 -> 1.4.0-2)
So today I downgraded containerd back to 1.3.4-2 and everything seems to run stable now. Hopefully this is fixed in a future update.
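For anyone on Arch wanting to do the same downgrade, the previously installed package is usually still in pacman's cache (the exact filename depends on what is cached locally):
# Reinstall the cached 1.3.4-2 package and restart the daemons
sudo pacman -U /var/cache/pacman/pkg/containerd-1.3.4-2-x86_64.pkg.tar.*
sudo systemctl restart containerd docker
Note that restarting docker will restart running containers unless live-restore is enabled.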
@MannuSD you look to be running outdated versions of docker, containerd and runc; I recall there were a couple of fixes around health-checks in recent versions
@cobrafast @robertalpha You need to upgrade to containerd 1.4.1 or downgrade to containerd 1.3.7. 1.4.0 has a bug in the v1 shim, which is the only shim Docker uses.
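To confirm which versions a host is actually running before choosing between the upgrade and the downgrade, the standard version flags are enough:
containerd --version
runc --version
docker version --format '{{.Server.Version}}'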
I too think I am affected by this problem (though apparently on different versions than previously reported), as I also see containerd-shim processes stuck on futex(0x55c0263fe388, FUTEX_WAIT_PRIVATE, 0, NULL. All containers with healthchecks become "unhealthy" after a couple of hours even though the app running inside is still working fine. Stopping containers fails because the containerd-shim process won't exit and needs to be killed manually.
Edit: Recently performed package upgrades that may be relevant, as the problem only started manifesting a few weeks ago for me: