moby: runc does not terminate causing containerd-shim to hang in docker 18.09.8

Description

runc sometimes hangs and has to be killed for the rest of the system to keep working properly. We run Kubernetes, so the most noticeable symptom for us is that kubelet on the host starts reporting PLEG timeouts and the k8s node status keeps flapping between NotReady and Ready. It appears the containerd-shim responsible for the runc process and its container stops responding.

We can still interact with docker for the most part, and I don’t believe we see issues other than kubelet being unable to report container events. docker ps shows the container ID with status created, but docker inspect on that container hangs.

Steps to reproduce the issue: We can’t reproduce it reliably, but it happens a couple of times a day on our k8s cluster with 50+ nodes.

Describe the results you received: runc does not terminate, and docker inspect <container_id> hangs. Kubelet starts getting PLEG timeouts and keeps switching between the NotReady and Ready states.

Describe the results you expected: runc should not hang. Even if it does, containerd-shim should not hang; ideally it could kill the hung runc process.

Additional information you deem important (e.g. issue happens only occasionally): The issue happens occasionally, but killing the hung runc process restores the system.
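
For what it’s worth, a rough sketch of that manual recovery (the age threshold and the hard kill are our own judgment calls, not anything official):

# List lingering runc invocations with their elapsed time in seconds;
# a healthy `runc ... state` call normally returns almost immediately.
ps -eo pid,etimes,args | grep -v containerd-shim | grep '[r]unc .* state'

# If one has been sitting there for hours, killing it unblocks the
# containerd-shim / dockerd call chain again.
kill -9 <pid_of_hung_runc>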

Output of docker version:

Client:
 Version:           18.09.8
 API version:       1.39
 Go version:        go1.13beta1
 Git commit:        0dd43dd
 Built:             Fri Jul 26 03:04:01 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.8
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.13beta1
  Git commit:       0dd43dd
  Built:            Thu Jul 25 00:00:00 2019
  OS/Arch:          linux/amd64
  Experimental:     true

Output of docker info:

Containers: 58
 Running: 41
 Paused: 0
 Stopped: 17
Images: 477
Server Version: 18.09.8
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: /usr/libexec/docker/docker-init
containerd version:
runc version: 96f6022b37cbe12b26c9ad33a24677bec72a9cc3
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 5.5.5-200.fc31.x86_64
Operating System: Fedora CoreOS 31.20200223.3.0
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 157.4GiB
Name: ip-172-16-195-194
ID: ZT74:5TEV:ZEH5:RH5Z:KYNG:ZW27:6YR3:36LK:P7CF:VN3Z:UUVN:MWMH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.): Servers are EC2 instances (m4.10xlarge) running Fedora CoreOS 31.20200223.3.0 and linux kernel 5.5.5-200.fc31.x86_64

List of hung runc processes on one of the nodes:

ps -ef | grep -v containerd-shim | grep runc
root      172466  906146  0 Apr12 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/436a87adfc7ce158253dc96385fc8c5e3f8db3fcfffee30c73deba4b2437a3d5/log.json --log-format json --systemd-cgroup state 436a87adfc7ce158253dc96385fc8c5e3f8db3fcfffee30c73deba4b2437a3d5
core      177996 4160065  0 19:42 pts/1    00:00:00 grep --color=auto runc
root      417556  909515  0 Apr07 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/35f4ff809b892acbdb5e8d64449a631ce1e07426b55ab56f5035c69c24526425/log.json --log-format json --systemd-cgroup state 35f4ff809b892acbdb5e8d64449a631ce1e07426b55ab56f5035c69c24526425
root     1548610  285369  0 Apr08 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/afef68e5ee102a6f62443e88a4ad24748ea9a001b8aefbbef8024c2de44c20b7/log.json --log-format json --systemd-cgroup state afef68e5ee102a6f62443e88a4ad24748ea9a001b8aefbbef8024c2de44c20b7
root     2045855  283980  0 Apr11 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/87ee777b30bea3feca0b3f97a7b3dff3eb446b40dacc02befb5c56b696f52760/log.json --log-format json --systemd-cgroup state 87ee777b30bea3feca0b3f97a7b3dff3eb446b40dacc02befb5c56b696f52760
root     2449133  286976  0 Apr07 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/f1a47caae6655765958f2dbbb644feda55c53c37958a8bbac6babc29cbd2a5e2/log.json --log-format json --systemd-cgroup state f1a47caae6655765958f2dbbb644feda55c53c37958a8bbac6babc29cbd2a5e2
root     2522488  908218  0 Apr09 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/de231ec23653d3589e03165cbd41bfe12116445c674c715b7aad8d9156b15e2f/log.json --log-format json --systemd-cgroup state de231ec23653d3589e03165cbd41bfe12116445c674c715b7aad8d9156b15e2f
root     2969519  284372  0 Apr09 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/7df38dd9f1fd3447926bc7413bfb08a20852cf16bb71b3daeb2665c0779eef68/log.json --log-format json --systemd-cgroup state 7df38dd9f1fd3447926bc7413bfb08a20852cf16bb71b3daeb2665c0779eef68
root     3067508  286131  0 11:09 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/c6741b012b9f3fd1e1be4a7b26e50b0a202f3d1dca3037ce9fe6c6c988c593a7/log.json --log-format json --systemd-cgroup state c6741b012b9f3fd1e1be4a7b26e50b0a202f3d1dca3037ce9fe6c6c988c593a7
root     3102983  284910  0 Apr09 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/304a4f1267856fbde5dae2773f0e7aabb6912d4c5d9f53b64891e0d1484dbce3/log.json --log-format json --systemd-cgroup state 304a4f1267856fbde5dae2773f0e7aabb6912d4c5d9f53b64891e0d1484dbce3
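
The parent PID column in that listing points at the owning containerd-shim, so matching a stuck runc to its shim and container is just more ps work (a generic sketch, using only what is already shown above):

# 906146 is the PPID of the first hung runc (PID 172466) above,
# i.e. the containerd-shim that spawned it.
ps -fp 906146

# The container ID is already part of the runc command line
# (436a87adfc7c… for that first entry), so it can be cross-checked with docker ps.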

strace on the hung process shows it is blocked in a futex wait (FUTEX_WAIT_PRIVATE):

strace -p 172466
strace: Process 172466 attached
futex(0x55db073239a0, FUTEX_WAIT_PRIVATE, 0, NULL
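
Since runc is a Go binary, a goroutine dump usually says more than the futex trace; this relies on standard Go runtime behaviour (SIGQUIT prints all goroutine stacks and then terminates the process), and the dump only ends up somewhere useful if the process’s stderr is still being captured, e.g. by the shim or daemon logs:

# Ask the Go runtime to dump all goroutine stacks; note this also kills
# the hung process, so it doubles as the manual recovery step.
kill -QUIT 172466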

docker shows the container as running:

docker ps | grep 436a87adfc7c
436a87adfc7c        busybox                                                                         "sleep infinity"         7 days ago          Up 7 days                               k8s_probe-test_probe-test-74c994498-6q8mm_default_5ab87edf-b94f-43a4-ad5a-8da7493fd42a_0

but docker inspect 436a87adfc7c hangs
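
As a defensive measure for monitoring scripts (not a fix), wrapping the call in coreutils timeout at least keeps them from wedging on an affected container:

# Fails after 10 seconds instead of blocking forever when the daemon
# cannot answer for this container.
timeout 10 docker inspect --format '{{.State.Status}}' 436a87adfc7c || echo "inspect timed out"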

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 4
  • Comments: 23 (6 by maintainers)

Most upvoted comments

@cpuguy83 @thaJeztah

[ALPM] upgraded containerd (1.3.4-2 -> 1.4.1-1)

It worked! I upgraded containerd to 1.4.1 and now everything is running stable again. Thanks!

I swear ninjas come and change what I type…

1.4.1 has a bug in the v1 shim, which is the only shim Docker uses

probably meant v1.4.0 here; fix for that was in v1.4.1

@cobrafast I think I am in the same situation.

Server:
 Containers: 16
  Running: 14
  Paused: 0
  Stopped: 2
 Images: 155
 Server Version: 19.03.12-ce
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.4.63-1-lts
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.579GiB
 Name: ***
 ID: ***
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: ***
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

The trouble started for me after the 29th of August, when I did an upgrade that included this: [ALPM] upgraded containerd (1.3.4-2 -> 1.4.0-2)

So today I downgraded containerd back to 1.3.4-2 and everything seems to run stable now. Hopefully this is fixed in a future update.
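
For reference, the downgrade on Arch can be done straight from the local package cache if the old version is still there (the exact filename below is illustrative, check /var/cache/pacman/pkg):

# Reinstall the previously installed containerd package from the cache.
sudo pacman -U /var/cache/pacman/pkg/containerd-1.3.4-2-x86_64.pkg.tar.zst
# Restart the services so new containers pick up the downgraded shim;
# already-running shims keep whatever binary they were started with.
sudo systemctl restart containerd docker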

@MannuSD you look to be running outdated versions of docker, containerd and runc; I recall there were a couple of fixes around health-checks in recent versions

  • current versions: containerd 1.6.24, runc 1.1.7
  • current version of docker is 24.0.6, but if you have a specific reason to run a 23.0.x version, at least update to the latest patch release (a quick way to check what is installed is sketched below)
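
A quick way to confirm what a host is actually running (standard CLI flags, shown here just for convenience):

docker version --format '{{.Server.Version}}'
containerd --version
runc --version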

@cobrafast @robertalpha Need to upgrade to containerd 1.4.1 or downgrade to containerd 1.3.7. 1.4.1 has a bug in the v1 shim, which is the only shim Docker uses.

I think I am affected by this problem too (though on apparently different versions than previously reported), as I also see containerd-shim processes stuck on futex(0x55c0263fe388, FUTEX_WAIT_PRIVATE, 0, NULL. All containers with healthchecks become “unhealthy” after a couple of hours even though the app running inside is still working fine. Stopping the containers fails because the containerd-shim process won’t exit and needs to be killed manually.

Server:
 Containers: 11
  Running: 11
  Paused: 0
  Stopped: 0
 Images: 201
 Server Version: 19.03.12-ce
 Storage Driver: btrfs
  Build Version: Btrfs v5.7 
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.8.5-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 62.8GiB
 Name: ***
 ID: ***
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Edit: Recently performed package upgrades that may be relevant, as the problem only started manifesting a few weeks ago for me:

[2020-07-02T12:31:13+0200] [ALPM] upgraded docker-compose (1.26.0-1 -> 1.26.1-1)
[2020-07-03T11:26:19+0200] [ALPM] upgraded docker-compose (1.26.1-1 -> 1.26.2-1)
[2020-07-03T11:26:19+0200] [ALPM] upgraded runc (1.0.0rc90-1 -> 1.0.0rc91-1)
[2020-07-23T13:18:35+0200] [ALPM] upgraded docker (1:19.03.12-1 -> 1:19.03.12-2)
[2020-08-07T14:49:57+0200] [ALPM] upgraded runc (1.0.0rc91-1 -> 1.0.0rc92-1)
[2020-08-19T19:40:26+0200] [ALPM] upgraded containerd (1.3.4-2 -> 1.4.0-2)