Flatcar: Upgrade from Stable 2605.8.0 to 2605.9.0 breaks Kubernetes running with docker container runtime
The current stable release 2605.9.0 has broken all of our clusters (AWS / GCP / bare metal, dev and prod alike). The issue we observe is that some nodes become NotReady after a while, and traffic going to containers on those nodes times out.
Moreover, it seems that runc processes are stuck in D state (uninterruptible sleep), and a remedy to make the nodes responsive again is to kill those processes.
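For anyone wanting a quick way to spot this, the one-liner below is a hypothetical check (not taken from our runbooks): list the runc processes and look for "D" in the STAT column.
ps -eo pid,stat,args | grep '[r]unc'
# "D"/"Ds" in STAT means uninterruptible sleep; kill -9 on the stuck runc
# exec/update PIDs is what brings the node back (see the remedy above).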
We are still in the midst of debugging and trying to find the root cause, but in the meantime we thought we would raise an issue detailing as much as we know, and update it as we go. This might also be useful to anyone who is experiencing the same problem but hasn’t managed to write it up as an issue yet (like us).
Impact
Kubernetes cluster nodes enter a NotReady state and containers cannot receive traffic, so after a while the cluster eventually breaks down.
Environment and steps to reproduce
- Run a Kubernetes cluster on Flatcar Container Linux by Kinvolk 2605.9.0 (Oklo) hosts with Kubernetes version v1.19.2.
- Wait until containerd gets stuck and nodes break (a quick way to spot this is shown after this list).
- We haven’t tested whether an aggressive rotation of pods can bring this to the surface earlier.
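A minimal check, assuming kubectl access to the cluster (the command is generic, not from the original report):
kubectl get nodes -o wide   # NotReady shows up in STATUS; OS-IMAGE confirms which nodes run 2605.9.0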
Expected behavior
Everything should run as usual after the upgrade.
Additional information
Logs from kubelet: At some point we notice that the kubelet starts complaining about the containerd runtime not responding:
Dec 10 08:49:34 ip-10-66-23-28 kubelet[1295931]: E1210 08:49:34.050798 1295931 kubelet.go:1765] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg was last seen active 3m1.826056813s ago; threshold is 3m0s]
Also we can see a lot of timeouts:
Dec 09 20:28:36 ip-10-66-23-47 kubelet[747034]: E1209 20:28:36.860894 747034 remote_runtime.go:357] UpdateContainerResources "fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context
deadline exceeded
Dec 09 20:28:36 ip-10-66-23-47 kubelet[747034]: E1209 20:28:36.860949 747034 cpu_manager.go:418] [cpumanager] reconcileState: failed to update container (pod: calico-node-dmrmn, container: calico-node, container id: fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15
347ef4304, cpuset: "0-1", error: rpc error: code = DeadlineExceeded desc = context deadline exceeded)
Dec 09 20:28:37 ip-10-66-23-47 kubelet[747034]: E1209 20:28:37.816557 747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-live -bird-live' from runtime service failed: rpc error: code = Deadline
Exceeded desc = context deadline exceeded
Dec 09 20:28:39 ip-10-66-23-47 kubelet[747034]: E1209 20:28:39.937142 747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-ready -bird-ready' from runtime service failed: rpc error: code = Unknow
n desc = operation timeout: context deadline exceeded
Dec 09 20:28:47 ip-10-66-23-47 kubelet[747034]: W1209 20:28:47.221936 747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: graylog-5f7f9ff465-bfdsx, container id: fb82db45342e3b4ba8ae0e9f421d63fc06d2ffba2f897fcdf6e1f0d37854ce0f)
Dec 09 20:28:48 ip-10-66-23-47 kubelet[747034]: W1209 20:28:48.009743 747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: vault-1, container id: bd910527ffde5f1bb6cd1c212653d6af6ffdf39b8185b7df3e93fb22d045f310)
Dec 09 20:28:48 ip-10-66-23-47 kubelet[747034]: W1209 20:28:48.261703 747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: calico-node-dmrmn, container id: bba7f01e56fc3f557ddef9217aadfff5e20d4e5239b98cbb2073a5e739044c97)
Dec 09 20:30:37 ip-10-66-23-47 kubelet[747034]: E1209 20:30:37.817316 747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-live -bird-live' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Dec 09 20:30:39 ip-10-66-23-47 kubelet[747034]: E1209 20:30:39.938398 747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-ready -bird-ready' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Dec 09 20:30:48 ip-10-66-23-47 kubelet[747034]: E1209 20:30:48.261900 747034 remote_runtime.go:357] UpdateContainerResources "fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
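These excerpts come from the systemd journal; assuming the kubelet runs as a unit called kubelet.service (as it does for us), something along these lines pulls out the relevant errors:
journalctl -u kubelet.service --since "1 hour ago" | grep -E 'PLEG is not|DeadlineExceeded'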
Looking at the processes of a hung container we can see:
root 10263 0.0 0.0 4368 708 ? Ds Dec09 0:00 | \_ /usr/local/bin/runsvdir -P /etc/service/enabled
root 10878 0.0 0.0 4216 808 ? Ds Dec09 0:00 | | \_ runsv felix
root 10884 0.7 0.8 1578988 63452 ? Sl Dec09 8:12 | | | \_ calico-node -felix
root 10879 0.0 0.0 4216 836 ? Ds Dec09 0:00 | | \_ runsv bird
root 11318 0.0 0.0 884 656 ? D Dec09 0:08 | | | \_ bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
root 10880 0.0 0.0 4216 820 ? Ds Dec09 0:00 | | \_ runsv monitor-addresses
root 10885 0.0 0.6 1136340 53668 ? Sl Dec09 0:03 | | | \_ calico-node -monitor-addresses
root 10881 0.0 0.0 4216 740 ? Ds Dec09 0:00 | | \_ runsv confd
root 10886 0.0 0.7 1505000 56496 ? Sl Dec09 0:14 | | | \_ calico-node -confd
root 10882 0.0 0.0 4216 832 ? Ds Dec09 0:00 | | \_ runsv bird6
root 11319 0.0 0.0 744 8 ? D Dec09 0:05 | | | \_ bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
root 10883 0.0 0.0 4216 852 ? Ds Dec09 0:00 | | \_ runsv allocate-tunnel-addrs
root 10888 0.0 0.5 1210072 42616 ? Sl Dec09 0:02 | | \_ calico-node -allocate-tunnel-addrs
root 1111612 0.0 0.2 1232712 22308 ? Sl Dec09 0:00 | \_ runc --root /var/run/docker/runtime-runc/moby --log /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/log.json --log-format json --systemd-cgroup exec --process /tmp/runc-process535500753 --detach --pid-file /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/ff1812165d7c7e84fc1e4dce4e86358c4f244430fea1d0c8445a1123f9e23fcb.pid bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49
root 1111635 0.0 0.2 1085760 18012 ? Dsl Dec09 0:00 | | \_ runc init
root 1111628 8.7 0.2 1233864 21296 ? Sl Dec09 45:51 | \_ runc --root /var/run/docker/runtime-runc/moby --log /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/log.json --log-format json --systemd-cgroup update --resources - bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49
and killing the runc processes listed above seems to make the node function properly again.
Note that we are mainly observing this on calico-node pods (as is also visible from the kubelet logs above), but we will try to reproduce it with other pods.
We also found https://github.com/moby/moby/issues/40817, which describes somewhat similar behaviour (even though it is quite old and the latest stable Flatcar comes with a containerd version > 1.4.0, which is the version blamed there).
Some more findings.
Firstly, the issue can be reproduced with a very simple exec livenessProbe:
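A minimal sketch of such a probe follows; the image, names and timings here are illustrative rather than our exact manifest:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-repro
spec:
  containers:
  - name: probe
    image: busybox:1.32
    command: ["sleep", "86400"]
    livenessProbe:
      exec:
        command: ["sh", "-c", "true"]
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 2
EOF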
I wasn’t able to reproduce it with an equivalent docker healthcheck though. At least not within a day of soaking it across a number of nodes.
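The docker-only soak used docker's built-in healthcheck flags, roughly like this (again a sketch, not the exact command we ran):
docker run -d --name healthcheck-soak \
  --health-cmd "true" \
  --health-interval 5s \
  --health-timeout 2s \
  busybox:1.32 sleep 86400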
Secondly, I tried running 2605.9.0 with binaries from the official docker 19.03.14 release as laid out here: https://kinvolk.io/docs/flatcar-container-linux/latest/container-runtimes/use-a-custom-docker-or-containerd-version/.
So far the runc hang hasn’t been reproduced on those nodes.
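The general shape of that doc's approach is to put the upstream static binaries on the host and point the systemd unit at them via a drop-in; very roughly, and with illustrative paths (follow the linked doc for the real steps):
curl -fsSL https://download.docker.com/linux/static/stable/x86_64/docker-19.03.14.tgz -o /tmp/docker.tgz
mkdir -p /opt/docker-19.03.14 && tar -xzf /tmp/docker.tgz -C /opt/docker-19.03.14 --strip-components=1
mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' > /etc/systemd/system/docker.service.d/10-custom-dockerd.conf
[Service]
ExecStart=
ExecStart=/opt/docker-19.03.14/dockerd --host=fd://
EOF
systemctl daemon-reload && systemctl restart docker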
There are quite a few differences between what ships with Flatcar and what comes bundled with the official Docker release.
Flatcar:
Docker release:
Perhaps notably, the Go version is different: 1.13.5 vs 1.15.5. There are some comments on the containerd issue about avoiding Go 1.15: https://github.com/containerd/containerd/issues/4509#issuecomment-687389617. Perhaps this is relevant?
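One way to confirm what a given node is actually running: docker version prints the Go toolchain used to build both the client and the daemon.
docker version | grep -i 'go version'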
Finally, the issue doesn’t reproduce for us when we use containerd as the runtime for Kubernetes as laid out here: https://kinvolk.io/docs/flatcar-container-linux/latest/container-runtimes/switching-from-docker-to-containerd-for-kubernetes/.
As we’ll eventually have to switch to containerd anyway, we’re going to push forward with that as a workaround.
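For reference, on Kubernetes v1.19 the kubelet side of that switch is just the remote runtime flags pointing at containerd's default socket; how they get passed depends on how kubelet is launched (the variable name below is the kubeadm convention, adjust for your setup):
KUBELET_EXTRA_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock"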
The linked containerd issue has a number of strategies for getting more information about what’s going on. See https://github.com/containerd/containerd/issues/4509#issuecomment-689671931 in particular, which explains how to create goroutine dumps as well as how to configure containerd to produce debug output.
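In short (the details are in that comment): both containerd and dockerd dump their goroutine stacks on SIGUSR1, and containerd's log level can be raised in its config, e.g.:
kill -USR1 "$(pidof containerd)"     # stack dump shows up in containerd's logs
kill -USR1 "$(pidof dockerd)"        # dockerd logs where it wrote its goroutine-stacks-*.log file
journalctl -u containerd.service -u docker.service --since "10 min ago"
cat <<'EOF' >> /etc/containerd/config.toml
[debug]
  level = "debug"
EOF
systemctl restart containerd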
I’m seeing the shim reaped / shim started messages repeating in my containerd logs too. Our prod API mysteriously stops working until docker is restarted, then it works for around 5-24 hours. We’re not even using Kubernetes in front of it.
Thanks for the response. I wonder if this is really the same issue, because the above report is about self_freezing being turned on, while you saw a hung task in dmesg.
Another update.
We started getting the same runc hangs on exec probes with containerd after enabling systemd cgroups:
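Concretely, "enabling systemd cgroups" on the containerd nodes means the SystemdCgroup option on the runc runtime in the CRI plugin config; a sketch of the relevant containerd 1.4 keys (not our full config):
cat <<'EOF' >> /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
EOF
systemctl restart containerd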
We were previously enabling the systemd cgroup driver in docker.
After discovering this in containerd, I tried to reproduce the issue with systemd cgroups disabled in docker and I couldn’t. So it seems like systemd cgroups are significant here.
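For completeness, the docker-side systemd cgroup driver mentioned above is the standard daemon.json setting, i.e. roughly:
cat <<'EOF' > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
# (in practice merged into the existing daemon.json rather than replacing it)
systemctl restart docker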