Flatcar: Upgrade from Stable 2605.8.0 to 2605.9.0 breaks Kubernetes running with docker container runtime

The current stable release, 2605.9.0, has broken all our clusters (AWS / GCP / bare metal, both dev and prod). The issue we observe is that some nodes become NotReady after a while and traffic to containers on those nodes times out. Moreover, runc processes on the affected nodes appear to be stuck in D state (uninterruptible sleep), and killing those processes makes the nodes responsive again.
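
For anyone who wants to check their own nodes, spotting the stuck processes is plain ps/awk (nothing here is specific to our setup):

# List runc processes in uninterruptible sleep (state D), together with the
# kernel wait channel they are blocked on.
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/ && $0 ~ /runc/'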

We are still in the middle of debugging and trying to find the root cause, but in the meantime we wanted to raise an issue detailing what we know so far, and we will update it as we go. This might also be useful to anyone experiencing the same problem who, like us, hasn't yet managed to write it up as an issue.

Impact

Kubernetes cluster nodes enter a NotReady state and containers on them stop receiving traffic, so after a while the cluster eventually breaks down.

Environment and steps to reproduce

  • Run a Kubernetes cluster (version v1.19.2) on Flatcar Container Linux by Kinvolk 2605.9.0 (Oklo) hosts.
  • Wait until containerd gets stuck and nodes break (a quick way to spot affected nodes is sketched after this list).
  • We haven’t tested if an aggressive rotation of pods can bring this to the surface earlier.
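
Spotting an affected node is just standard kubectl usage, roughly:

# List nodes that have dropped out of Ready.
kubectl get nodes | grep NotReady
# Inspect the conditions of a suspect node (<node-name> is a placeholder).
kubectl describe node <node-name> | grep -A 10 'Conditions:'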

Expected behavior

Everything should run as usual after the upgrade.

Additional information

Logs from kubelet: at some point we notice that the kubelet starts complaining about the container runtime not responding:

Dec 10 08:49:34 ip-10-66-23-28 kubelet[1295931]: E1210 08:49:34.050798 1295931 kubelet.go:1765] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg was last seen active 3m1.826056813s ago; threshold is 3m0s]

Also we can see a lot of timeouts:

Dec 09 20:28:36 ip-10-66-23-47 kubelet[747034]: E1209 20:28:36.860894  747034 remote_runtime.go:357] UpdateContainerResources "fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 09 20:28:36 ip-10-66-23-47 kubelet[747034]: E1209 20:28:36.860949  747034 cpu_manager.go:418] [cpumanager] reconcileState: failed to update container (pod: calico-node-dmrmn, container: calico-node, container id: fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304, cpuset: "0-1", error: rpc error: code = DeadlineExceeded desc = context deadline exceeded)
Dec 09 20:28:37 ip-10-66-23-47 kubelet[747034]: E1209 20:28:37.816557  747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-live -bird-live' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 09 20:28:39 ip-10-66-23-47 kubelet[747034]: E1209 20:28:39.937142  747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-ready -bird-ready' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Dec 09 20:28:47 ip-10-66-23-47 kubelet[747034]: W1209 20:28:47.221936  747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: graylog-5f7f9ff465-bfdsx, container id: fb82db45342e3b4ba8ae0e9f421d63fc06d2ffba2f897fcdf6e1f0d37854ce0f)
Dec 09 20:28:48 ip-10-66-23-47 kubelet[747034]: W1209 20:28:48.009743  747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: vault-1, container id: bd910527ffde5f1bb6cd1c212653d6af6ffdf39b8185b7df3e93fb22d045f310)
Dec 09 20:28:48 ip-10-66-23-47 kubelet[747034]: W1209 20:28:48.261703  747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: calico-node-dmrmn, container id: bba7f01e56fc3f557ddef9217aadfff5e20d4e5239b98cbb2073a5e739044c97)
Dec 09 20:30:37 ip-10-66-23-47 kubelet[747034]: E1209 20:30:37.817316  747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-live -bird-live' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Dec 09 20:30:39 ip-10-66-23-47 kubelet[747034]: E1209 20:30:39.938398  747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-ready -bird-ready' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Dec 09 20:30:48 ip-10-66-23-47 kubelet[747034]: E1209 20:30:48.261900  747034 remote_runtime.go:357] UpdateContainerResources "fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
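
Errors like these can be pulled out of the journal on an affected node with something along these lines (assuming kubelet runs as a systemd unit named kubelet):

# Surface the runtime-related kubelet errors from the last hour.
journalctl -u kubelet --since "1 hour ago" --no-pager | grep -E 'PLEG is not healthy|DeadlineExceeded|ExecSync'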

Looking at the processes for a hung container, we can see:

root       10263  0.0  0.0   4368   708 ?        Ds   Dec09   0:00  |   \_ /usr/local/bin/runsvdir -P /etc/service/enabled
root       10878  0.0  0.0   4216   808 ?        Ds   Dec09   0:00  |   |   \_ runsv felix
root       10884  0.7  0.8 1578988 63452 ?       Sl   Dec09   8:12  |   |   |   \_ calico-node -felix
root       10879  0.0  0.0   4216   836 ?        Ds   Dec09   0:00  |   |   \_ runsv bird
root       11318  0.0  0.0    884   656 ?        D    Dec09   0:08  |   |   |   \_ bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
root       10880  0.0  0.0   4216   820 ?        Ds   Dec09   0:00  |   |   \_ runsv monitor-addresses
root       10885  0.0  0.6 1136340 53668 ?       Sl   Dec09   0:03  |   |   |   \_ calico-node -monitor-addresses
root       10881  0.0  0.0   4216   740 ?        Ds   Dec09   0:00  |   |   \_ runsv confd
root       10886  0.0  0.7 1505000 56496 ?       Sl   Dec09   0:14  |   |   |   \_ calico-node -confd
root       10882  0.0  0.0   4216   832 ?        Ds   Dec09   0:00  |   |   \_ runsv bird6
root       11319  0.0  0.0    744     8 ?        D    Dec09   0:05  |   |   |   \_ bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
root       10883  0.0  0.0   4216   852 ?        Ds   Dec09   0:00  |   |   \_ runsv allocate-tunnel-addrs
root       10888  0.0  0.5 1210072 42616 ?       Sl   Dec09   0:02  |   |       \_ calico-node -allocate-tunnel-addrs
root     1111612  0.0  0.2 1232712 22308 ?       Sl   Dec09   0:00  |   \_ runc --root /var/run/docker/runtime-runc/moby --log /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/log.json --log-format json --systemd-cgroup exec --process /tmp/runc-process535500753 --detach --pid-file /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/ff1812165d7c7e84fc1e4dce4e86358c4f244430fea1d0c8445a1123f9e23fcb.pid bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49
root     1111635  0.0  0.2 1085760 18012 ?       Dsl  Dec09   0:00  |   |   \_ runc init
root     1111628  8.7  0.2 1233864 21296 ?       Sl   Dec09  45:51  |   \_ runc --root /var/run/docker/runtime-runc/moby --log /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/log.json --log-format json --systemd-cgroup update --resources - bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49

Killing the runc processes shown above seems to make the node function properly again.
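
To see where a stuck process is actually blocked in the kernel before killing it, generic procfs inspection helps (the pid below is just the runc init from the listing above):

# Show the state and kernel stack of a stuck runc process;
# reading /proc/<pid>/stack requires root.
PID=1111635
grep -E '^(Name|State)' /proc/$PID/status
sudo cat /proc/$PID/stack
# As noted above, killing the runc processes from the listing makes the node
# responsive again.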

Note that we are mainly observing this on calico-node pods (as also visible in the kubelet logs above), but we will try to reproduce it with other pods.

We also found https://github.com/moby/moby/issues/40817, which describes somewhat similar behaviour (even though it is quite old and the latest stable Flatcar ships with a containerd version > 1.4.0, which is blamed there).

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 12
  • Comments: 18 (7 by maintainers)

Most upvoted comments

Some more findings.

Firstly, the issue can be reproduced with a very simple exec livenessProbe:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: testcase
spec:
  selector:
    matchLabels:
      app: testcase
  template:
    metadata:
      labels:
        app: testcase
    spec:
      containers:
        - name: testcase
          image: alpine:3.12
          command:
            - sh
            - -c
            - |
              touch /tmp/healthy
              while true; do sleep 86400; done
          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - cat /tmp/healthy
            initialDelaySeconds: 5
            periodSeconds: 5
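
Deploying it is plain kubectl; the manifest file name below is just ours:

# Roll out the test DaemonSet and watch the pods; on affected nodes the exec
# liveness probes eventually start timing out.
kubectl apply -f testcase-daemonset.yaml
kubectl get pods -l app=testcase -o wide -w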

I wasn’t able to reproduce it with an equivalent docker healthcheck though. At least not within a day of soaking it across a number of nodes.

$ docker run -d --name testcase --health-cmd "cat /tmp/healthy" --health-interval 1s --rm alpine:3.12 sh -c "touch /tmp/healthy; while true; do sleep 86400; done"
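
The healthcheck status of that container can be checked with a standard docker inspect:

# Show the current Docker healthcheck status of the container started above.
docker inspect --format '{{.State.Health.Status}}' testcase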

Secondly, I tried running 2605.9.0 with binaries from the official docker 19.03.14 release as laid out here: https://kinvolk.io/docs/flatcar-container-linux/latest/container-runtimes/use-a-custom-docker-or-containerd-version/.

So far the runc hang hasn’t been reproduced on those nodes.
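
A quick sanity check that a node is really running the swapped-in binaries (generic commands, nothing taken from the linked doc):

# Confirm which binaries are on the PATH and which versions the daemons report.
command -v dockerd containerd runc
docker version
containerd --version
runc --version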

There are quite a few differences between what ships with Flatcar and what comes bundled with the official Docker release.

Flatcar:

Client:
 Version:           19.03.14
 API version:       1.40
 Go version:        go1.15.5
 Git commit:        f8db474
 Built:             Tue Dec  1 19:10:28 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.14
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.15.5
  Git commit:       f8db474
  Built:            Tue Dec  1 19:10:28 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        6806845b4f638417933f68721e289c9aeda456b1
 runc:
  Version:          1.0.0-rc92+dev.docker-19.03
  GitCommit:        3d68c79de7184b0eba97946d4f478736f46bf207
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683b971d9c3ef73f284f176672c44b448662

Docker release:

Client: Docker Engine - Community
 Version:           19.03.14
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        5eb3275
 Built:             Tue Dec  1 19:14:24 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.14
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       5eb3275
  Built:            Tue Dec  1 19:21:08 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.9
  GitCommit:        ea765aba0d05254012b0b9e595e995c09186427f
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec368

Perhaps notably, the Go version is different: 1.13.15 vs 1.15.5. There are some comments on the containerd issue about avoiding Go 1.15: https://github.com/containerd/containerd/issues/4509#issuecomment-687389617. Perhaps this is relevant?

Finally, the issue doesn’t reproduce for us when we use containerd as the runtime for Kubernetes as laid out here: https://kinvolk.io/docs/flatcar-container-linux/latest/container-runtimes/switching-from-docker-to-containerd-for-kubernetes/.

As we’ll eventually have to switch to containerd anyway, we’re going to push forward with that as a workaround.
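
For anyone considering the same workaround: the switch essentially amounts to pointing the kubelet at containerd's CRI socket. The flags below are the standard ones for Kubernetes 1.19 (our summary; the linked Flatcar doc has the authoritative steps):

# Extra kubelet flags to use containerd via CRI instead of dockershim.
# KUBELET_EXTRA_ARGS is just one common place to put them (kubeadm-style setups).
KUBELET_EXTRA_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock"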

The linked containerd issue has a number of strategies for getting more information about what's going on. See https://github.com/containerd/containerd/issues/4509#issuecomment-689671931 in particular, which explains how to create goroutine dumps as well as how to make containerd emit debug output.
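
In short (our paraphrase; the linked comment is the authoritative source): both dockerd and containerd dump goroutine stacks when sent SIGUSR1, and containerd's debug logging is enabled in its config.toml. Roughly:

# Ask both daemons for goroutine stack dumps; each daemon logs where it wrote
# the dump (typically a goroutine-stacks-*.log file).
sudo kill -USR1 "$(pidof dockerd)"
sudo kill -USR1 "$(pidof containerd)"
journalctl -u docker -u containerd --since "2 minutes ago" | grep -i stack
# containerd debug logging is a config.toml setting:
#   [debug]
#     level = "debug"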

I'm seeing the "shim reaped" / "shim started" messages repeating in my containerd logs too. Our prod API mysteriously stops working until Docker is restarted, after which it works for around 5-24 hours. We're not even running Kubernetes in front of it.

Thanks for the response. I wonder if this is really the same issue, though, because the above report is about self_freezing being turned on, while you saw a hung task in dmesg.

Another update.

We started getting the same runc hangs on exec probes with containerd after enabling systemd cgroups:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

We were previously enabling the systemd cgroup driver in docker.
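
For context, enabling the systemd cgroup driver on the Docker side is usually done through daemon.json; an illustrative snippet (not quoted from our actual setup):

# Illustration only: switch Docker's cgroup driver to systemd and verify it.
# Merge this into any existing /etc/docker/daemon.json rather than overwriting it.
echo '{ "exec-opts": ["native.cgroupdriver=systemd"] }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
docker info --format '{{.CgroupDriver}}'   # prints "systemd" when active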

After discovering this with containerd, I tried to reproduce the issue in Docker with systemd cgroups disabled and couldn't. So it seems like systemd cgroups are significant here.