kubernetes: kubelet "failed to initialize top level QOS containers" on cgroup mount leaks
What happened:
Sometimes we see kubelets stuck in a restart loop, never becoming healthy again.
In the end the kubelet fails with:
May 08 11:44:36.412182 kube-node01 kubelet[6613]: F0508 11:44:36.412096 6613 kubelet.go:1380] Failed to start ContainerManager failed to initialize top level QOS containers: root container [kubepods] doesn't exist
The issue seems to be a leak of mount information from our csi-nodeplugin pod:
/proc/self/mountinfo on the host contains cgroup mount entries belonging to the pod, which should not be the case. Because of that, the kubelet tries to use the path of the leaked mount as its root cgroup, which is wrong.
So a pod/container in Terminating leaked cgroup mount information to the host, and when the kubelet was restarted some time later it was not able to recover.
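A quick way to check whether a node is affected (a minimal check, assuming the leaked mounts always show up under /run/containerd/io.containerd.runtime.v1.linux/ as in the mountinfo excerpt further below; on a healthy node this prints nothing):
# cgroup entries whose mountpoint lives under a container rootfs instead of /sys/fs/cgroup indicate the leak
grep cgroup /proc/self/mountinfo | grep '/run/containerd/io.containerd.runtime.v1.linux/'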
What you expected to happen:
Kubelet restarts and recovers.
How to reproduce it (as minimally and precisely as possible):
On a node:
# create target dummy dir
mkdir -p /run/foo/rootfs/sys/fs/cgroup/systemd
# get a container cgroup dir in /sys/fs/cgroup/systemd/kubepods
POD=$(find /sys/fs/cgroup/systemd/kubepods/ -maxdepth 1 -name "pod*" | head -n 1)
SRCDIR="$POD/$(ls -1 $POD | grep -v -E -e '^cgroup|^notify|^tasks' | head -n 1)"
mount -o bind "$SRCDIR" "/run/foo/rootfs/sys/fs/cgroup/systemd"
systemctl restart kubelet
To clean up afterwards:
umount "/run/foo/rootfs/sys/fs/cgroup/systemd"
Anything else we need to know?:
I was able to “fix” it by unmounting the leaked cgroups:
# first unmount the leaked per-controller cgroup mounts under the container rootfs (children first)
cat /proc/mounts | grep cgroup | grep containerd | awk '{print $2}' | grep -v 'cgroup$' | xargs umount
# then unmount the remaining .../rootfs/sys/fs/cgroup tmpfs mounts (the parents)
cat /proc/mounts | grep cgroup | grep containerd | awk '{print $2}' | xargs umount
- Maybe related to #49835
- Could be partially fixed, similarly to https://github.com/google/cadvisor/pull/1792/files, in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/helpers_linux.go#L198
  I already tested that, but the fix resulted in a kubelet that becomes ready yet is unable to reconcile pods:
Normal   Created  9s (x2 over 10s)  kubelet, kube-node03  Created container coredns
Warning  Failed   8s (x2 over 10s)  kubelet, kube-node03  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"cgroup\\\" to rootfs \\\"/run/containerd/io.containerd.runtime.v1.linux/k8s.io/coredns/rootfs\\\" at \\\"/sys/fs/cgroup\\\" caused \\\"stat /run/foo/rootfs/sys/fs/podf4048c8d-fe5f-42e6-8899-8c9923f99c63/coredns: no such file or directory\\\"\"": unknown
Warning  BackOff  5s (x3 over 8s)   kubelet, kube-node03  Back-off restarting failed container
In this case containerd still fails, but I think that part needs to be addressed in containerd itself.
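To illustrate the idea of such a filter: after the leak, each cgroup hierarchy may have two mountpoints, the canonical one under /sys/fs/cgroup and the leaked one under the container rootfs, and the selection only has to ignore the latter. A rough sketch of that selection in shell terms (illustration only, not the actual kubelet or cadvisor code):
# all mountpoints of the name=systemd hierarchy - two on a broken node
grep ' - cgroup ' /proc/self/mountinfo | grep 'name=systemd' | awk '{print $5}'
# keeping only mountpoints below /sys/fs/cgroup/ drops the leaked entry
grep ' - cgroup ' /proc/self/mountinfo | grep 'name=systemd' | awk '$5 ~ "^/sys/fs/cgroup/" {print $5}'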
Environment:
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.5-dirty", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"dirty", BuildDate:"2020-04-17T03:37:03Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: OpenStack
- OS (e.g. cat /etc/os-release): PRETTY_NAME="Ubuntu 19.10"
- Kernel (e.g. uname -a): Linux kube-node01 5.3.0-46-generic #38-Ubuntu SMP Fri Mar 27 17:37:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: cinder csi
- Network plugin and version (if this is a network-related bug): calico-node
- Others: we are using cgroupfs
Example content of /proc/self/mountinfo on a broken node (on healthy nodes there are no entries with /run/containerd/io.containerd.runtime.v1.linux/ in their paths):
root@kube-node01:~# cat /proc/self/mountinfo | grep cgroup
30 21 0:26 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:9 - tmpfs tmpfs ro,mode=755
31 30 0:27 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:10 - cgroup2 cgroup2 rw
32 30 0:28 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,xattr,name=systemd
35 30 0:31 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,rdma
36 30 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,cpu,cpuacct
37 30 0:33 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,freezer
38 30 0:34 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,blkio
39 30 0:35 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,pids
40 30 0:36 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,cpuset
41 30 0:37 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,memory
42 30 0:38 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,net_cls,net_prio
43 30 0:39 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,hugetlb
44 30 0:40 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:24 - cgroup cgroup rw,perf_event
45 30 0:41 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:25 - cgroup cgroup rw,devices
945 25 0:159 / /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:606 - tmpfs tmpfs rw,mode=755
979 945 0:28 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,xattr,name=systemd
1471 945 0:31 / /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,rdma
1696 945 0:32 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,cpu,cpuacct
1714 945 0:33 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,freezer
2099 945 0:34 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,blkio
2345 945 0:35 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,pids
2390 945 0:36 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,cpuset
2409 945 0:37 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,memory
2428 945 0:38 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,net_cls,net_prio
2447 945 0:39 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,hugetlb
2466 945 0:40 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:24 - cgroup cgroup rw,perf_event
2485 945 0:41 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:25 - cgroup cgroup rw,devices
For a temporary workaround, we changed the kubelet settings so that it can at least come up on its own again.
We are not using the node-allocatable constraint, so this is OK for us.
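The exact snippet we use is not shown here; as an illustration only, a hypothetical flag combination along those lines (assuming a kubeadm-style systemd unit that picks up $KUBELET_EXTRA_ARGS) could look like this. With --cgroups-per-qos=false the kubelet no longer manages the [kubepods] QoS cgroups, which in turn requires --enforce-node-allocatable to be empty:
# Hypothetical example only - not the exact configuration used on these nodes.
cat <<'EOF' >/etc/systemd/system/kubelet.service.d/99-cgroup-workaround.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--cgroups-per-qos=false --enforce-node-allocatable="
EOF
systemctl daemon-reload
systemctl restart kubelet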