kubernetes: Cgroup leaking, no space left on /sys/fs/cgroup
What happened: Cgroups are leaking, and the kernel is running out of memory.
Oct 26 19:07:41 kubelet[1606]: W1026 19:07:41.543128 1606 raw.go:87] Error while processing event ("/sys/fs/cgroup/devices/system.slice/run-r18291965551b44e2bcfc7076348375a5.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/run-r18291965551b44e2bcfc7076348375a5.scope: no space left on device
ls /sys/fs/cgroup/devices/system.slice/run-r* -d | wc
5920 5920 473577
There are many cgroups matching the pattern system.slice/run-r${SOMEID}.scope under different controllers, and they never seem to get cleaned up.
Eventually, these leaking cgroups cause all kinds of instability, including but not limited to:
- kubectl logs -f reports "no space left"
- pod networking is interrupted or becomes unreachable
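To get a rough sense of how many scopes have leaked per controller, a sweep like the following can be used (a minimal sketch, assuming the run-r*.scope pattern shown above):
# Count leaked systemd transient scopes per cgroup controller (sketch).
for c in /sys/fs/cgroup/*/system.slice; do
  n=$(find "$c" -maxdepth 1 -type d -name 'run-r*.scope' | wc -l)
  printf '%-55s %6d\n' "$c" "$n"
done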
What you expected to happen: Such cgroups should be cleaned up after use.
How to reproduce it (as minimally and precisely as possible): It happens on all of our on-prem Kubernetes nodes.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="16.04.2 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.2 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial
- Kernel (e.g. uname -a): Linux 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux (we have various kernel versions)
- Install tools:
- Others:
/kind bug
About this issue
- State: closed
- Created 6 years ago
- Reactions: 12
- Comments: 34 (7 by maintainers)
Commits related to this issue
- Implement workaround to clean up leaking cgroups This change implements a cleaner, that scans for cgroups created by systemd-run --scope that do not have any pids assigned, indicating that the cgroup... — committed to gravitational/planet by deleted user 4 years ago
- Implement workaround to clean up leaking cgroups (#570) * Implement workaround to clean up leaking cgroups This change implements a cleaner, that scans for cgroups created by systemd-run --scope ... — committed to gravitational/planet by deleted user 4 years ago
- Implement workaround to clean up leaking cgroups (#570) (#577) * Implement workaround to clean up leaking cgroups This change implements a cleaner, that scans for cgroups created by systemd-run -... — committed to gravitational/planet by deleted user 4 years ago
- Implement workaround to clean up leaking cgroups (#570) (#576) * Implement workaround to clean up leaking cgroups This change implements a cleaner, that scans for cgroups created by systemd-run -... — committed to gravitational/planet by deleted user 4 years ago
@reaperes’s systemd cgroup cleanup code from #64137 seemed the cleanest and most surgical of all the workarounds that I’ve found documented for this and the related issues, so I’ve converted it into a DaemonSet that runs the fix hourly on every node in a cluster. You could set any interval that you like, of course, but the script isn’t very resource intensive and hourly seemed reasonable. It actually takes about a day or so for the CPU loading to become noticeable in my cluster and a week or so for it to crash a node. I’ve been running this for a few days now in my staging cluster and it appears to keep the CPU loading under control.
kubectl apply -f https://github.com/derekrprice/k8s-hacks/blob/master/systemd-cgroup-gc.yaml
This issue might be connected to https://github.com/google/cadvisor/issues/1581
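For reference, the core of such a cleanup (this is only a sketch of the idea described in the commits above, not the contents of the linked YAML) is to find run-r*.scope cgroups that no longer have any pids attached and remove them:
# Sketch: remove systemd-run transient scope cgroups with no attached pids.
for scope in /sys/fs/cgroup/*/system.slice/run-r*.scope; do
  [ -d "$scope" ] || continue
  # cgroup.procs is empty once no process belongs to the cgroup anymore
  if [ ! -s "$scope/cgroup.procs" ]; then
    rmdir "$scope" 2>/dev/null || true
  fi
done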
If you take a closer look, you can verify that the problem is in the inotify_add_watch call.
The default inotify limit on Ubuntu is 8192, which can be the limiting factor here. So I decided to test increasing the limit and ran this command on one of my cluster nodes:
$ sudo sysctl fs.inotify.max_user_watches=524288
After that I kept watching journalctl -f.
In my case the error messages disappeared.
@c-nuro Can you test it on your system?
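If that helps, the limit can also be made persistent across reboots via a sysctl drop-in (a sketch; the value is the one tested above, not an officially recommended number):
# Persist the higher inotify watch limit (sketch).
echo 'fs.inotify.max_user_watches=524288' | sudo tee /etc/sysctl.d/90-inotify-watches.conf
sudo sysctl --system   # reload sysctl settings, including the new drop-in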
EDIT: I deployed the following DaemonSet to my Kubernetes and the problem is gone.
So, we’re also hitting this problem, and via another post we found an inotify_watcher.sh script.
Output from a system experiencing this issue is…
So something in kubelet is watching a lot of files… almost 72k to be exact!
For comparison, another host that is behaving well shows < 1k.
What I did notice is that after kubelet successfully restarted (after I increased fs.inotify.max_user_watches=524288, which is very excessive IMO) and I restarted a pod that was in a bad state, the watch count decreased significantly over time; this is the same output ~10 minutes later.
What I don’t know how to trace is what caused the huge spike in kubelet’s inotify watches.
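For anyone without that script handy, watches per process can be counted directly from /proc (a sketch, not the inotify_watcher.sh referenced above; run as root to include all processes):
# Count inotify watches per process; each "inotify wd:" line in fdinfo is one watch.
for pid in /proc/[0-9]*; do
  count=$(grep -hc '^inotify' "$pid"/fdinfo/* 2>/dev/null | awk '{s+=$1} END {print s+0}')
  if [ "$count" -gt 0 ]; then
    printf '%6d %s\n' "$count" "$(tr '\0' ' ' < "$pid/cmdline" | cut -c1-80)"
  fi
done | sort -rn | head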
So, be careful: I guess the solution above of increasing the watch allotment can be a band-aid that might, ironically, cause #64137 to occur. Hence cross-referencing these issues, as they are closely related (that is, certain types of cgroup leaking seem closely tied to kubelet CPU hogging)… for specs, we are seeing this on 40-core CentOS hardware.
Setting sysctl fs.inotify.max_user_watches=524288 seems to have solved the issue for now for me. We use flexVolume. Any news on a permanent fix for this?
I noticed this cgroup pattern being used for mounting pod volumes.
The error from rpcbind is unrelated to this issue, but its output shows it running under a cgroup with this pattern.
Can someone who works on volume mounts take a look?
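A quick way to confirm which transient scope a mount helper ended up in is to read its /proc entry (a sketch; rpcbind is just the example process from the output mentioned above):
# Show the cgroup(s) a process belongs to; a run-rXXXX.scope entry means it was
# started inside a systemd-run transient scope (sketch, using rpcbind as example).
pid=$(pidof rpcbind | awk '{print $1}')
[ -n "$pid" ] && cat "/proc/$pid/cgroup"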