kubernetes: Cgroup leaking, no space left on /sys/fs/cgroup

What happened: Cgroups are leaking, and the kernel eventually runs out of resources:

Oct 26 19:07:41  kubelet[1606]: W1026 19:07:41.543128    1606 raw.go:87] Error while processing event ("/sys/fs/cgroup/devices/system.slice/run-r18291965551b44e2bcfc7076348375a5.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/run-r18291965551b44e2bcfc7076348375a5.scope: no space left on device

ls /sys/fs/cgroup/devices/system.slice/run-r* -d | wc
   5920    5920  473577

There are many cgroups matching the pattern system.slice/run-r${SOMEID}.scope under the different controller hierarchies, and they never seem to get cleaned up.
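
A quick way to see how widespread the leak is across all controller hierarchies (a small sketch using standard shell tools, not part of the original report):

# Count leaked run-r*.scope cgroups per controller hierarchy
for c in /sys/fs/cgroup/*/; do
  n=$(find "${c}system.slice" -maxdepth 1 -type d -name 'run-r*.scope' 2>/dev/null | wc -l)
  echo "${c} ${n}"
done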

Eventually, these leaked cgroups cause all kinds of instability, including but not limited to:

  • kubectl logs -f reporting "no space left on device"
  • pod network interruptions / pods becoming unreachable

What you expected to happen: Such cgroups should be cleaned up after use.

How to reproduce it (as minimally and precisely as possible): It happens on all of our on-prem Kubernetes nodes.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="16.04.2 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.2 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial

  • Kernel (e.g. uname -a): Linux 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux (we have various kernel versions)

  • Install tools:

  • Others:

/kind bug

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 12
  • Comments: 34 (7 by maintainers)

Most upvoted comments

@reaperes’s systemd cgroup cleanup code from #64137 seemed the cleanest and most surgical of all the workarounds that I’ve found documented for this and the related issues, so I’ve converted it into a DaemonSet that runs the fix hourly on every node in a cluster. You could set any interval that you like, of course, but the script isn’t very resource intensive and hourly seemed reasonable. It actually takes about a day or so for the CPU loading to become noticeable in my cluster and a week or so for it to crash a node. I’ve been running this for a few days now in my staging cluster and it appears to keep the CPU loading under control.
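
For reference, the core of such a cleanup job can be sketched roughly like this. This is my own illustration, not the actual script from #64137; it assumes the leaked run-r*.scope groups are empty, and relies on the kernel refusing to rmdir a cgroup that still has tasks or children:

#!/bin/sh
# Hourly loop: try to remove leaked transient scope cgroups under every controller.
# rmdir fails (EBUSY/ENOTEMPTY) on a cgroup that still has tasks or children,
# so anything still in use is left alone.
while true; do
  for d in /sys/fs/cgroup/*/system.slice/run-r*.scope; do
    [ -d "$d" ] && rmdir "$d" 2>/dev/null
  done
  sleep 3600
done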

This issue might be connected to https://github.com/google/cadvisor/issues/1581

If you take a closer look at the log line, you can see that the failure is inside the inotify_add_watch call.
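
For anyone checking their own nodes first, the relevant limits are plain sysctl keys:

# Show the inotify limits currently in effect on the node
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances fs.inotify.max_queued_events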

The default inotify watch limit (fs.inotify.max_user_watches) on Ubuntu is 8192, which can be the limiting factor here. So I decided to test increasing the limit. I ran this command on one of my cluster nodes:

$ sudo sysctl fs.inotify.max_user_watches=524288

After that I kept watching journalctl -f.

In my case the error messages disappeared.
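
Note that sysctl -w only lasts until the next reboot. To make the change persistent on Ubuntu you can drop it into /etc/sysctl.d (the file name below is just an example):

# Persist the higher inotify watch limit across reboots
echo 'fs.inotify.max_user_watches=524288' | sudo tee /etc/sysctl.d/90-inotify.conf
sudo sysctl --system    # reload all sysctl configuration files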

@c-nuro Can you test it on your system?

EDIT: I deployed the following DaemonSet to my Kubernetes cluster and the problem is gone.

apiVersion: "extensions/v1beta1"
kind: "DaemonSet"
metadata:
  name: "sysctl"
  namespace: "default"
spec:
  template:
    metadata:
      labels:
        app: "sysctl"
    spec:
      containers:
        - name: "sysctl"
          image: "busybox:latest"
          resources:
            limits:
              cpu: "10m"
              memory: "8Mi"
            requests:
              cpu: "10m"
              memory: "8Mi"
          securityContext:
            privileged: true
          command:
            - "/bin/sh"
            - "-c"
            - |
              set -o errexit
              set -o xtrace
              while sysctl -w fs.inotify.max_user_watches=525000
              do
                sleep 60s
              done
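
If you try this, applying and verifying is straightforward (the file name here is just a placeholder; the label and namespace come from the manifest above). On newer clusters the apiVersion would need to be apps/v1 with a spec.selector, but the idea is the same.

# Apply the DaemonSet and confirm the new limit on a node
kubectl apply -f sysctl-daemonset.yaml
kubectl -n default get pods -l app=sysctl -o wide
# then, on any node:
sysctl fs.inotify.max_user_watches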

So, we're also having this problem, and via another post I found this inotify_watchers.sh script:

#!/usr/bin/env bash
#
# Copyright 2018 (c) Yousong Zhou
#
# This script can be used to debug "no space left on device due to inotify
# "max_user_watches" limit".  It will output processes using inotify methods
# for watching file system activities, along with HOW MANY directories each
# inotify fd watches
#
# A temporary method of working around the said issue above is to tune up the
# limit. It's a per-user limit:
#
#       sudo sysctl fs.inotify.max_user_watches=81920
#
# In case you also wonder why "sudo systemctl restart sshd" triggers inotify
# errors, they come from systemd-tty-ask-password-agent
#
#       execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-passwor"..., "--watch"], [/* 16 vars */]) = 0
#       inotify_init1(O_CLOEXEC)                = 4
#       inotify_add_watch(4, "/run/systemd/ask-password", IN_CLOSE_WRITE|IN_MOVED_TO) = -1 ENOSPC (No space left on device)
#
# Sample output
#
#       [yunion@titan yousong]$ sudo bash a.sh  | column -t
#       systemd          /usr/lib/systemd/systemd          1      /proc/1/fdinfo/10     1
#       systemd          /usr/lib/systemd/systemd          1      /proc/1/fdinfo/14     4
#       systemd          /usr/lib/systemd/systemd          1      /proc/1/fdinfo/20     4
#       systemd-udevd    /usr/lib/systemd/systemd-udevd    689    /proc/689/fdinfo/7    4
#       NetworkManager   /usr/sbin/NetworkManager          914    /proc/914/fdinfo/10   5
#       NetworkManager   /usr/sbin/NetworkManager          914    /proc/914/fdinfo/11   4
#       crond            /usr/sbin/crond                   939    /proc/939/fdinfo/5    3
#       rsyslogd         /usr/sbin/rsyslogd                1212   /proc/1212/fdinfo/3   2
#       kube-controller  /usr/bin/kube-controller-manager  4934   /proc/4934/fdinfo/8   1
#       kubelet          /usr/bin/kubelet                  4955   /proc/4955/fdinfo/12  0
#       kubelet          /usr/bin/kubelet                  4955   /proc/4955/fdinfo/17  1
#       kubelet          /usr/bin/kubelet                  4955   /proc/4955/fdinfo/26  51494
#       journalctl       /usr/bin/journalctl               13151  /proc/13151/fdinfo/3  2
#       sdnagent         /opt/yunion/bin/sdnagent          20558  /proc/20558/fdinfo/7  90
#       systemd-udevd    /usr/lib/systemd/systemd-udevd    46019  /proc/46019/fdinfo/7  4
#       systemd-udevd    /usr/lib/systemd/systemd-udevd    46020  /proc/46020/fdinfo/7  4
#
# The script is adapted from https://stackoverflow.com/questions/13758877/how-do-i-find-out-what-inotify-watches-have-been-registered/48938640#48938640
#
set -o errexit
set -o pipefail
lsof +c 0 -n -P -u root \
        | awk '/inotify$/ { gsub(/[urw]$/,"",$4); print $1" "$2" "$4 }' \
        | while read name pid fd; do \
                exe="$(readlink -f /proc/$pid/exe || echo n/a)"; \
                fdinfo="/proc/$pid/fdinfo/$fd" ; \
                count="$(grep -c inotify "$fdinfo" || true)"; \
                echo "$name $exe $pid $fdinfo $count"; \
        done

Output from a system that is experiencing this issue:

# sh inotify_watchers.sh
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/10 1
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/14 4
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/20 4
systemd-udevd /usr/lib/systemd/systemd-udevd 5029 /proc/5029/fdinfo/7 9
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/10 5
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/11 4
crond /usr/sbin/crond 9909 /proc/9909/fdinfo/5 3
rsyslogd /usr/sbin/rsyslogd 10275 /proc/10275/fdinfo/3 2
kubelet /usr/local/bin/kubelet 27818 /proc/27818/fdinfo/6 1
kubelet /usr/local/bin/kubelet 27818 /proc/27818/fdinfo/11 0
kubelet /usr/local/bin/kubelet 27818 /proc/27818/fdinfo/15 1
kubelet /usr/local/bin/kubelet 27818 /proc/27818/fdinfo/20 71987

So something in kubelet is watching a lot of files… almost 72k to be exact!

For comparison, here is another host which is behaving; it's < 1k:

# sh inotify_watchers.sh
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/10 1
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/15 4
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/16 4
systemd-udevd /usr/lib/systemd/systemd-udevd 1900 /proc/1900/fdinfo/7 3
rsyslogd /usr/sbin/rsyslogd 4053 /proc/4053/fdinfo/3 2
crond /usr/sbin/crond 4134 /proc/4134/fdinfo/5 3
grafana-watcher /usr/bin/grafana-watcher 27945 /proc/27945/fdinfo/3 1
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/5 1
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/10 0
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/16 1
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/24 1
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/30 780

What I did notice is that after kubelet successfully restarted (after I increased fs.inotify.max_user_watches=524288, which is very excessive IMO) and I restarted a pod which was in a bad state, the watch count decreased significantly over time. This is the same output ~10 mins later:

# sh inotify_watchers.sh
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/10 1
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/14 4
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/20 4
systemd-udevd /usr/lib/systemd/systemd-udevd 5029 /proc/5029/fdinfo/7 9
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/10 5
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/11 4
crond /usr/sbin/crond 9909 /proc/9909/fdinfo/5 3
rsyslogd /usr/sbin/rsyslogd 10275 /proc/10275/fdinfo/3 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/49 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/51 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/53 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/55 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/57 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/59 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/61 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/63 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/65 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/67 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/69 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/80 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/83 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/84 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/118 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/120 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/154 2
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/6 1
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/7 0
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/10 352
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/16 1

What I don't know how to trace is what caused the huge spike in the kubelet process's inotify watches.
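
For anyone wanting to dig further (my own suggestion, not from the thread): each watch shows up as an "inotify wd:" line in the fd's fdinfo file, so you can see both the count and the device/inode of every watched object. Given the raw.go error at the top of this issue, the watched objects here are most likely the cgroup directories that cAdvisor (inside kubelet) watches, so the leaked run-r*.scope groups and the watch count grow together.

# pid/fd below are the kubelet values from the script output above; adjust for your node
sudo grep -c '^inotify' /proc/27818/fdinfo/20    # number of active watches on that fd
sudo head /proc/27818/fdinfo/20                  # wd/ino/sdev of the individual watches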

So, be careful: I suspect the solution above of increasing the watch allotment can be a band-aid that might, ironically, cause #64137 to occur. Hence I'm cross-referencing these issues, as they are closely related (that is, certain types of cgroup leaking seem closely tied to kubelet CPU hogging). For specs, we're seeing this on 40-core CentOS hardware.

Setting sysctl fs.inotify.max_user_watches=524288 seems to have solved the issue for me for now. We use flexVolume. Any news on a permanent fix for this?

I noticed this pattern of cgroup being used for mounting pod volumes.

The error from rpcbind below is unrelated to this issue, but its output shows it running under this same pattern of cgroup.

Can someone who works on volume mounts take a look?

Mount failed: Mount issued for NFS V3 but unable to run rpcbind:
 Output: rpcbind: another rpcbind is already running. Aborting
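
For context on where these scopes come from (my understanding, consistent with the comment above): on systemd hosts kubelet wraps volume mounts in transient systemd scopes, and every such invocation creates one run-rXXXXXXXX.scope unit and cgroup. Roughly, an illustrative (not kubelet's exact) invocation looks like:

# Each systemd-run --scope call creates a transient run-r<random>.scope unit/cgroup;
# server:/export and /mnt/example are placeholders, kubelet uses the pod's volume path.
systemd-run --description="illustrative NFS mount" --scope -- \
  mount -t nfs -o vers=3 server:/export /mnt/example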