kubernetes: application crashes because k8s 1.9.x enables kernel memory accounting by default

After we upgraded k8s from 1.6.4 to 1.9.0, within a few days our production environment started reporting hung machines and random JVM crashes inside containers. We found that cgroup memory css IDs are not released; once the css ID count grows past 65535 the machine hangs and we must restart it.

We found that runc/libcontainer/cgroups/fs/memory.go vendored in k8s 1.9.0 deleted the if condition shown below, which causes kernel memory accounting to be enabled by default. But we are using kernel 3.10.0-514.16.1.el7.x86_64, and on this kernel version the kernel memory limit is not stable, so memory cgroups leak and applications crash randomly.
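
For reference, a quick way to see how many memory cgroups the kernel is currently tracking on a node (a diagnostic sketch only, not part of any fix) is to read the memory controller line from /proc/cgroups:

# Print the memory controller line from /proc/cgroups.
# Columns are: subsys_name, hierarchy, num_cgroups, enabled.
awk '$1 == "memory" {print "memory cgroups tracked by the kernel:", $3}' /proc/cgroups

Note that, as the reproduction later in this thread shows, this count can drop when directories are removed even though the internal css IDs stay occupied, so a low number does not guarantee the node is healthy.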

when we run "docker run -d --name test001 --kernel-memory 100M " , docker report WARNING: You specified a kernel memory limit on a kernel older than 4.0. Kernel memory limits are experimental on older kernels, it won’t work as expected and can cause your system to be unstable.

k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go

-		if d.config.KernelMemory != 0 {
+			// Only enable kernel memory accouting when this cgroup
+			// is created by libcontainer, otherwise we might get
+			// error when people use `cgroupsPath` to join an existed
+			// cgroup whose kernel memory is not initialized.
 			if err := EnableKernelMemoryAccounting(path); err != nil {
 				return err
 			}
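
To confirm on a running node that this code path has actually switched kernel memory accounting on, you can look at the kmem counters the kernel exposes once accounting is active. This is only a diagnostic sketch; the path assumes the default cgroupfs driver layout (with the systemd driver it is kubepods.slice instead):

# Print every pod-level kmem usage counter with its path. A non-zero value
# means kernel memory accounting is active for that cgroup; the same check is
# used later in this thread to verify the cgroup.memory=nokmem workaround.
find /sys/fs/cgroup/memory/kubepods -name memory.kmem.usage_in_bytes \
  -exec grep -H . {} \;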

I want to know: why is kernel memory accounting enabled by default? Can k8s take the different kernel versions into account?

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

/kind bug

What happened: applications crash and memory cgroups leak

What you expected to happen: applications stay stable and memory cgroups do not leak

How to reproduce it (as minimally and precisely as possible): install k8s 1.9.x on a machine running kernel 3.10.0-514.16.1.el7.x86_64, then create and delete pods repeatedly. After roughly 65535/3 create/delete cycles the kubelet reports a "cgroup no space left on device" error, and once the cluster has been running for a few days containers start to crash.

Anything else we need to know?:

Environment: kernel 3.10.0-514.16.1.el7.x86_64

  • Kubernetes version (use kubectl version): k8s 1.9.x
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a): 3.10.0-514.16.1.el7.x86_64
  • Install tools: rpm
  • Others:

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 39
  • Comments: 119 (51 by maintainers)

Most upvoted comments

Not only 1.9: 1.10 and master have the same issue. This is a very serious issue for production; I think providing a parameter to disable the kmem limit would be good.

/cc @dchen1107 @thockin any comments for this? Thanks.

The test case below reproduces this error. First, fill up the memory cgroup ID space:

# uname -r
3.10.0-514.10.2.el7.x86_64
# kubelet --version
Kubernetes 1.9.0
# mkdir /sys/fs/cgroup/memory/test
# for i in `seq 1 65535`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done
# cat /proc/cgroups |grep memory
memory  11      65535   1

Then release some memory cgroups, which leaves 99 slots that can be reused in the next step:

# for i in `seq 1 100`;do rmdir /sys/fs/cgroup/memory/test/test-${i} 2>/dev/null 1>&2; done 
# mkdir /sys/fs/cgroup/memory/stress/
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done 
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device <-- note that number 100 cannot be created
# for i in `seq 1 100`;do rmdir /sys/fs/cgroup/memory/test/test-${i}; done <-- delete the 100 memory cgroups
# cat /proc/cgroups |grep memory
memory  11      65436   1

Second, create a new pod on this node. Each pod creates 3 memory cgroup directories, for example:

# ll /sys/fs/cgroup/memory/kubepods/pod0f6c3c27-3186-11e8-afd3-fa163ecf2dce/
total 0
drwxr-xr-x 2 root root 0 Mar 27 14:14 6d1af9898c7f8d58066d0edb52e4d548d5a27e3c0d138775e9a3ddfa2b16ac2b
drwxr-xr-x 2 root root 0 Mar 27 14:14 8a65cb234767a02e130c162e8d5f4a0a92e345bfef6b4b664b39e7d035c63d1

So when we recreate the 100 cgroup directories, 4 of them fail:

# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done    
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-97’: No space left on device <-- 3 directories used by the pod
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-98’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-99’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device
# cat /proc/cgroups 
memory  11      65439   1

Third, delete the test pod. After confirming that all of the test pod's containers have been destroyed, recreate the 100 cgroup directories. The correct result we expect is that only directory number 100 cannot be created:

# cat /proc/cgroups 
memory  11      65436   1
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done 
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device

But the actual, incorrect result is that all the memory cgroup slots used by the pod are leaked:

# cat /proc/cgroups 
memory  11      65436   1 <-- current total number of memory cgroups
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done    
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-97’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-98’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-99’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device

Notice that the memory cgroup count has already dropped by 3, but the slots those cgroups occupied are not released.
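
Building on that reproduction, a small probe along these lines (the path name is made up for illustration) can tell you whether a node has already run out of usable memory cgroup slots before things get bad enough to hang:

#!/bin/sh
# Try to create and immediately remove a scratch memory cgroup. On a healthy
# node this succeeds; on a node whose css IDs have leaked away it fails with
# "No space left on device", matching the kubelet error described above.
probe=/sys/fs/cgroup/memory/kmem-leak-probe
if mkdir "$probe" 2>/dev/null; then
  rmdir "$probe"
  echo "OK: memory cgroups can still be created"
else
  echo "FAIL: cannot create a memory cgroup (likely css ID exhaustion)"
fi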

Just my 2c here.

If you're on CentOS, which means kernel 3.10, I think trying to rebuild the kernel to disable kmem accounting is going to get you into some trouble, like @scott0000's message about kABI breakage when disabling the config. CentOS/RHEL tends to care about that, for kernel module compatibility, so disabling that option in the kernel config might cause trouble elsewhere. It also means you can't go back to upgrading to official kernels, so you end up in a cycle of maintaining your own kernel…

We have only explored and decided to disable kmem accounting because that was the best choice in our situation, but that’s on kernel 4.4 (way more recent than 3.10, so source is cleaner without so many patches) and on a more contained O.S. that doesn’t really allow external kernel modules, so kABI is not that big of a problem for us either.

I think a possible approach is to patch Kubernetes instead to not enable kmem. As pointed out before, this was introduced in PR opencontainers/runc#1350, so reverting that patch in the vendored libcontainer would probably achieve the same result, of not enabling kmem accounting and therefore bypassing the issue. (Note I haven’t tested any of this, I’m just suggesting this as an avenue of exploration.)

And as I pointed out in a previous comment, introducing a --disable-kmem-limit flag to kubelet to enable that effect (or, more specifically, preventing kmem accounting from being enabled through that libcontainer change) would be a welcome addition, though it’s unclear how that could be accomplished (whether the code in libcontainer could be made conditional etc.) and how many changes to the code would be needed to accomplish that. Also unclear whether we would be willing to backport this all the way to 1.9 or 1.8… But it’s a possibility. Again, if someone would like to work on that, I think it would be welcome, just need someone to do the work on it.

Anyways, in short: rebuilding CentOS kernel probably bad, patching the vendored libcontainer in your build probably good and not too hard, introducing new --disable-kmem-limit best but hardest to accomplish. Good luck! 😃

Cheers! Filipe

@lingyanmeng one workaround is to build the kubelet with a command like:

git clone --branch v1.14.1 --single-branch --depth 1 https://github.com/kubernetes/kubernetes
cd kubernetes
KUBE_GIT_VERSION=v1.14.1 ./build/run.sh make kubelet GOFLAGS="-tags=nokmem"

The resulting binary should be located under ./_output/dockerized/bin/$GOOS/$GOARCH/kubelet.

It seems that the kernel bug which causes this error is finally fixed now, and will be released in kernel-3.10.0-1075.el7, which is due in RHEL 7.8, but goodness knows when that will be, as RHEL 7.7 only came out on August 6th, ~3 weeks ago.

https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101

It would be great if you could contribute that…

From our digging into it, it seems kubelet is the only process setting the kmem limit (runc does not), so in that sense you only need to fix kubelet and not runc.

On the other hand, you probably need a change in libcontainer (part of runc) to make it possible for kubelet to skip setting the kmem limit there (since it seems it was the libcontainer change that triggered this, might need to be made conditional…)

Happy to help with code reviews and further guidance. @dashpole is definitely a good contact as well.

We were able to solve the kernel memory issue with Centos 7.7, Kernel 3.10.0-1062.4.1.el7.x86_64 by setting a kernel parameter: cgroup.memory=nokmem

Example:

grubby --args=cgroup.memory=nokmem --update-kernel /boot/vmlinuz-3.10.0-1062.4.1.el7.x86_64 
reboot

After reboot, check whether the parameter is set:

cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1062.4.1.el7.x86_64 root=/dev/mapper/cl-root ro nousbstorage crashkernel=auto rd.lvm.lv=cl/root rhgb quiet cgroup.memory=nokmem

That would probably help a lot of people out until Red Hat fixes the kernel.

I wonder if it will even be possible for Red Hat to fix their kernel. https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-414550261 noted that rebuilding the kernel with CONFIG_MEMCG_KMEM turned off is an ABI change, which Red Hat promises not to make for the full lifetime of their kernel. Furthermore, the reason that CONFIG_MEMCG_KMEM leaks memory in 3.10 was a known issue for that version, and it was never intended to be used outside of development. The feature wasn't finished until much later, so for Red Hat to backport all the changes needed to finish it may also very well introduce an ABI change (though this is just an assumption), so Red Hat may be stuck in this situation.

This is why it would be very much appreciated if k8s/runc provided a means to work-around this issue.

May I ask if it’s fixed or not in k8s 1.13? How can I prevent this issue in my CentOS 7.5 with 3.10 kernel?

Thanks!

Follow this recommendation: Known Issue - KMEM - MSPH-2018-0006. We recommend that all Mesosphere DC/OS users on RHEL or CentOS 7.X upgrade to Docker 18.09.1, which was released on 01/09/2019. Docker 18.09.1 is a patch from Docker that does not enable kernel memory accounting within the Engine on the RHEL 7.X kernel.

Note: you must reboot your machine to disable kernel memory accounting.

Yes, I agree the root cause of this issue is the kernel, but the Red Hat issue tracking it (https://bugzilla.redhat.com/show_bug.cgi?id=1507149) has been open since 2017-10-27, so if Kubernetes could implement a flag as suggested in https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-391064357, that would probably help a lot of people out until Red Hat fixes the kernel.

FYI, we managed to recompile the kernel with kmem disabled, and it fixed the issue, however this is not a viable solution, as our product has to be deployed to environments that we don’t own.

@warmchang I’m not affiliated with Redhat but I believe they are working on this. See https://bugzilla.redhat.com/show_bug.cgi?id=1507149

I think I’m in favor of adding a --disable-kmem-limit command-line flag… I guess that means first adding the plumbing through libcontainer to make that possible and then adding the flag to make Kubelet respect that…

Indeed, there’s no good way to disable this system-wide except for recompiling the kernel… We’ve recently gone through the effort of rebuilding our 4.4 kernel systems to disable kmem accounting. (While it’s desirable to enable it on systems with 4.13 or 4.14, where the accounting works properly without leaks and the information is useful to provide more precise memory accounting which should help in finding better targets for eviction in system OOMs.)

Cheers, Filipe

@gyliu513 enabling kernel memory accounting in that PR was not intentional. However, we do try and stay close to upstream runc so we can continue to receive bug-fixes and other improvements. The original runc bump in cAdvisor, which required me to update runc in kubernetes/kubernetes was for a bugfix. As pointed out in https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-377736075, the correct work-around here is to disable kernel memory accounting in your kernel.

Thanks @chilicat. I tested the cgroup.memory=nokmem boot option on kernel 3.10.0-1062.4.1.el7.x86_64, and yes, it works:

# find /sys/fs/cgroup/memory/kubepods -name memory.kmem.usage_in_bytes  -exec cat {} \;
0
0
0
0
0
0
0
0
0
...

Heads-up: I also tested on kernel-3.10.0-957.5.1, where the boot option has no effect, i.e. memory.kmem.usage_in_bytes is not zero.

@gyliu513

  1. Follow this comment to disable kernel memory accounting by recompiling the kernel: https://github.com/opencontainers/runc/issues/1725#issuecomment-380428228.
  2. I did some testing with kernel memory accounting disabled, and found that it had a relatively small impact on the kubelet's ability to manage memory. I would recommend increasing the memory.available threshold in --eviction-hard by 50Mi when disabling kernel memory accounting (see the example below).
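
An illustrative sketch of that suggestion (assuming your kubelet currently uses the upstream default of memory.available<100Mi; merge the flag with whatever flags your node already passes):

# Raise the kubelet's hard-eviction threshold for memory by 50Mi over the
# upstream default of 100Mi. The value is quoted so the shell does not treat
# "<" as a redirect.
kubelet --eviction-hard='memory.available<150Mi'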

cc @filbranden FYI

@feellifexp the kernel log also has this message after upgrading to k8s 1.9.x:

kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8020)

CentOS 7 is a much older kernel than what we test CI on in SIG Node/upstream Kubernetes (currently the 5.4.x series). People are welcome to experiment with kernel parameters and share workarounds for their own distributions/deployments but any support will be best effort.

but goodness knows when that will be, as RHEL 7.7 only came out on August 6th, ~3 weeks ago.

released on: 2019-10-15

For reference: https://access.redhat.com/errata/RHSA-2019:3055

@qkboy, thanks for your reply. Per my understanding, the workaround is to upgrade Docker to 18.09.1; however, k8s also uses kmem. I'm wondering if there is a fix for k8s.

Thanks!

CentOS 7.5 fixes it as far as I can test.

@pires That doesn't help me. Is there any other solution?

# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
# uname -r
3.10.0-862.14.4.el7.x86_64

The environment variable approach is absolutely fine too if it works. Are there any plans to create a PR against the kubernetes master branch with the code from https://github.com/scality/kubernetes/commit/b404b050, if it resolves the issue?