kubernetes: applications crash because k8s 1.9.x enables kernel memory accounting by default
When we upgraded k8s from 1.6.4 to 1.9.0, after a few days the production environment started reporting machines hanging and JVMs crashing inside containers at random. We found that memory cgroup css ids were not being released; once the css id count grows past 65535, the machine hangs and we must restart it.
We found that in k8s 1.9.0 the vendored runc libcontainer memory.go had deleted the `if` condition, which causes kernel memory accounting to be enabled by default. But we are using kernel 3.10.0-514.16.1.el7.x86_64, and on this kernel version the kernel memory limit is not stable: it leaks memory cgroups and crashes applications randomly.
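Two quick checks can confirm the symptom (paths are assumed for illustration; this is not from the original report):

```sh
# When css ids are exhausted, creating any new memory cgroup fails
# even though only a handful of directories still exist:
mkdir /sys/fs/cgroup/memory/probe
# -> mkdir: cannot create directory '...': No space left on device

# Check whether kernel memory accounting was switched on for a cgroup
# (any non-zero value means it is active; the kubepods path assumes
# kubelet's default cgroupfs layout):
cat /sys/fs/cgroup/memory/kubepods/memory.kmem.usage_in_bytes
```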
when we run "docker run -d --name test001 --kernel-memory 100M " , docker report WARNING: You specified a kernel memory limit on a kernel older than 4.0. Kernel memory limits are experimental on older kernels, it won’t work as expected and can cause your system to be unstable.
k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go:

```diff
-	if d.config.KernelMemory != 0 {
+	// Only enable kernel memory accounting when this cgroup
+	// is created by libcontainer, otherwise we might get an
+	// error when people use `cgroupsPath` to join an existing
+	// cgroup whose kernel memory is not initialized.
 	if err := EnableKernelMemoryAccounting(path); err != nil {
 		return err
 	}
```
I want to know: why is kernel memory accounting enabled by default? Can k8s take different kernel versions into account?
Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT
/kind bug
What happened: applications crash and memory cgroups leak.
What you expected to happen: applications remain stable and memory cgroups do not leak.
How to reproduce it (as minimally and precisely as possible): install k8s 1.9.x on a machine with kernel 3.10.0-514.16.1.el7.x86_64, then create and delete pods repeatedly. After more than 65535/3 creations, kubelet reports a "cgroup: no space left on device" error; once the cluster has run for a few days, containers crash. (A sketch of a churn loop follows below.)
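A minimal churn loop along these lines can trigger it; the pod names, image, and counts below are illustrative, and the kubectl syntax is modern rather than 1.9-era:

```sh
# Each create/delete cycle consumes ~3 memory cgroup css ids that a
# broken 3.10 kernel never returns; ~22000 cycles exhausts 65535 ids.
for i in $(seq 1 22000); do
  kubectl run "churn-$i" --image=busybox --restart=Never -- sleep 1
  sleep 5
  kubectl delete pod "churn-$i" --ignore-not-found
done
# Eventually kubelet starts logging "no space left on device" when
# creating pod cgroups.
```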
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): k8s 1.9.x
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
  NAME="CentOS Linux"
  VERSION="7 (Core)"
  ID="centos"
  ID_LIKE="rhel fedora"
  VERSION_ID="7"
  PRETTY_NAME="CentOS Linux 7 (Core)"
  ANSI_COLOR="0;31"
  CPE_NAME="cpe:/o:centos:centos:7"
  HOME_URL="https://www.centos.org/"
  BUG_REPORT_URL="https://bugs.centos.org/"
  CENTOS_MANTISBT_PROJECT="CentOS-7"
  CENTOS_MANTISBT_PROJECT_VERSION="7"
  REDHAT_SUPPORT_PRODUCT="centos"
  REDHAT_SUPPORT_PRODUCT_VERSION="7"
- Kernel (e.g. `uname -a`): 3.10.0-514.16.1.el7.x86_64
- Install tools: rpm
- Others:
About this issue
- State: closed
- Created 6 years ago
- Reactions: 39
- Comments: 119 (51 by maintainers)
Commits related to this issue
- vendor/runc: Optionally disable kmem limits See: https://github.com/kubernetes/kubernetes/issues/61937 See: https://github.com/opencontainers/runc/pull/1350 See: https://github.com/moby/moby/issues/2... — committed to scality/kubernetes by NicolasT 6 years ago
- Optionally disable kmem accounting See: https://github.com/scality/kubernetes/commit/b04b0506e49b1b60ea7c8b74a5ca2edbf341cd6c See: https://github.com/kubernetes/kubernetes/issues/61937 See: https://gi... — committed to ryarnyah/runc by ryarnyah 6 years ago
- libcontainer/cgroups: do not enable kmem on broken kernels Commit fe898e7862f94 (PR #1350) enables kernel memory accounting for all cgroups created by libcontainer even if kmem limit is not configure... — committed to kolyshkin/runc by kolyshkin 6 years ago
- libcontainer: enable to compile without kmem Commit fe898e7862f94 (PR #1350) enables kernel memory accounting for all cgroups created by libcontainer -- even if kmem limit is not configured. Kernel ... — committed to kolyshkin/runc by kolyshkin 6 years ago
- libcontainer: ability to compile without kmem Commit fe898e7862f94 (PR #1350) enables kernel memory accounting for all cgroups created by libcontainer -- even if kmem limit is not configured. Kernel... — committed to kolyshkin/runc by kolyshkin 6 years ago
- update docker to 18.09 Ubuntu for consistency, but CentOS / RHEL 7 for the following kmem issues: https://github.com/kubernetes/kubernetes/issues/61937 https://github.com/moby/moby/issues/37722 — committed to dominodatalab/ranchhand by steved 5 years ago
- update docker to 18.09 (#11) * update docker to 18.09 Ubuntu for consistency, but CentOS / RHEL 7 for the following kmem issues: https://github.com/kubernetes/kubernetes/issues/61937 https://gi... — committed to dominodatalab/ranchhand by steved 5 years ago
- kernel: disable `CONFIG_MEMCG_KMEM` This causes kernel memory leaks when using versions of `runc` that unconditionally enable per-cgroup kernel memory resource accounting, leading to systems becoming... — committed to scality/centos-kernel by NicolasT 5 years ago
- Fixed Bugs: It seems that the kernel bug which causes this error is finally fixed now, and will be released in kernel-3.10.0-1075.el7, which is due in RHEL 7.8 http://jira.ten... — committed to tedli/kubernetes by JoshuaAndrew 5 years ago
Not only 1.9: 1.10 and master have the same issue. This is a very serious issue for production; I think providing a parameter to disable the kmem limit would be good.
/cc @dchen1107 @thockin any comments for this? Thanks.
The test case below can reproduce this error (a consolidated sketch follows after these steps).

First, fill the memory cgroup id space completely, then release 99 ids so they can be used by the next creations.

Second, create a new pod on this node; each pod creates 3 memory cgroup directories. So when we now recreate 100 memory cgroup directories, 4 of them fail.

Third, delete the test pod, confirm all of the test pod's containers are destroyed, then recreate 100 memory cgroup directories again. The correct result we expect is that only the 100th directory cannot be created. But the incorrect result is that all the memory cgroup directories created by the pod are leaked: notice that the memory cgroup count has already been reduced by 3, but the space they occupied is not released.
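A rough sketch of the procedure above, assuming the memory cgroup hierarchy is mounted at /sys/fs/cgroup/memory; directory names are made up:

```sh
cd /sys/fs/cgroup/memory

# 1. Exhaust the memory cgroup css id space (65535 ids), then free
#    99 ids for the following steps:
i=0
while mkdir "leak-test-$i" 2>/dev/null; do i=$((i + 1)); done
echo "creation started failing after $i cgroups"
for n in $(seq 0 98); do rmdir "leak-test-$n"; done

# 2. Create a pod on this node (it takes 3 of the 99 free ids), then
#    delete it and wait until its containers are gone.

# 3. Recreate 100 directories. Expected: only the 100th fails.
#    On a broken kernel the pod's 3 ids were never returned, so 4
#    directories fail instead of 1:
for n in $(seq 0 99); do
  mkdir "retest-$n" 2>/dev/null || echo "mkdir retest-$n failed"
done
```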
Please look at this: https://pingcap.com/blog/try-to-fix-two-linux-kernel-bugs-while-testing-tidb-operator-in-k8s/
Just my 2c here.
If you're on CentOS, which means kernel 3.10, I think trying to rebuild the kernel to disable kmem accounting is gonna get you in some trouble, per @scott0000's message about kABI breakage from disabling that config. CentOS/RHEL tends to care about kABI for kernel module compatibility, so disabling that option in the kernel config might bring trouble elsewhere. It also means you can't go back to the official kernels later, so you're stuck in a cycle of maintaining your own kernel…
We only explored and decided to disable kmem accounting because that was the best choice in our situation, but that's on kernel 4.4 (way more recent than 3.10, so the source is cleaner, without so many patches) and on a more contained OS that doesn't really allow external kernel modules, so kABI is not that big of a problem for us either.
I think a possible approach is to patch Kubernetes instead, so that it does not enable kmem. As pointed out before, this was introduced in PR opencontainers/runc#1350, so reverting that patch in the vendored libcontainer would probably achieve the same result of not enabling kmem accounting, and therefore bypass the issue. (Note I haven't tested any of this; I'm just suggesting it as an avenue of exploration.)
And as I pointed out in a previous comment, introducing a `--disable-kmem-limit` flag to kubelet to get that effect (or, more specifically, to prevent kmem accounting from being enabled through that libcontainer change) would be a welcome addition, though it's unclear how that could be accomplished (whether the code in libcontainer could be made conditional, etc.) and how many changes to the code would be needed. Also unclear is whether we would be willing to backport this all the way to 1.9 or 1.8… But it's a possibility. Again, if someone would like to work on that, I think it would be welcome; it just needs someone to do the work.

Anyways, in short: rebuilding the CentOS kernel is probably bad; patching the vendored libcontainer in your build is probably good and not too hard; introducing a new `--disable-kmem-limit` flag is best but hardest to accomplish. Good luck! 😃

Cheers! Filipe
@lingyanmeng one workaround is to build the kubelet with a command like the sketch below. The resulting binary should be located under `./_output/dockerized/bin/$GOOS/$GOARCH/kubelet`.
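A hedged sketch of such a build, assuming a Kubernetes tree whose vendored runc carries the `nokmem` build tag from the commits listed above:

```sh
# Dockerized build of kubelet with kernel memory accounting compiled
# out; the nokmem tag only exists if the vendored runc has the patch.
git clone https://github.com/kubernetes/kubernetes
cd kubernetes
build/run.sh make kubelet GOFLAGS="-tags=nokmem"
# The binary lands in ./_output/dockerized/bin/$GOOS/$GOARCH/kubelet
```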
It seems that the kernel bug which causes this error is finally fixed now, and will be released in kernel-3.10.0-1075.el7, which is due in RHEL 7.8, but goodness knows when that will be, as RHEL 7.7 only came out on August 6th, ~3 weeks ago.

https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101
It would be great if you could contribute that…
From our digging into it, it seems kubelet is the only process setting the kmem limit (runc does not), so in that sense you only need to fix kubelet and not runc.
On the other hand, you probably need a change in libcontainer (part of runc) to make it possible for kubelet to skip setting the kmem limit there (since it seems it was the libcontainer change that triggered this, might need to be made conditional…)
Happy to help with code reviews and further guidance. @dashpole is definitely a good contact as well.
We were able to solve the kernel memory issue on CentOS 7.7, kernel 3.10.0-1062.4.1.el7.x86_64, by setting a kernel parameter: `cgroup.memory=nokmem`. An example follows below; after reboot, check that the parameter is set.
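For example, with GRUB2 on CentOS 7 (`grubby` ships with the distro):

```sh
# Append the parameter to every installed kernel's boot entry:
grubby --update-kernel=ALL --args="cgroup.memory=nokmem"
reboot

# After reboot, check that the parameter took effect:
grep -o 'cgroup.memory=nokmem' /proc/cmdline
```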
I wonder if it will even be possible for Red Hat to fix their kernel. https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-414550261 noted that rebuilding the kernel with `CONFIG_MEMCG_KMEM` turned off is an ABI change, which Red Hat promises not to make for the full lifetime of their kernel. Furthermore, the fact that `CONFIG_MEMCG_KMEM` leaks memory in 3.10 was a known issue for this version, and the feature was never intended to be used outside of development. It wasn't finished until much later, so for Red Hat to backport all the changes needed to finish it may very well also introduce an ABI change (though this is just an assumption), so Red Hat may be stuck in this situation.

This is why it would be very much appreciated if k8s/runc provided a means to work around this issue.
Follow this recommendation from Known Issue - KMEM - MSPH-2018-0006: "We recommend that all Mesosphere DC/OS users on RHEL or CentOS 7.X upgrade to Docker 18.09.1, which was released on 01/09/2019. Docker 18.09.1 is a patch from Docker that does not enable kernel memory accounting within the Engine on the RHEL 7.X kernel."
Note: you must reboot your machine to disable kernel memory accounting that was already turned on.
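A possible upgrade path on CentOS 7, assuming the docker-ce yum repository is already configured; the pinned versions mirror what the advisory names:

```sh
# 18.09.1 no longer enables kmem accounting on RHEL/CentOS 7 kernels.
yum install -y docker-ce-18.09.1 docker-ce-cli-18.09.1 containerd.io
systemctl restart docker
# Reboot to clear any kmem accounting the old engine already enabled.
reboot
```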
Yes, I agree the root cause of this issue is the kernel, but the Red Hat issue tracking it (https://bugzilla.redhat.com/show_bug.cgi?id=1507149) has been open since 2017-10-27, so if Kubernetes could implement a flag as suggested in https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-391064357, that would probably help a LOT of people until Red Hat fixes the kernel.
FYI, we managed to recompile the kernel with kmem disabled, and it fixed the issue; however, this is not a viable solution for us, as our product has to be deployed to environments that we don't own.
@warmchang I’m not affiliated with Redhat but I believe they are working on this. See https://bugzilla.redhat.com/show_bug.cgi?id=1507149
I think I'm in favor of adding a `--disable-kmem-limit` command-line flag… I guess that means first adding the plumbing through libcontainer to make that possible, and then adding the flag to make kubelet respect it…

Indeed, there's no good way to disable this system-wide except for recompiling the kernel… We've recently gone through the effort of rebuilding our 4.4 kernel systems to disable kmem accounting. (While it's desirable to enable it on systems with 4.13 or 4.14, where the accounting works properly without leaks and the information is useful for more precise memory accounting, which should help in finding better targets for eviction in system OOMs.)
Cheers, Filipe
@gyliu513 enabling kernel memory accounting in that PR was not intentional. However, we do try to stay close to upstream runc so we can continue to receive bug fixes and other improvements. The original runc bump in cAdvisor, which required me to update runc in kubernetes/kubernetes, was for a bugfix. As pointed out in https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-377736075, the correct workaround here is to disable kernel memory accounting in your kernel.
Thanks @chilicat. I tested the `cgroup.memory=nokmem` boot option on kernel 3.10.0-1062.4.1.el7.x86_64, and yes, it works (a verification sketch follows below).

Heads-up: I also tested on kernel-3.10.0-957.5.1, where the boot option has no effect, i.e. `memory.kmem.usage_in_bytes` is not zero. @gyliu513
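A sketch of that verification; any existing memory cgroup works, the root one is used here:

```sh
# With cgroup.memory=nokmem in effect, kmem counters never move:
cat /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes   # expect 0
# On kernels where the option is ignored (e.g. 3.10.0-957.5.1),
# this reads non-zero once containers have run.
```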
cc @filbranden FYI
@feellifexp the kernel log also shows this message after upgrading to k8s 1.9.x.
CentOS 7 is a much older kernel than what we test CI on in SIG Node/upstream Kubernetes (currently the 5.4.x series). People are welcome to experiment with kernel parameters and share workarounds for their own distributions/deployments but any support will be best effort.
For reference: https://access.redhat.com/errata/RHSA-2019:3055 (released on 2019-10-15)
@qkboy, thanks for your reply. Per my understanding, the workaround is to upgrade Docker to 18.09.1; however, k8s also sets the kmem limit. I'm wondering if there is a fix for k8s.
Thanks!
@kolyshkin is there documentation for how to use https://github.com/opencontainers/runc/commit/6a2c15596845f6ff5182e2022f38a65e5dfa88eb in k8s?
@pires It doesn't help me; is there any other solution?
```
# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
# uname -r
3.10.0-862.14.4.el7.x86_64
```
The environment variables approach is absolutely fine too, if it works. Are there any plans to create a PR against the kubernetes master branch with the code from https://github.com/scality/kubernetes/commit/b04b0506 if it resolves the issue?