kubernetes: setting quotas while creating a pod fails
What happened: A pod of my deployment suddenly doesn’t come up again and hangs in “CrashLoopBackOff”
What you expected to happen: When a pod of a deployment is killed for some reason, I expect it to come up again.
How to reproduce it (as minimally and precisely as possible): This only happens sometimes and cannot be reproduced reliably; I am still trying …
Anything else we need to know?:
`kubectl describe pod` shows this error message:
OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:367: setting cgroup config for procHooks process caused \\\"failed to write 200000 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/container.slice/kubepods/burstable/pod6ba8075b-132e-11e9-ab2e-246e9674888c/a5f752a5a36fafeab7f16beb4763521cf2370efc3ba961e85a8ac1faef721b48/cpu.cfs_quota_us: invalid argument\\\"\"": unknown
Environment:
- Kubernetes version (use `kubectl version`): Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Hardware
- OS (e.g. from /etc/os-release): CoreOS 1967.3.0
- Kernel (e.g. `uname -a`): 4.14.88
- Install tools: Terraform
- Others:
About this issue
- State: closed
- Created 5 years ago
- Comments: 37 (3 by maintainers)
Commits related to this issue
- Add fix-cfs.py to mitigate https://github.com/kubernetes/kubernetes/issues/72878 — committed to xueweiz/scripts by deleted user 5 years ago
Hi @liqlin2015
Yeah, a reboot would for sure mitigate the problem. Until you hit it again after the reboot 😃
Another mitigation that we used and that has worked for us is this script: https://github.com/xueweiz/scripts/blob/master/fix-cfs/fix-cfs.py
To verify whether it works for you, you can find a broken node and SSH into it, then do:
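(I don't have the exact commands at hand; a plausible sketch, assuming the node has `curl` and Python available and that the script takes no required arguments:)

```bash
# Fetch the mitigation script onto the broken node (assumes outbound HTTPS access).
curl -fsSL -o /tmp/fix-cfs.py \
  https://raw.githubusercontent.com/xueweiz/scripts/master/fix-cfs/fix-cfs.py

# Run it as root so it can rewrite cpu.cfs_quota_us/cpu.cfs_period_us under
# /sys/fs/cgroup, and keep the output so it can be shared afterwards.
sudo python /tmp/fix-cfs.py 2>&1 | tee /tmp/fix-cfs.log
```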
Please share the output logs with me if it didn’t work for you. If it worked, then we can consider deploying it in production as a mitigation.
You should already have observed that on a broken node you cannot create any additional pods through k8s, so you cannot fix an already-broken node by deploying a DaemonSet.
But what you can do is run my script in a DaemonSet, give it the proper privileges and mount points, and run it continuously (e.g. once per minute). That should help you repair your nodes quickly once the problem happens. If you verify that the script works for you, we can together write a DaemonSet for deploying it, along the lines of the sketch below.
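For reference, the kind of DaemonSet I have in mind looks roughly like this; the image, the download step, and the mounts are assumptions on my part and would need to be adapted to whatever fix-cfs.py actually requires:

```bash
# Rough sketch: run fix-cfs.py on every node, once per minute, with enough
# privileges to rewrite the host's cgroup files. Not a tested manifest.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fix-cfs
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fix-cfs
  template:
    metadata:
      labels:
        app: fix-cfs
    spec:
      containers:
      - name: fix-cfs
        image: python:3-slim           # any image with Python will do
        securityContext:
          privileged: true             # needed to write the host cgroup files
        volumeMounts:
        - name: cgroup
          mountPath: /sys/fs/cgroup    # host cgroup tree the script repairs
        command: ["/bin/sh", "-c"]
        args:
        - |
          python -c "import urllib.request as u; u.urlretrieve('https://raw.githubusercontent.com/xueweiz/scripts/master/fix-cfs/fix-cfs.py', '/fix-cfs.py')"
          while true; do python /fix-cfs.py; sleep 60; done
      volumes:
      - name: cgroup
        hostPath:
          path: /sys/fs/cgroup
EOF
```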
Yes, it's reproducible. But I'm not sure whether you'll like it or not… It's pretty clean and simple (if you know what it is trying to do), though a bit complicated…
The bug is caused by this kernel logic, where the kernel may scale the CFS quota and period of a cgroup by 147/128. If that happens on a Pod cgroup, it causes an inconsistency between the pod and its containers.
So to reproduce it, you need to find a long-running Pod (on GKE, `prometheus-to-sd` would do) and note which node it is running on. Then you should SSH into the node (`gcloud compute ssh $node_name`) and run the commands below.
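A minimal sketch of what to inspect, assuming cgroup v1 and the kubepods hierarchy mentioned later in this thread (fill in the pod UID and container ID yourself):

```bash
# Pod UID: kubectl get pod <pod> -o jsonpath='{.metadata.uid}'
# Container ID: from `docker ps` (or `crictl ps`) on the node
POD_UID=...          # the Pod's metadata.uid
CONTAINER_ID=...     # the long container ID

POD_CG=/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod${POD_UID}
CONTAINER_CG=${POD_CG}/${CONTAINER_ID}

# Period and quota should be consistent between the pod and its container.
cat "$POD_CG/cpu.cfs_period_us"       "$POD_CG/cpu.cfs_quota_us"
cat "$CONTAINER_CG/cpu.cfs_period_us" "$CONTAINER_CG/cpu.cfs_quota_us"
```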
You should see that the period_us for both the POD and the CONTAINER is 100000, and that the quota_us for both is equal and lies between [1000, 100000]. If not, find another Pod 😃. In GKE, `prometheus-to-sd` should have the above properties.
Then you want to scale the POD's quota/period by 147/128. You cannot do this directly, because the kernel will tell you it is invalid (and interestingly, when the kernel does this operation by itself, it does not check it 😃). So you have to first scale all of the CONTAINERs under the POD, then scale the POD. For `prometheus-to-sd`, there is only one container, with quota=1000 and period=100000, so we scale both to roughly quota=1148 and period=114843, as in the sketch below.
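Roughly like this, reusing the `POD_CG`/`CONTAINER_CG` variables from the earlier sketch; the integers are approximations of a 147/128 scaling, and the write order is my guess at what keeps the kernel's hierarchical quota check happy (if a write is rejected with "invalid argument", try swapping the quota/period order):

```bash
# Scale the CONTAINER first: 100000 * 147/128 ≈ 114843, 1000 * 147/128 ≈ 1148.
echo 114843 | sudo tee "$CONTAINER_CG/cpu.cfs_period_us"
echo 1148   | sudo tee "$CONTAINER_CG/cpu.cfs_quota_us"

# Then scale the POD the same way. After this, the pod cgroup no longer matches
# what the kubelet will ask for when it recreates the container.
echo 1148   | sudo tee "$POD_CG/cpu.cfs_quota_us"
echo 114843 | sudo tee "$POD_CG/cpu.cfs_period_us"
```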
Then you can stop the CONTAINER via `docker stop $CONTAINER`. Now we have reproduced the root cause! Then, to trigger the k8s symptom, you can either wait for a few minutes, or you can use `crictl` to hack around and accelerate the process.
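I don't recall the exact `crictl` commands that were used; one way to speed things up (my assumption, not necessarily the original approach) is to remove the stopped container so the kubelet recreates it immediately instead of waiting:

```bash
# Find the stopped container's CRI ID, then remove it so the kubelet recreates
# it right away and immediately trips over the rescaled pod cgroup.
sudo crictl ps -a | grep prometheus-to-sd
sudo crictl rm "$CRI_CONTAINER_ID"   # ID taken from the crictl ps output above
```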
Sorry, I'm not that familiar with your environment. But I think you just need to verify two things:
- Are you using a kernel containing this patch? To answer this, fetch your kernel source code, look at `kernel/sched/fair.c`, and search for "115%". If you see anything, you are using the patch; if not, you are not.
- Are your containers/pods managed by CFS? To answer this, check whether there is anything under the directory `/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/`. If yes, you are using CFS; if not, probably not.

If your answers to both questions are yes, then very likely you are affected by the same bug 😃. A quick way to check both is sketched below.
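Assuming you have the matching kernel source tree checked out somewhere, the checks look like this:

```bash
# 1) Does your kernel carry the quota/period rescaling logic? The patch adds a
#    "~115%" comment near the CFS period timer code.
grep -n "115%" kernel/sched/fair.c     # run from the root of your kernel source

# 2) Are your pods managed by CFS on this node?
ls /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/
```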
@etcshad0vv Just learnt that the patch has also been released in Linux stable v5.3.9. See the release note and the backport commit for v5.3.9.
I wonder if that can unblock you? (I'm not sure whether you are able to pick up a v5.3.9 kernel, or whether you have to stick with your current kernel version.)
We're still experiencing this in GKE 1.12 at our company. Refreshing the issue.
Great 😃 thanks for confirming @etcshad0vv
For posterity, here are all the patched Linux stable kernel versions that contain the fix: v4.14.159 and v4.19.89. The patch is also included in mainline starting with v5.4.
For curious minds, feel free to read through this doc explaining the bug.
I think we can finally close this long-going bug now 😃
/assign @xueweiz
/close
@xueweiz indeed the issue is solved, thank you so much for working on the kernel patching. 😃
@etcshad0vv It's not backported into any stable branches yet. I'll submit a backport request in a few days and will update here when: 1. the stable kernel backport request is filed, and 2. a stable release ships with this patch.
@donnyv12 , @joshk0 , @luhkevin , Could you try out the 1.13.12-gke.13 release? It uses Container-optimized OS version cos-73-11647-329-0, which has the fix for a bug that we observed.
We don't know whether it has the same root cause as this issue, but the symptom is identical. Please let me know whether or not the new release fixes the problem you are observing 😃 If it fixes the problem, then we should close this issue, and we will know the bug is actually a Linux kernel bug rather than a Kubernetes bug.
Hi @mcginne, thanks for sharing this output! It does confirm my theory: basically, this is not a Kubernetes bug but a Linux kernel bug.
Yes, you are right about the kernel log message; it is the culprit. It's a kernel bug caused by https://github.com/torvalds/linux/commit/2e8e19226398db8265a8e675fcc0118b9e80c9e8
I proposed this patch to fix it: https://lore.kernel.org/patchwork/patch/1135155/ The patch was accepted by the Linux maintainers yesterday. You will probably need to talk to your OS provider to make sure they cherry-pick this patch into their kernel.
Hi @joshk0 , @luhkevin ,
You might be seeing a different issue than https://github.com/kubernetes/kubernetes/issues/76704
I wonder if you could let me know your kernel version? And could you run the command below on your affected node?
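Something along these lines should do, assuming the point is to spot the ratelimited warning the suspect kernel change prints when it rescales a cgroup's CFS period (my reconstruction, treat it as such):

```bash
# Look for the CFS period-timer warning in the kernel log of the affected node.
dmesg -T | grep -i cfs_period
```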
If the command doesn’t show anything, please ignore my comment.
And if this command shows any matching output, you might potentially be affected by https://lore.kernel.org/patchwork/patch/1135155/
In that case, could you then run the command below and share the output? Thanks!
We are seeing this too, on GKE 1.13.6-gke.13
or