kubernetes: setting quotas while creating a pod fails

What happened: A pod of my deployment suddenly doesn’t come up again and hangs in “CrashLoopBackOff”

What you expected to happen: When a pod of a deployment is killed for some reason, I expect it to come up again.

How to reproduce it (as minimally and precisely as possible): This only happens sometimes and cannot be reproduced reliably; I am still trying …

Anything else we need to know?: kubectl describe pod shows this error-message: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:367: setting cgroup config for procHooks process caused \\\"failed to write 200000 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/container.slice/kubepods/burstable/pod6ba8075b-132e-11e9-ab2e-246e9674888c/a5f752a5a36fafeab7f16beb4763521cf2370efc3ba961e85a8ac1faef721b48/cpu.cfs_quota_us: invalid argument\\\"\"": unknown

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Hardware
  • OS (e.g. from /etc/os-release): coreos 1967.3.0
  • Kernel (e.g. uname -a): 4.14.88
  • Install tools: terraform
  • Others:

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 37 (3 by maintainers)

Most upvoted comments

Hi @liqlin2015

@xueweiz Do you know if there is a workaround for the kernel issue? Reboot the node?

Yeah reboot would for sure mitigate the problem. Until you hit it again after the reboot 😃

Another mitigation that has worked for us is this script: https://github.com/xueweiz/scripts/blob/master/fix-cfs/fix-cfs.py

To verify whether it works for you, you can find a broken node and SSH into it, then do:

git clone https://github.com/xueweiz/scripts.git
sudo python2 scripts/fix-cfs/fix-cfs.py

Please share the output logs with me if it didn’t work for you. If it worked, then we can consider deploying it in production as a mitigation.

You should have already observed that on a broken node, you cannot create any additional pods through k8s. So you cannot fix an already-broken node by deploying a DaemonSet to it.

But what you can do is run my script in a DaemonSet ahead of time, give it the proper privileges and mount points, and run it continuously (e.g. once per minute); a sketch of such a DaemonSet is shown below. That should help you repair your nodes quickly once the problem happens. If you verify that the script works for you, we can write a DaemonSet for deploying it together.
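
For illustration, a minimal DaemonSet sketch, assuming the script is baked into an image called fix-cfs:latest (the image name, run interval, and script path inside the image are placeholders, not something from this thread):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fix-cfs
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fix-cfs
  template:
    metadata:
      labels:
        app: fix-cfs
    spec:
      containers:
      - name: fix-cfs
        # Placeholder image that contains fix-cfs.py from the repo above.
        image: fix-cfs:latest
        command: ["/bin/sh", "-c", "while true; do python2 /fix-cfs.py; sleep 60; done"]
        securityContext:
          privileged: true        # required to write the host's cgroup files
        volumeMounts:
        - name: cgroup
          mountPath: /sys/fs/cgroup
      volumes:
      - name: cgroup
        hostPath:
          path: /sys/fs/cgroup    # the script repairs cpu.cfs_*_us values here
EOF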

And I still don’t get how this issue happens sometimes; are there any steps to reproduce the problem?

Yes, it’s reproducible. But I’m not sure whether you’ll like it or not… It’s pretty clean and simple (if you know what it is trying to do), but a bit involved to carry out…

The bug is caused by this kernel logic, where the kernel may scale the CFS quota and period by 147/128. If that happens on a Pod cgroup, it causes an inconsistency between the pod and its containers.
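
To make the numbers used later concrete, here is the 147/128 scaling applied to the default values of the pod used in this reproduction (these are the same values written by the echo commands further down):

# 147/128 scaling of period=100000us and quota=1000us (integer arithmetic):
echo $(( 100000 * 147 / 128 ))   # 114843 -> the scaled cpu.cfs_period_us
echo $((   1000 * 147 / 128 ))   # 1148   -> the scaled cpu.cfs_quota_us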

So to reproduce it, you need to find a long-running Pod (on GKE, prometheus-to-sd would do), then do this:

daemon_set=prometheus-to-sd
kubectl get pods --all-namespaces | grep ${daemon_set}
# find one of the pod from above output, and record it
pod_name=prometheus-to-sd-kkws9
pod_uid=$(kubectl get pod $pod_name -n kube-system -o yaml | yq -y .metadata.uid | head -n 1)
node_name=$(kubectl get pod $pod_name -n kube-system -o yaml | yq -y .spec.nodeName | head -n 1)
container_id=$(kubectl get pod $pod_name -n kube-system -o yaml | yq -y .status.containerStatuses[0].containerID | head -n 1)
container_id=${container_id#"docker://"}

echo "POD=$pod_uid"
echo "CONTAINER=$container_id"
echo "gcloud compute ssh $node_name"

Then you should SSH into the node (gcloud compute ssh $node_name) and run the commands below:

# Set POD and CONTAINER environment variables
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/$CONTAINER/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/$CONTAINER/cpu.cfs_period_us
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/cpu.cfs_period_us

You should see that the period_us for both POD and CONTAINER is 100000, and that the quota_us for POD and CONTAINER are equal and somewhere in [1000, 100000]. If not, find another Pod 😃. On GKE, prometheus-to-sd should have these properties.

Then you want to scale the POD’s quota/period by 147/128. You cannot do this on the POD directly, because the kernel will tell you it is invalid (and interestingly, when the kernel does this operation by itself, it does not perform that check 😃). So you have to first scale all of the CONTAINERs under the POD, then scale the POD. For prometheus-to-sd there is only one container, with quota=1000 and period=100000, so we have:

echo 114843 > /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/$CONTAINER/cpu.cfs_period_us
echo 1148 > /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/$CONTAINER/cpu.cfs_quota_us
echo 1148 > /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/cpu.cfs_quota_us
echo 114843 > /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/cpu.cfs_period_us

Then you can stop the CONTAINER via docker stop $CONTAINER. Now we have reproduced the root cause!
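
For example, assuming the same $POD and $CONTAINER variables, stopping the container and reading back the pod-level values as a sanity check looks like this:

docker stop $CONTAINER
# The pod-level cgroup keeps the scaled values (114843 / 1148); this is the state
# that later rejects the container's original quota=1000, period=100000 settings.
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/cpu.cfs_period_us
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod$POD/cpu.cfs_quota_us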

Then, to trigger the k8s symptom, you can either wait a few minutes, or use crictl to hack around and accelerate the process:

$ crictl pods | grep prometheus-to-sd
83013cc154a31       27 hours ago        Ready               prometheus-to-sd-sh7kh                              kube-system         1

# Get the container in that pod using the pod id from above
$ crictl ps -a | grep 83013cc154a31
c72d0d73c7a88       5b48699744eaf        8 seconds ago       Created             prometheus-to-sd            318                 83013cc154a31

# start the container!
$ crictl start c72d0d73c7a88
FATA[0002] Starting the container "c72d0d73c7a88" failed: rpc error: code = Unknown desc = failed to start container "c72d0d73c7a88": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 1000 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod542047ca-d678-11e9-8bed-42010a8001c1/c72d0d73c7a88c514d933e1de43c4b3d9db646342ae989da756b01888de5faa6/cpu.cfs_quota_us: invalid argument\\\"\"": unknown

We only see the problem in a few environments with Ubuntu 16.04 + Docker CE 18.09.7 (using the cgroupfs driver).

Sorry I’m not that familiar with your environment. But I think you just need to verify two things:

  1. Are you using a kernel containing this patch? To answer this, fetch your kernel source code, look at kernel/sched/fair.c, and search for “115%”. If you find it, your kernel has the patch; if not, it does not.

  2. Are your containers/pods managed by CFS? To answer this, check whether there is anything under the directory /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/. If yes, then you are using CFS; if not, probably not.

If your answer to both questions is yes, then very likely you are affected by the same bug 😃 (a minimal shell sketch of both checks is below).
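
A sketch of both checks, assuming the kernel source tree is checked out at ./linux (that path is just a placeholder):

# Check 1: does kernel/sched/fair.c mention the 115% scaling? A match means the
# kernel contains the quota/period scaling logic described above.
grep -n "115%" ./linux/kernel/sched/fair.c

# Check 2: are your pods managed under the CFS cgroup hierarchy?
ls /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/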

@etcshad0vv Just learnt that the patch has also been released in Linux stable v5.3.9. See the release note and the backport commit for v5.3.9.

I wonder if that can unblock you? (I’m not sure whether you are able to pick up a v5.3.9 kernel or whether you have to stick with your current kernel version.)

We’re still experiencing this in GKE 1.12 at our company. Refreshing the issue.

Great 😃 thanks for confirming @etcshad0vv

For posterity, here are the patched Linux stable kernel versions that contain the fix: v4.14.159 and v4.19.89. The patch is also included in mainline from v5.4 onward.
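
As a rough way to check whether a node already runs a fixed kernel (this only compares the upstream base version and ignores vendor backports, so treat the result as a hint):

# Compare the running kernel's base version against the first fixed 4.19.y release.
uname -r
printf '%s\n' "$(uname -r | cut -d- -f1)" "4.19.89" | sort -V | head -n 1
# If this prints 4.19.89, the running kernel is at least 4.19.89.
# Substitute 4.14.159 for the 4.14.y series, or 5.4 for newer kernels.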

For curious minds, feel free to read this doc explaining the bug.

I think we can finally close this long-going bug now 😃

/assign @xueweiz /close

@xueweiz indeed the issue is solved, thank you so much for working on the kernel patching. 😃

@etcshad0vv It’s not backported into any stable branches yet. I’ll submit a backport request in a few days and will update here when: 1. the stable kernel backport request is filed, and 2. a stable release ships with this patch.

@donnyv12, @joshk0, @luhkevin, could you try out the 1.13.12-gke.13 release? It uses Container-Optimized OS version cos-73-11647-329-0, which has the fix for a bug that we observed.

We don’t know whether it has the same root cause as this issue, but the symptom is identical. Please let me know whether the new release fixes the problem you are observing 😃 If it does, then we should close this issue, since that would confirm the bug is actually a Linux kernel bug rather than a Kubernetes bug.
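
For reference, a sketch of rolling a GKE cluster onto that release with gcloud (the cluster, node-pool, and zone names are placeholders):

# Upgrade the control plane first, then the node pool, to the release with the fixed COS image.
gcloud container clusters upgrade my-cluster --zone us-central1-a --master --cluster-version 1.13.12-gke.13
gcloud container clusters upgrade my-cluster --zone us-central1-a --node-pool default-pool --cluster-version 1.13.12-gke.13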

Hi @mcginne, thanks for sharing these outputs! They do confirm my theory: basically this is not a Kubernetes bug, but a Linux kernel bug.

Yes, you are right about the kernel log message; it is the culprit. It’s a kernel bug caused by https://github.com/torvalds/linux/commit/2e8e19226398db8265a8e675fcc0118b9e80c9e8

I proposed this patch to fix it: https://lore.kernel.org/patchwork/patch/1135155/ The patch was just accepted by the Linux maintainers yesterday. You will probably need to talk to your OS provider to make sure they cherry-pick this patch into their kernel.

Hi @joshk0, @luhkevin,

You might be seeing a different issue than https://github.com/kubernetes/kubernetes/issues/76704

I wonder if you could let me know your kernel version? And could you run the command below on your affected node?

sudo find /sys/fs/cgroup -name cpu.cfs_period_us -printf "%p\t" -exec cat {} \; | grep -v 100000

If the command doesn’t show anything, please ignore my comment.

And if this command shows output like the example below, you might be affected by https://lore.kernel.org/patchwork/patch/1135155/

/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod9649f0a1-e3c1-11e9-b00c-42010a8000cf/cpu.cfs_period_us	114843

In that case, could you then run the command below and share the output? Thanks!

# replace the podxxxx with your folder name

sudo find /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod9649f0a1-e3c1-11e9-b00c-42010a8000cf/ -name cpu.cfs_*_us -printf "%p\t" -exec cat {} \;

We are seeing this too, on GKE 1.13.6-gke.13

oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:291: setting cgroup config for ready process caused \"failed to write 100000 to cpu.cfs_period_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods/pod7beb03fa-d009-11e9-a6b4-42010a800245/592ee7f21ec5021c279f87476775f041a044d4cb3c6bc70fefe967512c264c35/cpu.cfs_period_us: invalid argument\""

or

OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 100000 to cpu.cfs_period_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podd97b5de1-c855-11e9-828e-42010a800264/CONTAINER-NAME-REDACTED/cpu.cfs_period_us: invalid argument\\\"\""