kubernetes: Container start failed on large cpu machines - cpu-shares gets set larger than the maximum of 262144

What happened:

Deployment of a Pod onto a machine with 488 cores fails if the Pod is deployed with QoS Guaranteed or Burstable and the CPU request is larger than 256.

Kubelet Error Trace: RunContainerError: failed to start container "4ad8bf2f988c9c387d8136ac664caf4af80648ac436e92dac6ecd1f9aece9144": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:415: setting cgroup config for procHooks process caused \\\"The maximum allowed cpu-shares is 262144\\\"\"": unknown
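For context, the failure boundary follows from how a CPU request maps to cpu.shares (roughly 1024 shares per CPU). Below is a minimal sketch of that conversion, with illustrative constant and function names rather than the actual kubelet code, showing why any request above 256 CPUs crosses the 262144 limit:

```go
package main

import "fmt"

// Simplified sketch of how a CPU request (in millicores) maps to cpu.shares.
// The constants follow the usual Linux convention of 1024 shares per CPU;
// this is not the actual kubelet implementation.
const (
	sharesPerCPU  = 1024
	milliCPUToCPU = 1000
	maxShares     = 262144 // maximum accepted by the runtime ("The maximum allowed cpu-shares is 262144")
)

func milliCPUToShares(milliCPU int64) int64 {
	return milliCPU * sharesPerCPU / milliCPUToCPU
}

func main() {
	for _, cpus := range []int64{256, 257, 300, 488} {
		shares := milliCPUToShares(cpus * 1000)
		fmt.Printf("request=%d CPUs -> cpu.shares=%d (over limit: %v)\n",
			cpus, shares, shares > maxShares)
	}
	// 256 CPUs map to exactly 262144 shares; anything above 256 CPUs
	// exceeds the maximum and the container fails to start.
}
```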

What you expected to happen:

Normal start of the container, with the right number of cpu-shares set on the cgroup the Pod runs in.

How to reproduce it (as minimally and precisely as possible):

Install Kubernetes on a machine with more than 300 cores and deploy a Pod with QoS Guaranteed or Burstable where the CPU request is set to more than 256 CPUs.
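For a concrete reproduction, the Pod only needs requests equal to limits (Guaranteed QoS) with a CPU value above 256. A sketch using the Go client types, where the Pod name, image, and resource sizes are arbitrary examples:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Guaranteed QoS: requests == limits for all resources.
	// A 300-CPU request on a 488-core node triggers the failure.
	resources := corev1.ResourceRequirements{
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("300"),
			corev1.ResourceMemory: resource.MustParse("64Gi"),
		},
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("300"),
			corev1.ResourceMemory: resource.MustParse("64Gi"),
		},
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-shares-repro"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:      "pause",
				Image:     "k8s.gcr.io/pause:3.2",
				Resources: resources,
			}},
		},
	}
	fmt.Printf("%+v\n", pod)
}
```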

Anything else we need to know?:

Deployments without CPU requests and limits work. Deployments with a CPU request <= 256 work as well.

I found this change: https://github.com/kubernetes/kubernetes/pull/93248. Unfortunately, the changed code is not used in all cases, and it is not used in my case.

Environment:

  • Kubernetes version (use kubectl version): Kubernetes v1.19.8
  • Cloud provider or hardware configuration: AWS / EC2 u-6tb1.metal (6 TB memory / 488 vCPUS)
  • OS (e.g: cat /etc/os-release): NAME="SLES" VERSION="15-SP1" VERSION_ID="15.1" PRETTY_NAME="SUSE Linux Enterprise Server 15 SP1" ID="sles" ID_LIKE="suse" ANSI_COLOR="0;32" CPE_NAME="cpe:/o:suse:sles:15:sp1" VARIANT_ID="chost"
  • Kernel (e.g. uname -a): Linux ip-10-250-0-15.eu-central-1.compute.internal 4.12.14-197.72-default
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (12 by maintainers)

Most upvoted comments

/assign @odinuge

I think the issue is in the Kubelet.

The kernel uses the range [2-262144] for cpu.shares, but the kubelet can generate and request values outside of the valid range.

When using cgroupfs, the kernel automatically clamps the value. systemd, in contrast, performs an additional check and refuses to use an invalid value (even though the kernel would accept and clamp it).

The clamping is just there to make the systemd driver behave the same way as cgroupfs; as a result, any value above the valid range ends up with the same (maximum) number of cpu.shares.

@renkes would you mind opening a PR to add the clamping also for your case?
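A minimal sketch of what that clamping could look like (illustrative names and layout, not the actual kubelet helper):

```go
package main

import "fmt"

// Keep cpu.shares inside the range the kernel accepts, [2, 262144], so the
// systemd driver is never asked to set an out-of-range value it would reject.
// Constant and function names are illustrative.
const (
	minShares int64 = 2
	maxShares int64 = 262144
)

func clampShares(shares int64) int64 {
	if shares < minShares {
		return minShares
	}
	if shares > maxShares {
		return maxShares
	}
	return shares
}

func main() {
	// 488 CPUs worth of request -> 499712 raw shares, clamped to 262144.
	raw := int64(488) * 1024
	fmt.Println(clampShares(raw)) // 262144
}
```

With the raw value clamped into [2, 262144], the systemd driver accepts the configuration and behaves like cgroupfs, at the cost that every request above 256 CPUs maps to the same maximum share value.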