kubernetes: "--system-reserved" kubelet option does not work as expected

What happened:

Start kubelet with the config below:

[root@app ~]# cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf 
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=8686"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"
ExecStart=/usr/bin/kubelet 
$KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS 
--docker-root=/SkyDiscovery/docker --enforce-node-allocatable=pods,kube-reserved,system-reserved 
--kube-reserved-cgroup=/kube.slice --kubelet-cgroups=/kube.slice 
--runtime-cgroups=/kube.slice --system-reserved-cgroup=/system.slice 
--kube-reserved=cpu=2,memory=1Gi --system-reserved=cpu=2,memory=1Gi 
--eviction-hard=memory.available<5%
[root@app ~]# 

Check the related cgroup files; the output is:

[root@app ~]# cat /sys/fs/cgroup/cpu/system.slice/cpu.shares 
2048
[root@app ~]# cat /sys/fs/cgroup/cpu/system.slice/cpu.cfs_period_us 
100000
[root@app ~]# cat /sys/fs/cgroup/cpu/system.slice/cpu.cfs_quota_us 
-1
[root@app ~]# cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
1073741824
[root@app ~]# 

What you expected to happen:

I expected /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes not to be limited.

Because memory.limit_in_bytes is limited, system services may be killed by the OOM killer. I expected this config to affect only the node's allocatable memory resources (as shown by kubectl describe node $node_name).

cpu.cfs_quota_us=-1 means no limit is set on CPU usage, but memory.limit_in_bytes=1073741824 means a hard limit is set on memory usage. Why do reserved CPU and reserved memory behave differently?

How to reproduce it (as minimally and precisely as possible):

Start the kubelet service with this config:

[root@app ~]# systemctl cat kubelet 
# /etc/systemd/system/kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=http://kubernetes.io/docs/
After=cephfs-mount.service
Requires=cephfs-mount.service

[Service]
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/hugetlb/system.slice
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=8686"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS --docker-root=/SkyDiscovery/docker --enforce-node-allocatable=pods,kube-reserved,system-reserved --kube-reserved-cgroup=/kube.slice --kubelet-cgroups=/kube.slice --runtime-cgroups=/kube.slice --system-reserved-cgroup=/system.slice --kube-reserved=cpu=2,memory=1Gi --system-reserved=cpu=2,memory=1Gi --eviction-hard=memory.available<5%

Anything else we need to know?:

I have read this doc: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/ and found this information:

Be extra careful while enforcing system-reserved reservation since it can lead to critical system
services being CPU starved or OOM killed on the node. The recommendation is to enforce 
system-reserved only if a user has profiled their nodes exhaustively to come up with precise
estimates and is confident in their ability to recover if any process in that group is oom_killed.

So is it reasonable to limit /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes?

If --system-reserved is not used, /sys/fs/cgroup/cpu/system.slice/cpu.shares will be 1024 by default, which will result in system services being CPU-starved.
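To see how little the default 1024 shares guarantee on a large node, here is a rough sketch (not kubelet code; the 96-core node and the kubepods share count are assumed purely for illustration):

```python
# cpu.shares are relative weights among sibling cgroups: a cgroup's guaranteed
# CPU under full contention is its fraction of the total shares at its level.
def cores_under_contention(shares, sibling_shares, total_cores):
    total = shares + sum(sibling_shares)
    return shares * total_cores / total

# Hypothetical 96-core node: kubepods.slice holds 95 * 1024 shares,
# system.slice is left at the default 1024.
system_cores = cores_under_contention(1024, [95 * 1024], 96)
print(system_cores)  # 1.0 -- one core guaranteed for all system services
```

With everything busy, all system services together are guaranteed roughly one core, which is why the default can starve them.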

Environment:

  • Kubernetes version (use kubectl version): v1.10.12
  • Cloud provider or hardware configuration: dell server PowerEdge R740
  • OS (e.g. from /etc/os-release): CentOS7.4
  • Kernel (e.g. uname -a): 3.10.0-693.el7.centos.x86_64
  • Install tools:
  • Others:

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Reactions: 6
  • Comments: 44 (25 by maintainers)

Most upvoted comments

...
ExecStart=/usr/bin/kubelet \
$KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS \
--docker-root=/SkyDiscovery/docker --enforce-node-allocatable=pods,kube-reserved,system-reserved \
--kube-reserved-cgroup=/kube.slice --kubelet-cgroups=/kube.slice \
--runtime-cgroups=/kube.slice --system-reserved-cgroup=/system.slice \
--kube-reserved=cpu=2,memory=1Gi --system-reserved=cpu=2,memory=1Gi \
--eviction-hard=memory.available<5%

This works well.

@povsister I agree with most of what you wrote, except this part:

[root@linux ~]# cat /sys/fs/cgroup/cpu/kubepods.slice/cpu.shares
7168 # 7 cpu cores

These are not 7 CPU cores; they are shares. In your example the node has 8 cores. How much you get out of these 8 cores under contention depends on the configuration of the other slices at the same level. Let's say you have 1024 in system.slice and 1024 in user.slice: the total is 9x1024, so pods get 7/9 of 8 cores (6.22 cores) and 1.78 cores are left for the other 2 slices.

Now let's say you have not 8 but 96 cores and keep the same values for kube-reserved and system-reserved. This gives /kubepods.slice/cpu.shares = 95x1024, while the other 2 slices are unchanged at 1024 each. Pods then get 95/97 of 96 cores (~94 cores) and ~2 cores are left for the other 2 slices (twice what you reserved).

The thing is that with big machines these 2 cores still won't be enough, so you try to reserve more cores for system processes, say 6 in total: /kubepods.slice/cpu.shares = 90x1024, with the other 2 slices unchanged at 1024. Pods then get 90/92 of 96 cores, which is still close to ~94 cores, and ~2 cores are left for the other 2 slices. Adding 5 cores to kube-reserved + system-reserved ended up adding only ~0.1 core to what the non-pod processes get under contention. Changing that would really help. Hopefully my explanation was not too confusing; I tried to reflect in calculations what kulong0105 and dashpole wrote.
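The calculations in this comment can be checked with a short script (share values and node sizes taken directly from the comment; the helper function name is mine):

```python
def pods_cores_under_contention(pod_shares, other_shares, total_cores):
    # Pods' guaranteed CPU is their fraction of all shares at the same level.
    return pod_shares / (pod_shares + other_shares) * total_cores

# 8-core node: kubepods at 7 * 1024, system.slice and user.slice at 1024 each
print(round(pods_cores_under_contention(7 * 1024, 2 * 1024, 8), 2))  # 6.22

# 96-core node, 1 core reserved vs 6 cores reserved
one_reserved = pods_cores_under_contention(95 * 1024, 2 * 1024, 96)
six_reserved = pods_cores_under_contention(90 * 1024, 2 * 1024, 96)
print(round(one_reserved, 2), round(six_reserved, 2))  # 94.02 93.91

# Reserving 5 more cores moved only ~0.1 core to the non-pod slices
print(round(one_reserved - six_reserved, 2))  # 0.11
```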

I think this is designed behavior.

By default, kubelet only enforces pods allocatable. It sets cpu.shares=<nodeCapacity> - <systemReserved> - <kubeReserved> - <hardEvictionThreshold> and memory.limit_in_bytes=<same formula> for <cgroupRoot>/kubepods (cgroupfs driver) or <cgroupRoot>/kubepods.slice (systemd driver).

Let’s have an example:

Configure kubelet as follows. We reserve some resources for kube and system, but do not enforce them.

cgroupDriver: systemd
cgroupRoot: /
enforceNodeAllocatable:
  - pods
kubeReserved:
  memory: 1Gi
  cpu: 500m # equals 0.5 cpu
kubeletCgroups: /kubereserved.slice/kubelet.service
kubeReservedCgroup: /kubereserved.slice
systemReserved:
  memory: 1.5Gi
  cpu: 500m
systemReservedCgroup: /system.slice

You will get the following allocatable resources from kubectl describe node:

Capacity:
  cpu:                8
  ephemeral-storage:  92115000Ki
  hugepages-2Mi:      0
  memory:             16266012Ki  # 15.5125 GiB
  pods:               110
Allocatable:
  cpu:                7 # Apparently 7 = 8 - 0.5 - 0.5
  ephemeral-storage:  84893183860
  hugepages-2Mi:      0
  memory:             13542172Ki  # 12.9148 GiB = 15.5125 GiB - 1 GiB - 1.5 GiB - 100 MiB (default hard eviction threshold)
  pods:               110
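These Allocatable numbers can be reproduced with simple arithmetic (a sketch of the calculation, not kubelet code; all inputs come from the config and the Capacity block above):

```python
# All memory values in KiB, as kubectl reports them
capacity_mem_ki = 16266012
kube_reserved_ki = 1024 * 1024        # kubeReserved memory: 1Gi
system_reserved_ki = 1536 * 1024      # systemReserved memory: 1.5Gi
eviction_hard_ki = 100 * 1024         # default hard eviction: memory.available<100Mi

allocatable_mem_ki = (capacity_mem_ki - kube_reserved_ki
                      - system_reserved_ki - eviction_hard_ki)
print(allocatable_mem_ki)  # 13542172 -- matches Allocatable memory above

# CPU in millicores; no eviction threshold applies to CPU
allocatable_cpu_m = 8000 - 500 - 500
print(allocatable_cpu_m)   # 7000 -> shown as "7" above
```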

Let's check the cgroup settings:

[root@linux ~]# cat /sys/fs/cgroup/cpu/kubepods.slice/cpu.shares
7168 # 7 cpu cores
[root@linux ~]# cat /sys/fs/cgroup/cpu/kubepods.slice/cpu.cfs_quota_us
-1  # unlimited
[root@linux ~]# cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
13972041728  # = 13644572 KiB, which comes from 13542172Ki (nodeAllocatable) + 100MiB (hardEvictionThreshold)

[root@linux ~]# cat /sys/fs/cgroup/cpu/kubereserved.slice/cpu.shares
1024 # default value. Not "enforced"
[root@linux ~]# cat /sys/fs/cgroup/memory/kubereserved.slice/memory.limit_in_bytes
9223372036854771712  # default value. Not "enforced" either.
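The kubepods.slice values shown above follow directly from node allocatable; a quick check of the arithmetic (again just a sketch, not kubelet code):

```python
# cpu.shares: 1024 shares per allocatable CPU core
allocatable_cores = 7
pod_cpu_shares = allocatable_cores * 1024
print(pod_cpu_shares)  # 7168

# memory.limit_in_bytes: nodeAllocatable + hardEvictionThreshold, so that
# eviction (at 100Mi available) triggers before the cgroup limit is hit
allocatable_mem_ki = 13542172
eviction_hard_ki = 100 * 1024
pod_mem_limit_bytes = (allocatable_mem_ki + eviction_hard_ki) * 1024
print(pod_mem_limit_bytes)  # 13972041728
```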

Conclusion:

  • systemReserved and kubeReserved always affect nodeAllocatable, which is used as the reference for pod scheduling and eviction decisions.
  • systemReservedCgroup and kubeReservedCgroup are only required, and only enforced, if the corresponding keys are set in enforceNodeAllocatable.
  • cpu.shares only guarantees CPU time under contention; it allows other processes to borrow CPU when it is idle. cpu.cfs_quota_us instead limits the maximum CPU usage, even when plenty of CPUs are idle. I think setting cpu.shares is enough for "reservation".
  • If you "enforce" systemReserved or kubeReserved, it is not surprising that a cpu.shares and a memory.limit_in_bytes are enforced on the corresponding cgroups.

By the way, I might have found a bug in how kubelet handles systemReservedCgroup and kubeReservedCgroup; that's why I am here 🤣

OK, answering myself: I missed the implications of dashpole's proposal here: https://github.com/kubernetes/kubernetes/issues/72881#issuecomment-672154398. This is the improvement we should make to address this issue. /triage accepted /area kubelet /priority important-longterm

This could get tricky though. https://github.com/kubernetes/kubernetes/blob/7740b8124c2f684de3caeae0f2cc5d2a1329d43e/pkg/kubelet/cm/node_container_manager_linux.go#L63-L130 kubelet does not currently assume/default SystemReservedCgroup and doesn't require the user to provide it unless system-reserved is included in the EnforceNodeAllocatable keys. Thus we would have to default SystemReservedCgroup to something based on the CgroupDriver (/system.slice for systemd, for example). Making SystemReservedCgroup required when EnforceNodeAllocatable does not include system-reserved is backward incompatible.

However, if our defaulting is wrong for a particular setup, that could be backward incompatible as well 🤔