kubernetes: “--system-reserved” kubelet option cannot work as expected
What happened:
Start kubelet with the config below:
[root@app ~]# cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=8686"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS --docker-root=/SkyDiscovery/docker --enforce-node-allocatable=pods,kube-reserved,system-reserved --kube-reserved-cgroup=/kube.slice --kubelet-cgroups=/kube.slice --runtime-cgroups=/kube.slice --system-reserved-cgroup=/system.slice --kube-reserved=cpu=2,memory=1Gi --system-reserved=cpu=2,memory=1Gi --eviction-hard=memory.available<5%
[root@app ~]#
Checking the related cgroup files gives the following output:
[root@app ~]# cat /sys/fs/cgroup/cpu/system.slice/cpu.shares
2048
[root@app ~]# cat /sys/fs/cgroup/cpu/system.slice/cpu.cfs_period_us
100000
[root@app ~]# cat /sys/fs/cgroup/cpu/system.slice/cpu.cfs_quota_us
-1
[root@app ~]# cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
1073741824
[root@app ~]#
What you expected to happen:
/sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes should not be limited.
Because memory.limit_in_bytes is limited, system services may be OOM killed. I expected this config to only affect the node's available memory resources (checked with kubectl describe node $node_name).
cpu.cfs_quota_us=-1 means no limit is set on CPU usage, but memory.limit_in_bytes=1073741824 means a limit is set on memory usage. Why do reserved CPU and reserved memory behave differently?
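For context on that question, these are the cgroup v1 semantics behind the two knobs (kernel-defined meanings, shown against the values from the output above):

cat /sys/fs/cgroup/cpu/system.slice/cpu.cfs_quota_us
# -1: no hard CPU ceiling; cpu.shares only divides CPU time under contention
cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
# 1073741824: a hard 1Gi cap; exceeding it triggers the OOM killer inside system.slice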
How to reproduce it (as minimally and precisely as possible):
Start the kubelet service with this config:
[root@app ~]# systemctl cat kubelet
# /etc/systemd/system/kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=http://kubernetes.io/docs/
After=cephfs-mount.service
Requires=cephfs-mount.service
[Service]
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/hugetlb/system.slice
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=8686"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS --docker-root=/SkyDiscovery/docker --enforce-node-allocatable=pods,kube-reserved,system-reserved --kube-reserved-cgroup=/kube.slice --kubelet-cgroups=/kube.slice --runtime-cgroups=/kube.slice --system-reserved-cgroup=/system.slice --kube-reserved=cpu=2,memory=1Gi --system-reserved=cpu=2,memory=1Gi --eviction-hard=memory.available<5%
Anything else we need to know?:
I have read this doc: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/ and found this:
Be extra careful while enforcing system-reserved reservation since it can lead to critical system services being CPU starved or OOM killed on the node. The recommendation is to enforce system-reserved only if a user has profiled their nodes exhaustively to come up with precise estimates and is confident in their ability to recover if any process in that group is oom_killed.
So is it reasonable that /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes gets limited?
If --system-reserved is not used, /sys/fs/cgroup/cpu/system.slice/cpu.shares will be 1024 by default, which will result in system services being CPU starved.
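As a hedged aside (not something proposed in this issue): on a systemd host the CPU weight of system.slice can also be raised directly, which addresses the starvation concern without kubelet writing a hard memory limit.

# Assumes cgroup v1 with systemd, as on CentOS 7. CPUShares=2048 mirrors what
# enforcing system-reserved=cpu=2 would write to cpu.shares, but leaves
# memory.limit_in_bytes untouched.
systemctl set-property system.slice CPUShares=2048
cat /sys/fs/cgroup/cpu/system.slice/cpu.shares                 # now 2048
cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes   # unchanged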
Environment:
- Kubernetes version (use kubectl version): v1.10.12
- Cloud provider or hardware configuration: Dell PowerEdge R740 server
- OS (e.g. from /etc/os-release): CentOS 7.4
- Kernel (e.g. uname -a): 3.10.0-693.el7.centos.x86_64
- Install tools:
- Others:
About this issue
- State: open
- Created 5 years ago
- Reactions: 6
- Comments: 44 (25 by maintainers)
This can work well
@povsister I agree with most of what you wrote except that:
[root@linux ~]# cat /sys/fs/cgroup/cpu/kubepods.slice/cpu.shares
7168 # 7 cpu cores
These are not 7 cpu cores. They are shares. In your example the node has 8 cores. How much you get out of these 8 cores under contention depends on the configuration of the other slices at the same level. Let's say you have 1024 in system.slice and 1024 in user.slice; the total would be 9x1024. You get 7/9 of 8 cores for pods (about 6.22 cores), and about 1.78 cores are left for the other 2 slices.
Now let's say that you have not 8 but 96 cores and you keep the same values for kube-reserved and system-reserved. This gives /kubepods.slice/cpu.shares = 95x1024, while the other 2 slices are unchanged at 1024. You then have 95/97 of 96 cores for pods (about 94 cores), and about 2 cores are left for the other 2 slices (twice what you thought).
The thing is that on big machines these 2 cores still won't be enough, so you try to reserve more cores for system processes, let's say 6 in total. That gives /kubepods.slice/cpu.shares = 90x1024, with the other 2 slices unchanged at 1024. You then have 90/92 of 96 cores for pods, which is still close to ~94 cores, and ~2 cores are left for the other 2 slices. Adding 5 cores to kube-reserved + system-reserved ended up adding ~0.1 core to what the non-pod processes get under contention. Changing that would really help. Hopefully my explanation was not too confusing. I tried to reflect in calculations what kulong0105 and dashpole wrote.
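To make the arithmetic above concrete, here is a small sketch (my own, not from the thread) that computes how many cores each top-level slice would get under full contention, assuming the only siblings are kubepods.slice, system.slice and user.slice on a cgroup v1 host:

#!/bin/bash
# cpu.shares are relative weights: a slice's share of the CPUs under contention is
# its shares divided by the sum of the siblings' shares, times the core count.
cores=$(nproc)
slices="kubepods system user"
total=0
for s in $slices; do
  total=$(( total + $(cat /sys/fs/cgroup/cpu/${s}.slice/cpu.shares) ))
done
for s in $slices; do
  shares=$(cat /sys/fs/cgroup/cpu/${s}.slice/cpu.shares)
  awk -v s="$shares" -v t="$total" -v c="$cores" -v n="$s" \
    'BEGIN { printf "%s.slice: %.2f of %d cores under contention\n", n, s/t*c, c }'
done

With shares of 7168/1024/1024 on an 8-core node this prints roughly 6.22, 0.89 and 0.89, matching the 7/9 vs 2/9 split described above.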
I think this is designed behavior.
By default, kubelet only enforces the pods allocatable. It sets cpu.shares = <nodeCapacity> - <systemReserved> - <kubeReserved> - <hardEvictionThreshold>, and memory.limit_in_bytes to the same computed value, for <cgroupRoot>/kubepods (cgroupfs driver) or <cgroupRoot>/kubepods.slice (systemd driver).
Let's have an example. Configure kubelet as follows: we reserve some resources for kube and system, but do not enforce them (see the sketch below).
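(The exact config from the original comment was not preserved in this copy; the following is a minimal sketch with assumed reservation values.)

# Reserve resources but keep the default enforcement, i.e. only the "pods" cgroup is enforced:
--kube-reserved=cpu=1,memory=1Gi
--system-reserved=cpu=1,memory=1Gi
--eviction-hard=memory.available<100Mi
--enforce-node-allocatable=pods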
You will get the following allocatable resources from kubectl describe node.
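(Illustrative output shape only; the actual numbers depend on the node's capacity and the reserved values above.)

kubectl describe node <node-name> | grep -A 6 Allocatable
# Allocatable:
#   cpu:     capacity minus the reserved CPU
#   memory:  capacity minus the reserved memory and the eviction threshold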
Let's check the cgroup settings.
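(Again a sketch, assuming the cgroupfs driver so the pod cgroup is /kubepods; with the systemd driver it would be kubepods.slice.)

cat /sys/fs/cgroup/cpu/kubepods/cpu.shares
# roughly (capacity - reserved CPU) * 1024
cat /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes
# roughly capacity minus the reserved memory (a hard cap)
cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
# stays at the cgroup default (effectively unlimited) because system-reserved is not enforced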
Conclusion:
- systemReserved and kubeReserved always affect nodeAllocatable, which is used for pod scheduling and eviction decisions.
- systemReservedCgroup and kubeReservedCgroup are only required and enforced if the corresponding keys are set in enforceNodeAllocatable.
- cpu.shares only guarantees cpu time under contention; it allows other processes to borrow cpu when idle. cpu.cfs_quota_us instead limits the maximum cpu usage even when plenty of cpu is idle. I think setting cpu.shares is enough for "reservation".
- If you do enforce systemReserved or kubeReserved, it's not surprising that a cpu.shares and a memory.limit_in_bytes will be enforced on the corresponding cgroups.
By the way, I might have found a bug in how kubelet handles systemReservedCgroup and kubeReservedCgroup, which is why I am here 🤣
ok, answering myself: I missed the implications of dashpole's proposal here: https://github.com/kubernetes/kubernetes/issues/72881#issuecomment-672154398 . This is the improvement we should make to address this issue.
/triage accepted
/area kubelet
/priority important-longterm
This could get tricky though. https://github.com/kubernetes/kubernetes/blob/7740b8124c2f684de3caeae0f2cc5d2a1329d43e/pkg/kubelet/cm/node_container_manager_linux.go#L63-L130
kubelet does not currently assume/default SystemReservedCgroup and doesn't require the user to provide it unless system-reserved is included in the EnforceNodeAllocatable keys. Thus we would have to default SystemReservedCgroup to something based on the CgroupDriver (/system.slice for systemd, for example). Making SystemReservedCgroup required when EnforceNodeAllocatable does not include system-reserved is backward incompatible.
However, if our defaulting is wrong for a particular setup, that could be backward incompatible as well 🤔