kubernetes: (usage - inactive file) greater than capacity causing eviction
What happened?
/sig node
free shows the total memory is 32Gi and only 6Gi used:
- free: 12Gi
- shared: 1.6Gi
- buff/cache: 13Gi
- available: 19Gi
Kernel: 4.19.90-23.8.v2101.ky10.aarch64
- https://kubernetes.io/examples/admin/resource/memory-available.sh is the script to calculate memory.available, referenced from https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/. Its output on the affected node:
memory.capacity_in_bytes 32879017984 (~32Gi)
memory.usage_in_bytes 80342220800 (~80Gi)
memory.total_inactive_file 10048962560 (~10Gi)
memory.working_set 70293258240 (~70Gi)
memory.available_in_bytes -37414240256 (-35Gi)
memory.available_in_kb -36537344
memory.available_in_mb -35681
- working set memory (70Gi) = memory.usage_in_bytes (80Gi) - inactive file (10Gi)
- this still includes active file memory
- available = capacity (32Gi) - working set memory (70Gi) = -38Gi < 0 (the calculation is sketched below)
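For reference, here is a minimal shell sketch of that calculation, adapted from the memory-available.sh script linked above; a cgroup v1 hierarchy mounted at /sys/fs/cgroup/memory is assumed:

```bash
#!/usr/bin/env bash
# Reproduce the kubelet's memory.available calculation against the root cgroup.
memory_capacity_in_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(awk '$1 == "total_inactive_file" {print $2}' /sys/fs/cgroup/memory/memory.stat)

# working set = usage - inactive file, floored at zero
memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
[ "$memory_working_set" -lt 0 ] && memory_working_set=0

# On the affected node this goes negative: ~32Gi capacity - ~70Gi working set.
memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
echo "memory.working_set        $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
```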
With the script, we got a negative memory.available_in_bytes, which is obviously a bug in the Kylin operating system: memory.usage_in_bytes should never exceed the capacity.
What did you expect to happen?
No eviction when the problem is that memory.usage_in_bytes is greater than memory.capacity_in_bytes.
Until the OS bug is fixed, can kubelet stop evicting pods when usage_in_bytes is not reported correctly?
How can we reproduce it (as minimally and precisely as possible)?
Yes.
And the workaround is to drop the page cache; afterwards, usage_in_bytes and the active/inactive file counters become correct again.
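A minimal sketch of that workaround (requires root):

```bash
# Flush dirty pages, then drop the page cache plus dentries and inodes.
# On the affected node, memory.usage_in_bytes and the active/inactive
# file counters return to sane values after this.
sync
echo 3 > /proc/sys/vm/drop_caches
```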
Anything else we need to know?
Eviction is dangerous when the condition check is based on wrong data.
Kubernetes version
$ kubectl version
1.23 (reproduced on 1.25 as well)
Cloud provider
vSphere
OS version
4.19.90-23.8.v2101.ky10.aarch64
Install tools
kubeadm
Container runtime (CRI) and version (if applicable)
docker
Related plugins (CNI, CSI, …) and versions (if applicable)
calico
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 16 (11 by maintainers)
Maybe k8s should not use usage_in_bytes because it’s a fuzz value. See https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt:
5.5 usage_in_bytes
For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn’t show ‘exact’ value of memory (and swap) usage, it’s a fuzz value for efficient access. (Of course, when necessary, it’s synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).
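Following that recommendation, a more exact usage figure can be derived from memory.stat instead; a minimal sketch, again assuming cgroup v1 paths:

```bash
# Compute "exact" memory usage as RSS + CACHE (+ SWAP) from memory.stat,
# per section 5.5 of the cgroup-v1 memory documentation, instead of
# trusting the fuzzed memory.usage_in_bytes value.
stat=/sys/fs/cgroup/memory/memory.stat
rss=$(awk '$1 == "total_rss" {print $2}' "$stat")
cache=$(awk '$1 == "total_cache" {print $2}' "$stat")
swap=$(awk '$1 == "total_swap" {print $2}' "$stat")  # absent if swap accounting is off
echo "exact usage_in_bytes: $((rss + cache + ${swap:-0}))"
```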
Regardless of whether this was caused by active file memory or not, I think we should get back to discussing the initial question of this issue, which is: what is the expected eviction behavior if usage_bytes or working_set_bytes exceeds capacity_bytes?
I think that if usage_bytes > capacity, then there is some bug in the underlying OS that exposes the memory stats info. So we should not evict based on this, but instead raise errors of some kind and alert the cluster owner to the issue. That is my two cents.
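As an illustration of that suggestion, here is a hypothetical node-side sanity check (an external script, not actual kubelet code) that raises an error instead of acting on inconsistent counters; cgroup v1 paths are assumed:

```bash
#!/usr/bin/env bash
# Hypothetical sanity check: flag impossible cgroup memory counters
# rather than letting them drive eviction decisions.
capacity=$(( $(grep MemTotal /proc/meminfo | awk '{print $2}') * 1024 ))
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)

if [ "$usage" -gt "$capacity" ]; then
    # usage_in_bytes > capacity cannot happen on a healthy kernel, so
    # surface an error for the cluster owner instead of evicting pods.
    echo "ERROR: memory.usage_in_bytes ($usage) exceeds capacity ($capacity); stats are unreliable" >&2
    exit 1
fi
```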