kubernetes: (usage - inactive file) greater than capacity causing eviction

What happened?

/sig node

free shows 32Gi total memory with only 6Gi used: free 12Gi, shared 1.6Gi, buff/cache 13Gi, available 19Gi.

4.19.90-23.8.v2101.ky10.aarch64

memory.capacity_in_bytes   32879017984 (~32Gi)
memory.usage_in_bytes       80342220800 (~80Gi)
memory.total_inactive_file    10048962560 (~10Gi)
memory.working_set             70293258240 (~70Gi)
memory.available_in_bytes -37414240256 (-35Gi)
memory.available_in_kb      -36537344
memory.available_in_mb     -35681
  • working set memory (70Gi) = memory.usage_in_bytes (80Gi) - memory.total_inactive_file (10Gi)
    • note: this still includes active file memory
  • available = capacity (32Gi) - working set memory (70Gi) = -38Gi < 0
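The arithmetic above can be reproduced with a small standalone script (a hypothetical sketch hard-coding the raw cgroup values from this report; it is not kubelet code, but it performs the same subtractions):

```shell
#!/bin/sh
# Sketch of the kubelet's memory.available computation, using the cgroup v1
# values from this report. On a real node these would be read from files
# under /sys/fs/cgroup/memory/.
capacity=32879017984        # memory.capacity_in_bytes  (~32Gi)
usage=80342220800           # memory.usage_in_bytes     (~80Gi, bogus: > capacity)
inactive_file=10048962560   # memory.total_inactive_file (~10Gi)

working_set=$((usage - inactive_file))   # still includes active file pages
available=$((capacity - working_set))    # goes negative with the bogus usage

echo "working_set=$working_set"
echo "available=$available"
```

Running it prints working_set=70293258240 and available=-37414240256, matching the values reported above.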

With the script, we got a negative available_in_bytes value, which is apparently a bug in the Kylin operating system: usage_in_bytes reports more memory used than the machine has.

What did you expect to happen?

No eviction when memory.usage_in_bytes is greater than memory.capacity_in_bytes, since such a reading is clearly invalid.

Until the OS bug is fixed, can kubelet stop evicting pods when usage_in_bytes is not reported correctly?

How can we reproduce it (as minimally and precisely as possible)?

Yes.

The workaround is to drop the caches, after which usage_in_bytes and the active/inactive file counters become correct again.

Anything else we need to know?

Eviction is dangerous when the condition check is based on wrong data.

Kubernetes version

$ kubectl version
1.23

1.25 as well

Cloud provider

vsphere

OS version

4.19.90-23.8.v2101.ky10.aarch64


Install tools

kubeadm

Container runtime (CRI) and version (if applicable)

docker

Related plugins (CNI, CSI, …) and versions (if applicable)

calico

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

Maybe k8s should not use usage_in_bytes because it’s a fuzz value. See in https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

5.5 usage_in_bytes

For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn’t show ‘exact’ value of memory (and swap) usage, it’s a fuzz value for efficient access. (Of course, when necessary, it’s synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).
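Following that suggestion, a more exact usage can be derived from memory.stat instead of the fuzzed counter. A hedged sketch (the cgroup v1 path and the synthetic demo file are assumptions for illustration):

```shell
#!/bin/sh
# Sum rss + cache + swap from a cgroup v1 memory.stat file, as the kernel
# documentation recommends for an exact figure, instead of reading the
# fuzzed memory.usage_in_bytes counter.
exact_usage() {
    awk '$1 == "rss" || $1 == "cache" || $1 == "swap" { sum += $2 }
         END { print sum + 0 }' "$1"
}

# Demo against a synthetic memory.stat so the sketch is self-contained;
# on a real node pass /sys/fs/cgroup/memory/memory.stat instead.
printf 'cache 13000000000\nrss 6000000000\nswap 0\n' > /tmp/memory.stat.demo
exact_usage /tmp/memory.stat.demo   # prints 19000000000
```

The exact-match on the first field deliberately skips the `total_*` hierarchical counters that also appear in memory.stat.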

Regardless of whether this was caused by active file memory, I think we should get back to the initial question of this issue: what is the expected eviction behavior if usage_bytes or working_set_bytes exceeds capacity_bytes?

I think that if usage_bytes > capacity, there is some bug in the underlying OS that exposes the memory stats. So we should not evict based on this, but instead raise errors of some kind and alert the cluster owner to the issue.

That is my two cents.