kubernetes: kubelet counts active page cache against memory.available (maybe it shouldn't?)

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): active_file inactive_file working_set WorkingSet cAdvisor memory.available


Is this a BUG REPORT or FEATURE REQUEST? (choose one): We’ll say BUG REPORT (though this is arguable)

Kubernetes version (use kubectl version): 1.5.3

Environment:

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="14.04.5 LTS, Trusty Tahr" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 14.04.5 LTS" VERSION_ID="14.04"

  • Kernel (e.g. uname -a): Linux HOSTNAME_REDACTED 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

  • Others:

What happened: A pod was evicted due to memory pressure on the node, when it appeared to me that there shouldn’t have been sufficient memory pressure to cause an eviction. Further digging seems to have revealed that active page cache is being counted against memory.available.

What you expected to happen: memory.available would not have active page cache counted against it, since it is reclaimable by the kernel. This also seems to greatly complicate a general case for configuring memory eviction policies, since in a general sense it’s effectively impossible to understand how much page cache will be active at any given time on any given node, or how long it will stay active (in relation to eviction grace periods).

How to reproduce it (as minimally and precisely as possible): Cause a node to chew up enough active page cache that the existing calculation for memory.available trips a memory eviction threshold, even though the threshold would not be tripped if the page cache, active and inactive, were freed to make room for anon memory.

Anything else we need to know: I discussed this with @derekwaynecarr in #sig-node and am opening this issue at his request (conversation starts here).

Before poking around on Slack or opening this issue, I did my best to read through the 1.5.3 release code, Kubernetes documentation, and cgroup kernel documentation to make sure I understood what was going on here. The short of it is that I believe this calculation:

memory.available := node.status.capacity[memory] - node.stats.memory.workingSet

Is using cAdvisor’s value for working set, which if I traced the code correctly, amounts to:

$cgroupfs/memory.usage_in_bytes - total_inactive_file

Where, according to my interpretation of the kernel documentation, usage_in_bytes includes all page cache:

$kernel/Documentation/cgroups/memory.txt

 
The core of the design is a counter called the res_counter. The res_counter
tracks the current memory usage and limit of the group of processes associated
with the controller.
 
...
 
2.2.1 Accounting details
 
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
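To make that concrete, here is a minimal sketch of my own (not the kubelet’s actual code) that reproduces the calculation above, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory:

#!/usr/bin/env python3
# Minimal sketch of the memory.available calculation discussed above.
# Assumes cgroup v1 with the memory controller at /sys/fs/cgroup/memory;
# paths and stat field names may differ on other setups.

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def read_memory_stat(path="/sys/fs/cgroup/memory/memory.stat"):
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

def mem_total_bytes():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024  # /proc/meminfo reports kB
    raise RuntimeError("MemTotal not found")

capacity = mem_total_bytes()
usage = read_int("/sys/fs/cgroup/memory/memory.usage_in_bytes")
stats = read_memory_stat()

# cAdvisor-style working set: usage minus inactive file pages.
working_set = usage - stats["total_inactive_file"]
memory_available = capacity - working_set

# What this issue argues for: also treating active file pages as reclaimable.
memory_available_if_active_reclaimable = memory_available + stats["total_active_file"]

print("capacity:                        ", capacity)
print("working set (usage - inactive):  ", working_set)
print("memory.available (kubelet view): ", memory_available)
print("memory.available + active_file:  ", memory_available_if_active_reclaimable)

On a node holding a lot of active page cache, the last two numbers diverge sharply, and that gap is what this issue is about.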

Ultimately my issue concerns how I can set generally applicable memory eviction thresholds if active page cache is counted against them, and there’s no way to know (1) generally how much page cache will be active across a cluster’s nodes, to use as part of general threshold calculations, or (2) how long active page cache will stay active, to use as part of eviction grace period calculations.

I understand that there are many layers here and that this is not a particularly simple problem to solve generally correctly, or even understand top to bottom. So I apologize up front if any of my conclusions are incorrect or I’m missing anything major, and I appreciate any feedback you all can provide.

As requested by @derekwaynecarr: cc @sjenning @derekwaynecarr

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 153
  • Comments: 114 (15 by maintainers)

Most upvoted comments

I’d like to share some observations, though I can’t say I have a good solution to offer yet, other than to set a memory limit equal to the memory request for any pod that makes use of the file cache.

Perhaps it’s just a matter of documenting the consequences of not having a limit set.

Or perhaps an explicit declaration of cache reservation should exist in the podspec, in lieu of assuming “inactive -> not important to reserve”.

Another possibility I’ve not explored is cgroup soft limits, and/or a more heuristic based detection of memory pressure.

Contrary Interpretations of “Inactive”

Kubernetes seems to have an implicit belief that the kernel is finding the working set and keeping it in the active LRU. Everything not in the working set goes on the inactive LRU and is reclaimable.

A quote from the documentation [emphasis added]:

The value for memory.available is derived from the cgroupfs instead of tools like free -m. This is important because free -m does not work in a container, and if users use the node allocatable feature, out of resource decisions are made local to the end user Pod part of the cgroup hierarchy as well as the root node. This script reproduces the same set of steps that the kubelet performs to calculate memory.available. The kubelet excludes inactive_file (i.e. # of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that memory is reclaimable under pressure.

Compare a comment from mm/workingset.c in the Linux kernel:

All that is known about the active list is that the pages have been accessed more than once in the past. This means that at any given time there is actually a good chance that pages on the active list are no longer in active use.

While both Kubernetes and Linux agree that the working set is in the active list, they disagree about where memory in excess of the working set goes. I’ll show that Linux actually wants to minimize the size of the inactive list, putting all extra memory in the active list, as long as there’s a process using the file cache enough for it to matter (which may not be the case if the workload on a node consists entirely of stateless web servers, for example).

The Dilemma

Anyone running an IO workload on Kubernetes must either:

  1. set a memory limit equal to or less than the memory request for any pod that utilizes the file LRU list, or
  2. accept that any IO workload will eventually exceed its memory request through normal and healthy utilization of the file page cache.

The Kernel Implementation

Note I’m not a kernel expert. These observations are based on my cursory study of the code.

When a page is first loaded, add_to_page_cache_lru() is called. Normally this adds the page to the inactive list, unless this is a “refault”. More on that later.

Subsequent accesses to a page call mark_page_accessed() within mm/swap.c. If the page was on the inactive list it’s moved to the active list, incrementing pgactivate in /proc/vmstat. Unfortunately this counter does not distinguish between the anonymous and file LRUs, but examining pgactivate in conjunction with nr_inactive_file and nr_active_file gives a clear enough picture. These same counters are available within memory.stat for cgroups as well.
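A quick way to watch this in practice is to sample those counters periodically; the following is a rough sketch (field names are from mainline kernels and may differ slightly by version, e.g. workingset_refault has been split in newer kernels, so missing fields are reported as 0):

#!/usr/bin/env python3
# Sample the LRU activation counters mentioned above from /proc/vmstat.
import time

FIELDS = ("nr_active_file", "nr_inactive_file", "pgactivate",
          "workingset_refault", "workingset_activate")

def vmstat():
    values = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in FIELDS:
                values[key] = int(value)
    return values

prev = vmstat()
while True:
    time.sleep(5)
    cur = vmstat()
    # Deltas over the interval: pgactivate shows promotions to the active list,
    # workingset_refault shows pages faulted back in after leaving the inactive list.
    print({k: cur.get(k, 0) - prev.get(k, 0) for k in FIELDS})
    prev = cur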

Accessing a page twice is all that’s required to get on the active list. If the inactive list is too big, there may not be enough room in the active list to contain the working set. If the inactive list is too small, pages may be pushed off the tail before they’ve had a chance to move to the active list, even if they are part of the working set.

mm/workingset.c deals with this balance. It forms estimates from the inactive file LRU list stats and maintains “shadow entries” for pages recently evicted from the inactive list. When add_to_page_cache_lru() is adding a page and sees a shadow entry for that page, it calls workingset_refault(), and the workingset_refault counter in /proc/vmstat is incremented. If that call returns true, the page is promoted directly to the active list and workingset_activate in /proc/vmstat is incremented. It appears this code path does not increment pgactivate.

So a page accessed twice gets added to the active list. What puts downward pressure on the active list?

During scans (normally by kswapd, but directly by an allocation if there are insufficient free pages), inactive_list_is_low() may return true. If it does, shrink_active_list() is called.

The comments to inactive_list_is_low() are insightful:

 * The inactive anon list should be small enough that the VM never has
 * to do too much work.
 *
 * The inactive file list should be small enough to leave most memory
 * to the established workingset on the scan-resistant active list,
 * but large enough to avoid thrashing the aggregate readahead window.
 *
 * Both inactive lists should also be large enough that each inactive
 * page has a chance to be referenced again before it is reclaimed.
 *
 * If that fails and refaulting is observed, the inactive list grows.
 *
 * The inactive_ratio is the target ratio of ACTIVE to INACTIVE pages
 * on this LRU, maintained by the pageout code. An inactive_ratio
 * of 3 means 3:1 or 25% of the pages are kept on the inactive list.
 *
 * total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB

So, the presence of refaults (meaning, pages are faulted, pushed off the inactive list, then faulted again) indicates the inactive list is too small, which means the active list is too big. If refaults aren’t happening then the ratio of active:inactive is capped by a formula based on the total size of inactive + active. A larger cache favors a larger active list in proportion to the inactive list.

I believe (though I’ve not confirmed with experiment) that the presence of a large number of refaults could also mean there simply isn’t enough memory available to contain the working set. The refaults will cause the inactive list to grow and the active list to shrink, causing Kubernetes to think there is less memory pressure, the opposite of reality!

Please do not use this issue as a catch-all for every kind of memory problem that might be encountered while running Kubernetes. While some kernels have bugs in cgroup accounting, this issue is not about that. Read the issue description carefully:

What happened:

A pod was evicted due to memory pressure on the node, when it appeared to me that there shouldn’t have been sufficient memory pressure to cause an eviction. Further digging seems to have revealed that active page cache is being counted against memory.available.

Take note: A pod was evicted due to memory pressure on the node. If you are experiencing OOM kills, where the kernel kills the process and “oom-killer” appears in the system log, you are not experiencing this issue. If you are unable to create docker containers, you are not experiencing this issue. If setting the memory limit equal to the request did not work for you, then the problem is not that the workaround failed, the problem is you are experiencing a different issue.

A pod eviction due to memory pressure on the node is not an OOM kill. The kernel will invoke the OOM killer when it is unable to service a page fault, due to insufficient physical RAM on the host or limits imposed on a particular cgroup. Kubernetes does not OOM kill.

A pod eviction is a decision made by Kubernetes. It depends on cluster configuration. “oom-killer” will not appear in the system logs, and the node condition may show MemoryPressure if you manage to catch it at the right time.

This issue is about that eviction logic in Kubernetes. It is not about the kernel. Even in the absence of any kernel bugs, a process can (will, if it does file IO) utilize page cache beyond its memory request, and up to but not beyond its memory limit, and then kubelet may decide the node is experiencing “memory pressure” and evict pods, even though the RAM usage is predominantly cache and there is no actual memory pressure. It is, in my evaluation, a design bug in kubernetes due to a misunderstanding of what “active” cache means. It cannot be fixed by upgrading the kernel or changing kernel parameters.

If you are experiencing OOM kills or a kernel bug, please don’t post about it here, as that only detracts from the actual issue.

I ran into the same issue today. On a node with 32GB of memory, 16+GB is cached. When the memory used + cache exceeded 29GB (~90% of 32GB), the kubelet tried to evict all the pods, which shouldn’t have happened since the node still had close to 50% of its memory available, albeit in cache. Is there a fix to this issue?

Also having this problem. We had 53GB of available memory and 0.5GB free. 52.5GB is in buff/cache and it starts trying to kill pods due to SystemOOM.

I don’t think you understand how cache works. Processes don’t control how much cache they use. A process that uses a lot of cache doesn’t deserve to be “penalized”. It’s done nothing wrong. It’s simply using that cache because it’s done some file IO.

It’s more efficient to use all available RAM for cache. Unused RAM does nothing but waste money, and it takes work to figure out what pages can be evicted from the cache. Why would you spend CPU time to evict pages from cache, when there’s nothing better to do with that RAM, and when you might be evicting something that could avoid a disk read in the future?

The problem is Kubernetes treats the perfectly normal and acceptable condition of using nearly all excess physical RAM as disk cache as an operational emergency, and starts evicting pods to “fix” it. So if you don’t want Kubernetes terminating your database every time it’s warmed up the disk cache, you must set a memory limit equal to the request, avoiding the situation entirely.

Would you rather:

  1. Make a determination about how much RAM a process requires to operate, knowing that the only way the process gets terminated is if your determination was wrong, or
  2. Have your process periodically, arbitrarily, and unpredictably terminated for operating normally, even though it was technically possible to continue running the process with no ill effect, and neither the developer nor the system administrator can do anything about it?

If you choose the former, you must set a memory limit equal to the request. Not setting a memory limit, or setting it above the request, chooses the latter.

I have been waiting for 4 years!!!!! any news??

Kubernetes resource limits are implemented with cgroups. It is intended for cgroups to limit all RAM, including cache. If you want to use available RAM as cache, you must set a bigger limit, or no limit.

Ostensibly, if Cassandra could run fine on 10 GB but it could be faster with up to 100 GB if it happened to be available, then you could set the RAM request to 10 GB and the limit to 100 GB.

The problem, and the topic of this bug, is if you then run a process which uses a lot of cache, kubernetes will then decide the node is experiencing “memory pressure” and evict pods, even though the kernel could evict that cache if there was any actual memory pressure. This is because the kubernetes logic (“active” pages probably aren’t reclaimable) is not in line with the reality in the kernel logic (an “active” page is anything that’s been used twice).

I think the intent is to avoid a less graceful OOM kill, but the effect in practice is that the memory limit must always be set, and equal to the memory request. Otherwise pods keep getting evicted, usually the one that’s using the cache, since it’s usually the one most over its memory request. Even if the application can tolerate the disruption, restarting the process on another node every time it has warmed up its cache is rather pointless.

Nope!!! But thanks for pinging 54 people who are subscribed to this issue???

How can this still be open? Shouldn’t there be a way to reliably detect the available memory? I mean, 5 years is a long time; someone might have invented subtraction and total and used memory calculation in the meantime. 😉

Fair enough. This consideration kind of suggests that a new type of requests/limits has to be added: page cache. While it still cannot be counted as the actual process memory usage (RSS + swap), it cannot be totally ignored, either, since having a certain amount of page cache available (possible even dedicated) may be important for some types of workloads.

The remaining problem here appears to be the vm.dirty_bytes configuration in the kernel. It isn’t the read cache that is problematic (any more); it’s the writeback cache.

Memory is allocated into the writeback cache pool and attributed to a cgroup. However, that cache is only flushed once the system-wide limits are exceeded: by default, asynchronous writeout begins at 10% of total system RAM, and write operations start blocking at 20%. All in the expectation that disk access costs can be reduced by write-combining, so it does make at least some sense.

In 2009 there were attempts to bring per-cgroup dirty limits into the kernel, but the discussion around the patch sets derailed into treating it as a performance (non-)optimization and the patch set was discarded.

Hard workaround: Explicitly set vm.dirty_bytes to a significantly lower value, such that vm.dirty_bytes + working set < per cgroup limit.

You should then also decrement vm.dirty_background_bytes to 1/2 or 1/4 of vm.dirty_bytes, and decrement vm.dirty_writeback_centisecs significantly to ensure that vm.dirty_background_bytes is checked often enough, and async writeback remains common case despite smaller buffers.

Worst case, each cgroup can allocate all of vm.dirty_bytes during a burst write, on top of the regular working set. It’s essential that the vm.dirty_bytes quota fit within the cgroup quota, or writeback will never be throttled, resulting in OOM before stall.

Be careful if you are relying on mmap() syscalls anywhere though; they are also allocated from the vm.dirty_bytes quota. If you are in that unlucky situation, the only option left is to perform file writes from any bursty context while explicitly bypassing or flushing the writeback cache (using O_DIRECT, O_DSYNC, or spamming fsync(), etc.).
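As an illustration of that workaround (the numbers below are placeholders, not recommendations, and must be sized so that vm.dirty_bytes fits well inside the smallest per-cgroup memory limit on the node), the settings can be applied by writing under /proc/sys, equivalent to sysctl -w:

#!/usr/bin/env python3
# Illustration of the vm.dirty_* workaround described above. Run as root.
# The values are placeholders and must be tuned per node; in particular,
# vm.dirty_bytes has to fit comfortably inside the per-cgroup memory limits.

SETTINGS = {
    "vm.dirty_bytes": 256 * 1024 * 1024,             # hard writeback threshold
    "vm.dirty_background_bytes": 64 * 1024 * 1024,   # ~1/4 of dirty_bytes: start async writeback early
    "vm.dirty_writeback_centisecs": 100,             # wake the flusher threads every second
}

def apply_sysctls(settings):
    for key, value in settings.items():
        path = "/proc/sys/" + key.replace(".", "/")
        with open(path, "w") as f:
            f.write(str(value))
        print(f"{key} = {value}")

if __name__ == "__main__":
    apply_sysctls(SETTINGS)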

The root cause is a design fault in the dirty page tracking in cgroups v2. It makes perfect sense to limit dirty pages locked in memory via explicit mmap() and the like (as they can’t be freed), but it makes no sense to maintain the attribution after the process has already released the memory. Pressuring the kernel into prioritizing writeback got it all backwards.

It seems that the kernel bug which causes this error is finally fixed now, and will be released in kernel-3.10.0-1075.el7, which is due in RHEL 7.8, but goodness knows when that will be, as RHEL 7.7 only came out on August 6th, ~3 weeks ago.

https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101

We have replicated this issue running on AKS, where “buff/cache” keeps increasing indefinitely up to the point where the pod is evicted.

kubectl version:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"windows/amd64"}

Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-16T20:43:34Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

[image: pod memory usage chart showing buff/cache growing until eviction]

In the image above, the memory used/allocated by the running dotnet program remains quite constant, but “buff/cache” eventually consumes the entire memory, causing the eviction.

This should not happen since buff/cache is reclaimable memory:

“buff/cache – is the combined memory used by the kernel buffers and page cache and slabs. This memory can be reclaimed at any time if needed by the applications.”

1. Don't read/write anything on disk. Only use RAM or a database (mysql, memcached, redis, etc.)

Alternatively, use (posix_)fadvise and (POSIX_)FADV_DONTNEED to push unneeded data out of the page cache proactively.
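For example, a minimal sketch of that approach (fadvise is advisory and best-effort; dirty pages have to be written back before they can be dropped):

#!/usr/bin/env python3
# Sketch: drop a file's pages from the page cache once we are done with it,
# using posix_fadvise(POSIX_FADV_DONTNEED) as suggested above.
import os
import sys

def drop_file_cache(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        # Make sure dirty pages are written back first, otherwise the kernel
        # keeps them; then advise dropping the whole file (length 0 = to EOF).
        os.fsync(fd)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        drop_file_cache(path)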

In general, due to how memory limits on cgroups work (at least when used through resource limits on Pods in K8s), with kmem (and, as such, block caches) being accounted, I think it’s advisable not to set memory limits on stateful workloads using some kind of local volume/FS.

The page cache being included in the memory accounting of a Pod makes a lot of sense in a way (it’s something to take into account when sizing systems anyway…), but at the same time most applications not being equipped to ‘manage’/limit the amount of cache they occupy (unlike e.g., memory) and getting OOM-killed as a result is troublesome. If the kernel would prioritize flushing the page cache occupied by data from processes in cgroups that are about to hit their memory limit, and maybe use uncached IO once such limit is reached (i.e., not let a Pod occupy more memory than allocated to it, i.e., it’s sized for, but also not halt servicing IO or OOMkill the process), that could be nice… This slower IO should then be observable through (application) metrics, at which point sizing/limits can be adjusted.

At this point in my research I’m wondering if drop_caches releases active page cache because it actually first moves pages to the inactive_list, then evicts from the inactive_list. And if something like that is happening, then maybe it’s not possible to determine what from the active_list could be dropped without iterating over it, which is not something cAdvisor or kubelet would do. I guess I was hoping there’d be some stats exposed somewhere that could be used as a heuristic to determine, with some reasonable approximation, what could be dropped without having to do anything else, but maybe that just doesn’t exist. But if that’s the case, then I wonder how it’s possible to use memory eviction policies effectively.

At this point I’m sufficiently dizzy from reading various source code and documents, and I’m just going to shut up now.

I forget where I read it but I heard that cgroups v2 was supposed to fix this. I think all you need to do is use a linux distro with cgroup v2 + update to kube 1.25 which makes kube cgroup v2 aware.

https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/#:~:text=cgroup v2 provides a unified,has graduated to general availability.

It might be worth testing if this is still an issue in the latest version.

It’s quite obvious that cache memory is reclaimable memory, which can be freed nearly instantly if a process calls a new malloc(), and therefore it should not be counted in the pod memory usage. Otherwise it leads to wrong hardware allocation decisions, gives very confusing numbers in kubectl top pod, etc.

For anything but performance tuning, the memory used by page cache must be considered the same as free memory.

In my particular case, I have a pod that uses 30 MB of RSS memory at most (there’s no swap), but because of intensive disk operations it uses a significant amount of page cache memory, and so kubectl top pod reports its usage at 3+ GB, which is clearly not the actual amount of memory the pod requires to run (cache is welcome, but not required). This also affects the kubectl top node metrics, which apparently also affects autoscaling decisions, which in turn leads to wasted money running nodes that otherwise would not be required.

This is not expected behavior. The OS caching memory has been around for a long time. Any app looking at memory usage should account for the cached memory. Using nocache is not an ideal solution either. Is there any way we can bump up the severity/need on this issue? We’re planning to go into production soon but can’t without this issue getting fixed.

https://github.com/linchpiner/cgroup-memory-manager I am using this for workaround, for now, it works ok

Interesting…the nodes I executed the above tests on were GKE COS nodes

$ cat /etc/os-release
BUILD_ID=10323.12.0
NAME="Container-Optimized OS"
KERNEL_COMMIT_ID=2d7de0bde20ae17f934c2a2e44cb24b6a1471dec
GOOGLE_CRASH_ID=Lakitu
VERSION_ID=65
BUG_REPORT_URL=https://crbug.com/new
PRETTY_NAME="Container-Optimized OS from Google"
VERSION=65
GOOGLE_METRICS_PRODUCT_ID=26
HOME_URL="https://cloud.google.com/compute/docs/containers/vm-image/"
ID=cos
$ uname -a
Linux gke-yolo-default-pool-a42e49fb-1b0m 4.4.111+ #1 SMP Thu Feb 1 22:06:37 PST 2018 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

Swapping the node out for an Ubuntu node seems to correct the memory usage. The pod never uses more than 600 MiB of RAM according to kubectl top pod.

This is looking more and more like some kind of memory leak or misaccounting that’s present in the 4.4 series kernel used in COS nodes but not in the 4.13 series used by Ubuntu nodes.

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
$ uname -a 
Linux gke-yolo-pool-1-cb926a0e-51cf 4.13.0-1008-gcp #11-Ubuntu SMP Thu Jan 25 11:08:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

/remove-lifecycle rotten

It seems like we’re affected by this problem as well. With tightly packed containers, long-running jobs involving heavy disk I/O sporadically fail.

take this example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: democlaim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd
  resources:
    requests:
      storage: 1.2Ti
---
apiVersion: batch/v1
kind: Job
metadata:
  name: demo
spec:
  template:
    spec:
      containers:
      - name: demo
        image: ubuntu
        command: ["bash",  "-c", "apt-get update ; apt-get install -y wget ; wget -O /data/zstd.deb https://packages.shopify.io/shopify/public/packages/ubuntu/xenial/zstd_1.2.0-0shopify_amd64.deb/download ; wget -O /data/libzstd.deb https://packages.shopify.io/shopify/public/packages/ubuntu/xenial/libzstd1_1.2.0-0shopify_amd64.deb/download ; dpkg -i /data/libzstd.deb /data/zstd.deb ; echo 'KLUv/QRQtR8ApnmvQMAaBwCp6S2VEGAQoIMR3DIbNd4HvrRTZ9cQVwgYX19vUMlci2xnmLLgkNZaGsZmRkAEFuSmnbH8UpgxwUmkdx6yAJoAhwDu8W4cEEiofKDDBIa1pguh/vv4eVH7f7qHvH1N93OmnQ312X+6h8rb+nS0n/eh6s+rP5MZwQUC7cOaJEJuelbbWzpqfZ6advxPlOv6Ha8/D2jCPwQceFDCqIIDoAAASmhMkDoVisCA6fmpJd0HKRY7+s/P0QkkGjVYP2dNCGq1WHe1XK2WqxUkwdVCGetBQRRYLBbrNFTEjlTgMLEiZmLIRYgWT9MzTQ+Uo2AUoWhAWFQB7iFvo6YSZNHNSY5U9n92D5W3/6d7P2+jZv8DWFs0oHjNZLU27B4qb/9P93gfavaf7iEWSYETxQOO2GqrJfH2Nd1b3X66d14DQbo8veCxY7W1GR1/uP2ne8jb16d7ADEH3qhAALGBYQPxeek4lUJjBMlpJuuhC/H8R9Ltp3vnRc2n/+6hEm2jyhSYMUT1hcBqq935072ft1GfHajbf7qH523U59PP7qESLQIiSTJcec1k7eF/uvfzNuqz/7N7qLz9P1FkmMw4ovLCtNo6wDtvoz77T/dQeVuf7v3sIIR5qM/+0z1U3r4+He28D/X9p3t43tb8zw91++fBcKaIhddM1k3uIW9fn+7sHur2/3QPk8UFlROB1dYxtlHT/Wf3ULf/Z1ZoNs6IyBb85CSerVSKFAx41MB6w6/0P91D3r4+3Xkf6vbvkqHlgQOYnNdMVhOS7p23Nf+ze6jbf7p3BotKKQwMvPuxqwxtQZGGqcNqax0+2vGxUwUQqBInORbAsCc4/utsIMHIjtdMVpvD+red51xycsNhyg2m1VZb0r/NqyyLUo8lW8t5/jf62eehbv9P9/C8jfp0/9kZiGP0JkILoBIw7KSCLKhikCMAAQHAgoA+Yk8AAjQQEoAACAxBQFcEQGCAcCuCACAAiAYIqDACVyuMeO/lZnP49YuJifET/DqMhFOzZUJDc6W5kGD1OGhhORIUxs/EoaGhcmhoLdYShiQhNGm//E0IDUWEMnroe0JoaDQcaLcMp63yfIuKck/X8QoCDbQRklMBggIErDt1qfySehKEwet2c/0/MLRMEH5ZAxq+RlpgiN8BMOMwt+HwGvF3W2aM0KjIUT/Em+cFyAEQMGUIEjCG7YLmcKmhA6ySpQ7QIJao+Tr/Ygp+MGmXtAyBBdHa63eY+W9lcdCVFioqTUB7WITH0ZAfgx5TXMzXgcmge1Iy3CK3WCk0xRLDTbllx2Ar9yhMpUkwoEDYJnasQZrXT/4JjLxAaWX9iX77a1KsfrFu5j8fRZmwDg==' | base64 -d > /data/1TiB_of_zeroes.tar.zst.zst.zst ; echo -n 'Unpacking ' ; zstd -d -T4 < /data/1TiB_of_zeroes.tar.zst.zst.zst | zstd -d -T4 | zstd -d -T4 | tar -C /data -xvf - && echo successful. || echo failed."]
        securityContext:
          privileged: false
        volumeMounts:
          - name: data
            mountPath: /data
            readOnly: false
        resources:
          requests:
            cpu: "4"
            memory: "1Gi"
          limits:
            cpu: "4"
            memory: "1Gi"
      restartPolicy: Never
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: democlaim
  backoffLimit: 0

Test this with:

kubectl create namespace tarsplosion
kubectl --namespace tarsplosion create -f ./demo.yml
kubectl --namespace tarsplosion logs job/demo --follow

the latter command might take a moment to become available.

The job tries to unpack 1TiB of zeroes (triple-compressed with zstd) - and apparently fails because of memory exhaustion by buffers filled by tar.

There seems to be the problem like https://serverfault.com/questions/704443/tar-uses-too-much-memory-for-its-buffer-workaround

  • the job only fails sometimes, but then in a nasty fashion.

The zstd used is a vanilla 1.2.0 packaged for xenial - previous versions are not multithreaded and have a slightly different file format.

I also have the same issue in my k8s 1.15 cluster!!! any news regarding the issue???

The other option to consider here is to fast-track migrating to cgroup v2 and leverage the per-cgroup memory.pressure metric (https://www.kernel.org/doc/html/latest/accounting/psi.html). Facebook has written about their usage and a userspace oomd killer here: https://facebookmicrosites.github.io/cgroup2/docs/memory-strategies.html
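For reference, a rough sketch of reading PSI, assuming a kernel built with PSI support; /proc/pressure/memory (system-wide) and a cgroup v2 memory.pressure file share the same format:

#!/usr/bin/env python3
# Parse memory pressure stall information (PSI). Lines look like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=6789

def read_memory_psi(path="/proc/pressure/memory"):
    psi = {}
    with open(path) as f:
        for line in f:
            kind, rest = line.split(None, 1)  # "some" or "full"
            psi[kind] = {k: float(v) for k, v in
                         (field.split("=") for field in rest.split())}
    return psi

if __name__ == "__main__":
    psi = read_memory_psi()
    # avg10 is the percentage of the last 10 seconds during which at least one
    # task ("some") or all tasks ("full") stalled waiting for memory.
    print("some avg10:", psi["some"]["avg10"])
    print("full avg10:", psi.get("full", {}).get("avg10"))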

It’s not caused by the kernel at all. The kernel is correctly reporting that RAM has been used for cache. The kernel, as always, will reclaim cache when RAM is needed for other purposes.

This is a basic misunderstanding of how RAM management works, pretty common among novice system administrators:

The issue (I’d call it a design bug) is that kubernetes makes decisions to evict pods based on this misunderstanding.

/remove-lifecycle stale

As part of other investigations we’ve been recommended to use https://github.com/Feh/nocache to wrap the corresponding calls, which helped a fairly big amount 😃

Hi folks – I was suffering from this problem too, and I think I have a workaround. Main problem is I didn’t really know the nuts and bolts of cgroups, and so k8s’ behavior was super non-intuitive to me in this case. The answer came from something @bitglue said on this thread way back in 2018: “Perhaps it’s just a matter of documenting the consequences of not having a limit set.”

TL;DR: Set a memory LIMIT on your pod, not just a request, and the kernel will flush the page cache so as to keep your pod under the limit and save it from being evicted by k8s due to too much active memory usage.

Here’s a snippet from my Argo workflow (basically a Pod Spec if you’re not familiar with Argo):

  - name: qc
    container:
      image: gcr.io/natera-pfrm-4713/seqtool
      imagePullPolicy: Always
      command: [python, -c]
      resources:
        requests:
            memory: 10Gi
            cpu: 1
            ephemeral-storage: 100Gi
        limits:                <--- Added this line
            memory: 10Gi       <--- Added this line 

   etc.

More detail if it’s interesting:

My pods: Batch jobs that pull a ~30GB file down (it’s DNA sequencing data), do some work (utilizing not more than a couple of GB of malloc’d RAM at any point), then upload some modestly sized results.

My cluster: Nodes with ~30GB of RAM is the salient detail. Coincidentally, that’s about the size of my data files, so loading them entirely into page cache will max out the RAM on the machine.

@sjenning, @derekwaynecarr: has there been any kind of discussion about this issue within the k8s team? Are there any next steps for this issue?

I would like to propose that a change is made in kubelet to allow different metrics to be used when evaluating pod memory usage. That would allow people to opt into more “aggressive” memory metrics (like referenced_memory, introduced in cadvisor in https://github.com/google/cadvisor/pull/2495) to fix these bogus MemoryPressure events, while retaining the standard behavior for anyone who isn’t hitting this. Does that sound like a reasonable way forward?

I’m seeing a variant of this as well.

Scenario:

  • every 5 seconds, a readinessProbe invokes curl
  • curl calls access() a few thousand times, increasing dentry usage by ~5MB each time (as reported by /proc/slabinfo)
  • This memory usage is filesystem cache, but is effectively billed/associated with the container itself.
  • Eventually the “memory usage” grows too large and pods are evicted as the kubelet believes the node is running out of memory.

This has a history of reports on CentOS 7 / Kernel 3.10. However, I am seeing it on a brand new GKE instance (Google Container OS 69, kernel 4.14.127+).

Here’s a pod definition that grows dentry kmem quickly by simulating what curl’s thousands of access() calls would do (create negative cache entries for files that don’t exist):

apiVersion: v1
kind: Pod
metadata:
  name: looper2
spec:
  containers:
  - name: example
    image: centos:7
    resources:
      requests:
        memory: 20Gi
      limits:
        memory: 20Gi
    command:
      - /bin/sh
      - -c
      - |
        while true; do
          # all of these cats will fail, we just want to try accessing files that don't exist for this experiment.
          cat /tmp/{$RANDOM,$RANDOM,$RANDOM}.{$RANDOM,$RANDOM,$RANDOM}.{$RANDOM,$RANDOM,$RANDOM,$RANDOM,$RANDOM}{a,b,c,d,e,f}{a,b,c}
        done

What does this look like in practice? Here’s what Google’s pod “Memory usage” chart shows for this pod after about 25 minutes:

[image: GKE “Memory usage” chart for the test pod after ~25 minutes]

In my test pod above, it’s worth noting that once the memory “usage” (cache included) hits the limit something appears to garbage collect and the usage drops back down. However, in production (on pods affected by this issue), the memory usage never drops back down even after hitting the limit, with dentry kmem occupying almost all of the memory accounting for the pod.

That said, I believe that filesystem cache should not count against a pod. What can we do to resolve this?

Hi @gcoscarelli , Did you see this issue? The issue happens on 3.10 kernel so it’s quite similar. https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-469662772

Hi, we are experiencing a similar issue. We have a Java-based service running in pods that have a 3GB QoS memory setting. The service is disk intensive (it keeps git repos), so eventually we hit this limit. We first thought that Java was being OOM’ed, but after some tests we realized that even with a plain pod (no Java) doing file creations (via the dd command) we still reached that limit; kernel memory went up and up according to /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes. The other thing we noticed is that even though the container (not the pod) is recreated after the OOM, the pod enters a crash loop as if its memory was not being cleared between restarts. Only deleting the pod made things go back to normal (until the next crash, of course). So my questions are:

  1. Do you think this is a bug between kernel and kubernetes (output from uname is 3.10.0-862.14.4.el7.x86_64) ?
  2. Is there a way to restart pods instead of restarting container when pod gets OOMKilled automatically?
  3. Will increasing the pod memory limit only delay the crash, or can it eventually generate enough memory pressure on the node to clear the cache of opened files and let pod memory go down? Any help is appreciated.

I have found MongoDB workloads seem to also consistently recreate this problem. I am having to give pods +50% memory (around 50% of the dataset size) to prevent them from getting evicted even when the node still has 40% of its system memory in a reclaimable state. As such I’m migrating MongoDB out of Kubernetes.

Edit: As an interim solution, wouldn’t it be possible to make it configurable whether kubelet counts this cache toward available memory?

Edit 2: Systemd handles this by providing both a LimitRSS and MemoryLimit which seems like a good solution as it gives enough configurable options to the user to handle workloads like ones that Kubernetes is currently evicting.

Solution Edit: It did not dawn on me that the Linux kernel would be aware of the memory limit inside of the container and only fill up page cache to that point. I just set the memory limit and request of the MongoDB pod to the (node’s memory available - 3GB).

We’ve noticed that prometheus with thanos sidecars increases active_file in the prometheus container due to multiple processes reading the same file (prometheus and thanos-sidecar). active_file gets as big as the amount of prometheus data files you have on disk, as that’s what thanos is reading in full and shipping to S3.

All the containers in this scenario have a generous guaranteed memory limit.

With large enough prometheus data retention / metric count and limited enough memory, container_memory_working_set_bytes hovers between 90% and 95% of the container memory limit, but the kernel will start shrinking active_file once it needs more memory, either for other files or because prometheus needs more memory (RSS).

This makes it difficult to alert and know when prometheus instances with thanos sidecars are running out of memory and are close to OOM. I’m sure this applies to other heavy data workloads such as a mysql database with a backup sidecar.

In short, container_memory_working_set_bytes isn’t always accurately describing when a container is nearing OOM from the kernel. In most cases it does, but not always.

It would be great if container_memory_working_set_bytes or a different metric excluded active_file and any other forms of evictable memory so we could accurately alert on close to OOM’ing containers consistently.

I forget where I read it but I heard that cgroups v2 was supposed to fix this. I think all you need to do is use a linux distro with cgroup v2 + update to kube 1.25 which makes kube cgroup v2 aware.

https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/#:~:text=cgroup v2 provides a unified,has graduated to general availability.

It might be worth testing if this is still an issue in the latest version.

Has anyone been able to confirm this?

/kind feature /lifecycle frozen 🏂 there isn’t a clear path forward, so we need somebody to adopt this and drive it 🤔 nobody in sig node is currently working on this.

I am also experiencing this problem. I will try the workaround of setting the memory limit to the memory request, but I would like to know if there are any plans to update how Kubernetes calculates memory used (specifically, it should not count buffer cache when determining the OOM trigger).

The workaround of setting request equal to limit did not work for us. The only thing we can reliably use to get around this problem is ensuring we are running a RHEL kernel version of kernel-3.10.0-1075.el7 or greater, in combination with the kernel setting --args=cgroup.memory=nokmem.

https://github.com/docker/for-linux/issues/841

sync && echo 1 > /proc/sys/vm/drop_caches or clear the disk data

After working with Google support about this issue impacting GKE, we have an issue filed here: https://issuetracker.google.com/issues/140577001

Ultimately, I was able to work around my immediate problem by identifying curl triggering this problem, but the underlying cause is still present: The kernel and/or Kubernetes are incorrectly attributing memory “usage” in the filesystem cache to containers/pods and using that attribution to determine when a pod should be evicted.

Some small workarounds:

  • Set a special env var for curl: NSS_SDB_USE_CACHE=no. This causes curl to avoid creating massive amounts of dentry per invocation
  • Set memory limit to be the same value as memory request. This helps the kernel purge the file system cache when the pod hits this memory limit (where “hits” includes counting file system / dentry cache).

I don’t know where the bug (or bugs) is, but I’m likely at the extent to which I can contribute to identifying the symptoms and areas affected. With this workaround, I’m able to spend less time on this problem.

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle rotten /remove-lifecycle stale

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

Here’s an idea: cgroups has a concept of a memory “soft limit”. The kernel docs describe them as:

  1. Soft limits

Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided

a. There is no memory contention
b. They do not exceed their hard limit

When the system detects memory contention or low memory, control groups are pushed back to their soft limits. If the soft limit of each control group is very high, they are pushed back as much as possible to make sure that one control group does not starve the others of memory.

Please note that soft limits is a best-effort feature; it comes with no guarantees, but it does its best to make sure that when memory is heavily contended for, memory is allocated based on the soft limit hints/setup. Currently soft limit based reclaim is set up such that it gets invoked from balance_pgdat (kswapd).

This sounds pretty much identical to the kubelet concept of “memory pressure”, right?

I haven’t read the code or tried to use soft limits, but AFAIK kubernetes makes no use of this feature. Being in the kernel, cgroups is much better positioned to actually know what can and cannot be reclaimed. Also, it can actually reclaim RAM, whereas kubelet can only terminate the entire process.

So assuming soft limits work (again, I haven’t tested it at all), I would propose deleting all the memory pressure functionality in kubelet, and instead setting the memory request as the cgroup soft limit.
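To illustrate the idea (purely a sketch of the proposal, not something kubelet does today; the pod cgroup path and request value below are hypothetical placeholders that depend on the cgroup driver and QoS class), the cgroup v1 knob involved would be memory.soft_limit_in_bytes:

#!/usr/bin/env python3
# Sketch of the proposal above: write a pod's memory request into its cgroup v1
# soft limit and let the kernel decide what to reclaim under contention.
# The cgroup path and request value here are hypothetical placeholders.

POD_CGROUP = "/sys/fs/cgroup/memory/kubepods/burstable/pod<uid>"  # hypothetical path
MEMORY_REQUEST_BYTES = 10 * 1024**3                               # e.g. a 10Gi request

def set_soft_limit(cgroup_path, limit_bytes):
    with open(cgroup_path + "/memory.soft_limit_in_bytes", "w") as f:
        f.write(str(limit_bytes))

if __name__ == "__main__":
    set_soft_limit(POD_CGROUP, MEMORY_REQUEST_BYTES)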

@jordansissel I am having the same problem with an Apache Spark container running on K8s, using Red Hat Linux 7.6. The container memory grows over time and it eventually dies with an OOM. I tried your solution of setting the environment variable and it doesn’t seem to be working.

We don’t currently have a reliable reproducer for this, but we often hit this when restoring large PostgreSQL backups with pg_basebackup. A particularly horrible but effective hack to help the backup restore process complete is to exec into the pod and sync; echo 1 > drop_caches repeatedly as suggested above (it also helps to sigstop/cont the backup process while flushing the cache).

Is there a good way to fix this without a change to the kernel’s implementation of cgroups, though? Should this perhaps rather be a kernel bug?