runtime: GC does not release memory easily on kubernetes cluster in workstation mode

Description

  • The application is a simple API for uploading/downloading (large) files to/from blob storage
  • Deployed on a Kubernetes cluster
  • GC runs in workstation mode, as we need memory to be released for monitoring, scaling, etc.
  • Locally in Visual Studio this works just fine
    • Memory is released directly or a few minutes after a 1 GB download
    • It is always a generation 2 collection; smaller generations do not release the memory
  • However, on the cluster memory can stay up for hours: [memory usage screenshot]

Configuration

  • .NET Core 3.1 app
  • Running in a Docker container with a 2Gi memory limit
  • AKS cluster, machines with 4 cores and 16 GB RAM
  • htop inside the container: [htop screenshot]

Other information

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 71 (32 by maintainers)

Most upvoted comments

We are still struggling with high memory usage, only in Linux k8s environments. I would appreciate it if this issue is not closed until the problem is solved.

@brendandburns not sure I understand this correctly:

@Kiechlus can you clarify what problems you are seeing with horizontal pod scheduling? The scheduler should only be considering sum(requested resources), not sum(limit resources) when scheduling.

The issue we have with the Kubernetes HPA is that if we are not freeing memory, the HPA will try to keep the number of our pods at the maximum level and will never scale them down.

We have a very bursty workload where memory requirements spike up for pods for short periods of time and then are idle or lower most of the time.

I am working on an API proposal that might help with the situation. If you know you are going into idle mode, you can call that API and have the GC release the memory for you.
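
That proposed API is not available in .NET Core 3.1 (the version used in this issue). A rough approximation using only APIs that did exist at the time is a forced, blocking, compacting full collection, LOH included; a minimal, purely illustrative sketch:

using System;
using System.Runtime;

public static class IdleMemoryRelease
{
    // Approximation only: force a blocking, compacting gen 2 collection,
    // including the large object heap, so freed segments can be returned to the OS.
    public static void ReleaseNow()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
        GC.WaitForPendingFinalizers();
        GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
    }
}

Even after such a collection the OS-visible working set may not drop immediately, which is part of what this issue is about.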

We’re also seeing similar issues with Linux pods holding on to memory, but haven’t had a chance to dig in very much. We have a very bursty workload where memory requirements spike up for pods for short periods of time and are idle or lower most of the time. We have the k8s memory request set to a normal usage ceiling and the limit as the upper boundary for spikes. The pods hold on to 200-300% of their request memory after the spikes, even though when running locally and profiling in VS the GC reduces this by ~90% after the peak (e.g. a k8s pod sits at ~800 MB when idle, locally <200 MB idle). This could be some sort of Linux-specific memory leak (we’re having issues getting dumps from k8s), but subsequent workloads don’t increase peak memory use, so it would have to be some sort of startup leak.

I’ve played with some of the GC env vars; they don’t seem to have much effect, and their documentation says there could be performance impacts. Ideally we wouldn’t want any perf hits while under CPU load; I just want memory to return to baseline after the peak like it does locally, so that Kubernetes can make smart decisions regarding pod allocations etc.

MALLOC_ARENA_MAX is just half of the story.

The general issue with glibc is that it does not return memory to the OS unless an arena is completely and contiguously empty. In other words, it does not punch holes in the allocated chunk automatically and requires manual calls to the malloc_trim() function, supposedly to prevent memory fragmentation.

Example C code and a description of the issue are available here, for example: https://stackoverflow.com/questions/38644578/understanding-glibc-malloc-trimming

This is a very common issue for many applications which may have huge peak memory consumption but low regular memory usage. It has hit me in squid, pm2, and node.js.

The simplest solution is to use jemalloc, an alternative heap allocator. It’s usually as easy as LD_PRELOAD. Or use a distro which doesn’t use glibc, such as Alpine with musl libc.
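
Another workaround sometimes used is to call glibc's malloc_trim from managed code via P/Invoke instead of swapping the allocator. A hypothetical sketch (the wrapper names are made up; malloc_trim only exists in glibc, so this is a no-op on musl/Alpine):

using System;
using System.Runtime.InteropServices;

internal static class GlibcTrim
{
    // int malloc_trim(size_t pad): asks glibc to return free arena memory to the OS.
    // Returns 1 if memory was released, 0 otherwise.
    [DllImport("libc.so.6", EntryPoint = "malloc_trim")]
    private static extern int MallocTrim(UIntPtr pad);

    public static void TryTrim()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Linux)) return;
        try
        {
            MallocTrim(UIntPtr.Zero);
        }
        catch (DllNotFoundException) { /* not glibc (e.g. Alpine/musl): nothing to trim */ }
        catch (EntryPointNotFoundException) { /* libc without malloc_trim */ }
    }
}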

I never see pod memory going down either; it simply keeps increasing. My node’s memory working set percentage is somewhere around 120%.

Monitoring the GC needs to be easier. Can we not simply log something like: “GC ran and successfully freed xxx bytes of memory”?
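
There is no built-in log line exactly like that, but something close can be stitched together from the public GC API (GC.CollectionCount, GC.GetTotalMemory, GC.GetGCMemoryInfo, all available since .NET Core 3.0). A rough sketch, names illustrative:

using System;
using System.Threading.Tasks;

public static class GcActivityLogger
{
    // Periodically log whether a gen 2 collection happened and how the managed heap changed.
    public static async Task RunAsync()
    {
        int lastGen2 = GC.CollectionCount(2);
        long lastHeap = GC.GetTotalMemory(forceFullCollection: false);

        while (true) // runs for the lifetime of the process
        {
            await Task.Delay(TimeSpan.FromSeconds(30));

            int gen2 = GC.CollectionCount(2);
            long heap = GC.GetTotalMemory(false);
            var info = GC.GetGCMemoryInfo();

            if (gen2 > lastGen2)
            {
                Console.WriteLine(
                    $"Gen 2 GC ran {gen2 - lastGen2} time(s); managed heap {lastHeap:N0} -> {heap:N0} bytes, " +
                    $"heap size {info.HeapSizeBytes:N0}, fragmented {info.FragmentedBytes:N0}");
            }

            lastGen2 = gen2;
            lastHeap = heap;
        }
    }
}

Note this reports the managed heap only; the pod working set seen by Kubernetes can stay much higher, which is exactly the gap discussed in this thread.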

So we had a breakthrough by switching from the default Debian aspnet image to the Alpine image! Instead of 250 Mi, the pod then mostly used 65 Mi! Also, the memory was released (which is not really the case with the Debian image). Alpine is generally recommended for microservices…

@dotnet/gc

I am coming back to this one because we have an issue in our k8s cluster and we are wasting a ton of resources. We have a simple API allocating 6 strings of 10 KB each, with the GC in workstation mode.

var s1 = new string('z', 10 * 1024); // one of the six 10 * 1024-character strings the API allocates

We hit the API at 600 rpm for 2 minutes. Memory goes up to 230 MB from an initial 50 MB and stays there forever.

[memory usage screenshot]

When I run it locally, the GC is called and memory is released fine, as you can see in the dotMemory screenshot. I am trying to figure out whether it is an issue with the GC or something else.

[dotMemory screenshot]

@plaisted We are facing a similar issue. After spikes the memory stays high, so the HPA doesn’t scale down the pods. Did you manage to get the memory returned to the OS, either with the GC env vars or with GC workstation mode? Did either of these work?

The controller should look like this:

// Stream the blob directly into the response body instead of buffering it in a MemoryStream.
Response.ContentLength = fileLength;
await blobLib.DownloadToAsync(Response.Body);
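
For context, a fuller (hypothetical) version of that controller, assuming the Azure.Storage.Blobs client, might look like the sketch below; the point is that the blob is copied straight into Response.Body and no file-sized MemoryStream is ever allocated:

using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("files")]
public class DownloadController : ControllerBase
{
    private readonly BlobContainerClient _container; // assumed to be registered in DI

    public DownloadController(BlobContainerClient container) => _container = container;

    [HttpGet("{name}")]
    public async Task DownloadAsync(string name)
    {
        BlobClient blob = _container.GetBlobClient(name);
        var props = await blob.GetPropertiesAsync();

        Response.ContentType = "application/octet-stream";
        Response.ContentLength = props.Value.ContentLength;

        // Stream in chunks directly to the client instead of buffering the whole file.
        await blob.DownloadToAsync(Response.Body);
    }
}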

@davidfowl We are allocating the stream like this: var targetStream = new MemoryStream(fileLength);, where fileLength can be several GB. We found that if we create the stream without an initial capacity, it ends up allocating much more memory than the actual file size.
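
For what it’s worth, the “much more memory than the actual file size” is consistent with how MemoryStream grows when no capacity is given: the internal buffer doubles each time it fills, so old and new buffers briefly coexist and every discarded buffer over 85 KB lands on the large object heap. A small sketch of the difference (sizes are illustrative):

using System.IO;

class MemoryStreamSizing
{
    static void Main()
    {
        int fileLength = 100 * 1024 * 1024; // illustrative; in this issue it is several GB

        // No capacity: the buffer grows 256 B -> 512 B -> ... by doubling, so filling it
        // with fileLength bytes allocates a series of ever-larger LOH arrays along the way.
        var unsized = new MemoryStream();

        // Capacity up front: a single buffer of exactly fileLength bytes is allocated
        // (still a large-object-heap allocation, but only one).
        var sized = new MemoryStream(fileLength);
    }
}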

@Maoni0 You are right, we ran into OOM only on very few occasions, so memory is freed under pressure. But what we would need is for it to be freed immediately after the controller returns and the stream is deallocated.

Because the needs in a Kubernetes cluster are different. There are, e.g., three big machines, and Kubernetes schedules many pods on them. Based on different metrics it creates or destroys pods (horizontal pod autoscaling) @egorchabala.

If some pod does not free memory even though it could, Kubernetes cannot use that memory for scheduling other pods and the autoscaling does not work. It also makes memory monitoring more difficult.

Is there any way to make the GC release memory as soon as possible, even when there is no high memory pressure? Do you still need a different trace or anything of the like?

@L-Dogg we are currently using version 11.2.2.

@Maoni0 Sorry for my late response. The GC was invoked on its own in the local environment. I realized that they weren’t defining memory limits in the pods, so the 75% heap size limit wasn’t applied. After defining memory limits, the memory in the pods no longer goes way up and stays within the limits with no performance degradation. We have also defined autoscaling at 85% to allow scaling down. However, even with only a small number of requests the pod memory stays stuck close to the limits. I will try to capture a top-level GC trace.
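
One way to sanity-check which memory limit the GC actually picked up inside the container is GC.GetGCMemoryInfo() (available since .NET Core 3.0). A minimal sketch; with a pod memory limit set, TotalAvailableMemoryBytes should reflect the cgroup limit (and the default heap hard limit is 75% of it), otherwise it reflects the whole node:

using System;

public static class GcLimitCheck
{
    // Log what the GC believes the memory limit is inside the container.
    public static void Print()
    {
        var info = GC.GetGCMemoryInfo();
        Console.WriteLine($"Memory available to the GC: {info.TotalAvailableMemoryBytes:N0} bytes");
        Console.WriteLine($"High memory load threshold: {info.HighMemoryLoadThresholdBytes:N0} bytes");
        Console.WriteLine($"Current heap size:          {info.HeapSizeBytes:N0} bytes");
    }
}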

Hi @Maoni0, yes, my mistake: by “Free objects” I meant the amount that is not actually used in memory and needs to be released by the GC. For example, the last dump I got is about 18 GB, while the heap size is only about 400 MB. When I look at it with WinDbg, the command !dumpheap -type Free gives the following output:

Count     TotalSize    Class Name
345354    8841424      Free

So if I’m not mistaken, that shows about 8 MB of Free objects. As far as I know, the GC does not release all the unused space back to the operating system. But while the private memory for this heap size is normally about 1.5 GB in a Windows environment, the memory used by the pod in the Kubernetes Linux container environment is about 18 GB. The interesting thing is, as I said before, that if I use a memory allocator like mimalloc via LD_PRELOAD, this amount is halved. But in any case I don’t know why the memory usage is so high. This only happens in Kubernetes in a Linux container environment; we don’t have this problem on Windows.

If there were a memory leak, wouldn’t the heap size be closer to 18 GB? And wouldn’t that show up in all the dumps I take? But as far as I can see, the maximum heap size in the dumps is 1.5 GB. Somehow the GC is not releasing the unused amount back to the OS. If I set the pod’s memory limit to 20 GB, after a certain time the memory used by the pod reaches 20 GB, but the heap size is 2 or 3% of that. The size of the heap increases during load and then decreases again when the load is gone. We don’t encounter OOM at this stage, but the total memory doesn’t decrease either; it stays close to the pod’s memory limit, in this example 20 GB. This prevents us from scaling down our application. It also hurts resource utilization badly.

Our prod applications are hosted on Azure AKS, but we have custom k8s installations in our dev and test environments, and the situation is the same there.

Hi @Maoni0, I’ve taken a lot of dumps and analyzed them before, and likewise the amount of Free objects is huge, but I would be happy to take another dump and look at the symbols you give me to solve this problem.

Generally what I see in k8s Linux dumps is that if the memory is 20 GB, the total heap size is about 400-500 MB and the amount of Free objects is about 19 GB. I don’t see this in Windows dumps with the same code: if the heap is 400-500 MB, the memory usage is at most 2-3 GB. This is the behavior of two different environments with the same code. We also see the same behavior in our other applications, but it is most pronounced in the application I sent the trace of, because it uses a lot of Roslyn CSharpScript and therefore creates a lot of dynamic assemblies. I checked them with the help of this document and they are unloading properly, and there aren’t any LoaderAllocators connected to these scripts. There are also native interop operations common to all applications, and we have native libraries that use at most 300-400 MB of memory.

As I said, the interesting part is the significant difference in memory usage between the same code running in a Windows environment and in a Linux Kubernetes container. As a side note, when I use mimalloc, jemalloc, or tcmalloc with LD_PRELOAD, memory usage drops by half, from 20 GB to about 10-11 GB. If you send me the symbols I would be willing to look at them in the dump. I can even share the dump file privately if you request. Thanks again.

Hello @Maoni0, thanks for your concern. As you mentioned in the document, I’ve collected 1800 seconds of trace from the start of the application (here). I don’t know if it matters, but the application is a CoreWCF project, and every 10 minutes, if the application is idle (and sometimes when memory peaks), it manually calls GC.Collect in aggressive mode. Manually triggering GC.Collect doesn’t seem to work well in Kubernetes environments because of this issue, but in Windows environments it helps to reduce overall memory usage. When the trace collection finished, the memory size of the application was about 20 GB on the Kubernetes dashboard and the total heap size was about 500 MB.

@Maoni0 The difference between request and limit is effectively the degree to which you are willing to allow overscheduling.

In your example, pod#0 would get evicted/OOM-killed for having any value > 100 MB once the machine gets oversubscribed.

If your application can’t handle working correctly (if perhaps more slowly) with only 100 MB, then it needs to have a higher ‘request’.

Request == Guaranteed
Limit == Optimistic/overcommitted

Pods that are > request are always at risk of being killed and restarted for running out of memory.

Hi @Maoni0, I am not an expert on this, but I can try to summarize the issues our Ops team reported.

In Kubernetes each pod defines a CPU/Mem request and a CPU/Mem limit [1]. When scheduling a pod, the scheduler reserves the requested resources.

However, a pod can consume resources up to its limit before it sees OOM. If many pods do not free memory, as we have seen in our case, the scheduler has fewer total resources across all machines to schedule new pods, which are needed e.g. for horizontal pod autoscaling.

Additionally, the monitored memory does not represent the memory actually needed, since much of it could be freed by e.g. a full garbage collection.

@davidfowl This is to confirm that your solution works like a charm. Many thanks again.

[1] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

If some pod does not free memory even though it could, Kubernetes cannot use that memory for scheduling other pods and the autoscaling does not work. It also makes memory monitoring more difficult.

This is the part I don’t understand; perhaps you could help me. Do you know how much memory each pod is supposed to get? Imagine the GC did kick in right after your large memory usage and Kubernetes packed more pods onto the same machine, but now your process needs to allocate a similar amount of memory again and can’t. Is this the desired behavior? Is it an opportunistic thing? If you get an OOM in one of your processes, do you treat it as a normal problem and just restart it?