runtime: .NET 6 container throws OutOfMemoryException

Description

When running inside a Docker container that has a hard memory limit set, GCMemoryInfo reports a value for HighMemoryLoadThresholdBytes that is higher than the memory actually available (TotalAvailableMemoryBytes).

Reproduction Steps

  1. Use default multistage docker build described at https://aka.ms/containerfastmode - I noticed the issue with mcr.microsoft.com/dotnet/runtime:6.0-bullseye-slim
  2. Create a simple net6.0 console app
  3. Add the following:
var gcMemInfo = GC.GetGCMemoryInfo();
Console.WriteLine($"Total Available:{gcMemInfo.TotalAvailableMemoryBytes}");
Console.WriteLine($"High Memory Threshold:{gcMemInfo.HighMemoryLoadThresholdBytes}");
  4. Build the container, e.g. docker build -t highmemtest -f .\Dockerfile .
  5. Run the container with a memory limit, e.g.: docker run --name highmemtest -it --memory=3g --rm highmemtest

With a limit of 3g I observe the following values:

   "totalAvailableMemoryBytes":2415919104,
"highMemoryLoadThresholdBytes":2899102924

Expected behavior

I would expect the high memory load threshold to be less than the available memory, so that it has a chance to kick in and trigger more aggressive GC before the process goes OOM.

To that end, I would expect to see values similar to those observed when running outside a container on Windows, where I see:

   "totalAvailableMemoryBytes":16909012992,
"highMemoryLoadThresholdBytes":15218111692

Or when running inside a container without a memory limit, where I observe:

   "totalAvailableMemoryBytes":13191815168,
"highMemoryLoadThresholdBytes":11872633651

Configuration

.NET Version: net6.0
OS: Observed on both Windows 10 and Amazon Linux 2 when running the mcr.microsoft.com/dotnet/runtime:6.0-bullseye-slim docker image
Arch: x64
Specific to: cgroup memory limit when running a docker container

Regression?

I did not confirm it, but I suspect .NET 5 exhibits the same behavior.

Other information

https://docs.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector#heap-limit

The section for heap limit states:

The default value, which only applies in certain cases, is the greater of 20 MB or 75% of the memory limit on the container. The default value applies if:
-    The process is running inside a container that has a specified memory limit.
-    System.GC.HeapHardLimitPercent is not set.

When running the container with --memory=3g, /sys/fs/cgroup/memory/memory.limit_in_bytes is 3221225472. 75% of that is 2415919104, which exactly matches the value of TotalAvailableMemoryBytes on GCMemoryInfo.

https://docs.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector#high-memory-percent

The section for High Memory Percent states:

By default, when the physical memory load reaches 90%, garbage collection becomes more aggressive about doing full, compacting garbage collections to avoid paging.

90% of the aforementioned 3221225472 is 2899102924, which matches the value of HighMemoryLoadThresholdBytes on GCMemoryInfo.
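For reference, the same arithmetic can be verified from inside the container with a small sketch along these lines (assuming the cgroup v1 path used above; cgroup v2 exposes the limit in a different file):

using System;
using System.IO;

var limitBytes = long.Parse(File.ReadAllText("/sys/fs/cgroup/memory/memory.limit_in_bytes"));
var gcMemInfo = GC.GetGCMemoryInfo();

// With --memory=3g the cgroup limit is 3221225472
Console.WriteLine($"cgroup limit: {limitBytes}");
Console.WriteLine($"75% of limit: {limitBytes * 75 / 100}"); // 2415919104
Console.WriteLine($"90% of limit: {limitBytes * 90 / 100}"); // 2899102924
Console.WriteLine($"TotalAvailableMemoryBytes:    {gcMemInfo.TotalAvailableMemoryBytes}");
Console.WriteLine($"HighMemoryLoadThresholdBytes: {gcMemInfo.HighMemoryLoadThresholdBytes}");

The two computed values line up with the TotalAvailableMemoryBytes and HighMemoryLoadThresholdBytes values reported above.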

I think the disconnect here is that HighMemoryLoadThresholdBytes does not take into consideration that the heap limit is 75% when running inside a container with a memory limit, and instead assumes a heap limit of 100%. It should either be given the same container-awareness logic that TotalAvailableMemoryBytes has (e.g. when running inside a container that has a specified memory limit set, the default value is 68%), or simply default to 90% of the calculated TotalAvailableMemoryBytes value.

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 7
  • Comments: 39 (22 by maintainers)

Most upvoted comments

Any updates on this issue?

@tmds I was just going to start investigating this problem (and #50414) today.

I just want to confirm that we are also seeing this issue with .NET 6 and Docker Swarm. We were seeing a steady stream of System.OutOfMemoryException. Disabling Docker memory limits stopped the errors. The hosts have plenty of free memory and memory limits were set sufficiently above what the containers were actually using. This is with Docker version 24.0.2.

Are you also uncommenting the ArrayPool<byte>.Shared.Return?

Yes, though the numbers at the top may need to be tweaked for optimal reproduction of the issue with GC.AllocateUninitializedArray(allocationSize). After I made the decision to go with ArrayPool.Shared I tweaked things to find the optimal reproduction case for that, so I don't know exactly how it will behave with the current settings.

Also, because your app runs so close to the memory limit, it will trim buffers on gen2 GC as well, undoing the performance benefit of pooling:

If you look at the Trim code, it uses Utilities.GetMemoryPressure()

https://github.com/dotnet/runtime/blob/c344d64b05a5530fa3a633a2e993f7bb7ca163fb/src/libraries/System.Private.CoreLib/src/System/Buffers/TlsOverPerCoreLockedStacksArrayPool.cs#L190

Which in turn uses HighMemoryLoadThresholdBytes to determine memory load.

https://github.com/dotnet/runtime/blob/c344d64b05a5530fa3a633a2e993f7bb7ca163fb/src/libraries/System.Private.CoreLib/src/System/Buffers/Utilities.cs#L40-L48
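The logic there boils down to roughly the following sketch (paraphrased; GetMemoryPressure here and the local MemoryPressure enum are stand-ins for the internal members, and only the high-pressure branch is shown):

using System;

internal enum MemoryPressure { Low, Medium, High }

internal static class PoolPressureSketch
{
    internal static MemoryPressure GetMemoryPressure()
    {
        // Trimming only becomes aggressive once the GC-reported memory load
        // crosses roughly 90% of HighMemoryLoadThresholdBytes.
        const double HighPressureThreshold = 0.90;

        GCMemoryInfo memoryInfo = GC.GetGCMemoryInfo();
        return memoryInfo.MemoryLoadBytes >= memoryInfo.HighMemoryLoadThresholdBytes * HighPressureThreshold
            ? MemoryPressure.High
            : MemoryPressure.Low;
    }
}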

This actually takes us full circle back to the original topic of this issue: HighMemoryLoadThresholdBytes is higher than TotalAvailableMemoryBytes. Previous responses stated that this is the default and by design, and the dialog then shifted to the OOM I've been encountering.

This means that for a container with a 3 GB limit, the ArrayPool.Shared implementation will only start aggressively trimming once the memory load exceeds 2,609,192,631 bytes (90% of HighMemoryLoadThresholdBytes), yet TotalAvailableMemoryBytes on a 3 GB container is just 2,415,919,104 bytes. So when running in a container with a memory limit, ArrayPool.Shared is unaware that the app is close to the memory limit and will never be able to trim aggressively.

Just because an app runs under a limit with workstation GC does not mean the app will not go OOM under the same limit with server GC.

Based on previous comments I was led to believe that an OOM situation on server GC that doesn't occur on workstation GC is an anomaly, and that server GC should attempt a full GC prior to going OOM:

another thing I should mention, if it’s not obvious, is that of course we would do a full compacting GC if we cannot commit memory based on the hardlimit.

yeah, if it doesn't get OOM with workstation GC but does with server GC, or if it doesn't get OOM with a larger limit but does with a smaller limit, that's clearly an issue that should be investigated.

So at this point I don't know what the expected behavior here is. I'm very curious to see what @janvorli's investigation reveals.

So I think the bug is in /src/coreclr/gc/gc.cpp, which is unfortunately too large to display on GitHub.

Now, I'm not a C++ dev, but I think I figured out what's going on.

In gc.cpp, lines 42772-42776 have the logic that sets the heap limit to 75%:

if (gc_heap::is_restricted_physical_mem)
{
    uint64_t physical_mem_for_gc = gc_heap::total_physical_mem * (uint64_t)75 / (uint64_t)100;
    gc_heap::heap_hard_limit = (size_t)max ((20 * 1024 * 1024), physical_mem_for_gc);
}

Lines 42978-42995 have the logic that sets the high memory percentage to 90%:

else
{
    // We should only use this if we are in the "many process" mode which really is only applicable
    // to very powerful machines - before that's implemented, temporarily I am only enabling this for 80GB+ memory.
    // For now I am using an estimate to calculate these numbers but this should really be obtained
    // programmatically going forward.
    // I am assuming 47 processes using WKS GC and 3 using SVR GC.
    // I am assuming 3 in part due to the "very high memory load" is 97%.
    int available_mem_th = 10;
    if (gc_heap::total_physical_mem >= ((uint64_t)80 * 1024 * 1024 * 1024))
    {
        int adjusted_available_mem_th = 3 + (int)((float)47 / (float)(GCToOSInterface::GetTotalProcessorCount()));
        available_mem_th = min (available_mem_th, adjusted_available_mem_th);
    }

    gc_heap::high_memory_load_th = 100 - available_mem_th;
    gc_heap::v_high_memory_load_th = 97;
}

It’s a bit confusing, but the comment actually only applies to the inner if (gc_heap::total_physical_mem >= ((uint64_t)80 * 1024 * 1024 * 1024)) part.

The important parts are line 42986 (int available_mem_th = 10;) and line 42993 (gc_heap::high_memory_load_th = 100 - available_mem_th;).

So that’s where the 90% comes from.
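Putting those two lines together for the 3 GB container, and assuming total_physical_mem is the 3,221,225,472-byte cgroup limit (consistent with the 75% calculation above):

high_memory_load_th = 100 - available_mem_th = 100 - 10 = 90
90% of 3,221,225,472 ≈ 2,899,102,924 bytes

which is exactly the HighMemoryLoadThresholdBytes value observed earlier.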

Fix

A possible fix would be: if running in a restricted physical memory environment, default the high memory threshold to 68% instead of 90% (to complement the 75% max heap size). This would be the code change for that:

else
{
    int available_mem_th = 10;
    // If the hard limit is specified, default to 68% instead of 90% of physical memory
    if (gc_heap::is_restricted_physical_mem)
    {
        available_mem_th = 32;
    }

    // We should only use this if we are in the "many process" mode which really is only applicable
    // to very powerful machines - before that's implemented, temporarily I am only enabling this for 80GB+ memory.
    // For now I am using an estimate to calculate these numbers but this should really be obtained
    // programmatically going forward.
    // I am assuming 47 processes using WKS GC and 3 using SVR GC.
    // I am assuming 3 in part due to the "very high memory load" is 97%.
    if (gc_heap::total_physical_mem >= ((uint64_t)80 * 1024 * 1024 * 1024))
    {
        int adjusted_available_mem_th = 3 + (int)((float)47 / (float)(GCToOSInterface::GetTotalProcessorCount()));
        available_mem_th = min (available_mem_th, adjusted_available_mem_th);
    }

    gc_heap::high_memory_load_th = 100 - available_mem_th;
    gc_heap::v_high_memory_load_th = 97;
}
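For the 3 GB (--memory=3g) container, the numbers would then line up as follows (my own back-of-the-envelope arithmetic, not output from a patched runtime):

high_memory_load_th = 100 - 32 = 68
68% of 3,221,225,472 ≈ 2,190,433,321 bytes (new high memory threshold)
75% of 3,221,225,472 = 2,415,919,104 bytes (heap hard limit / TotalAvailableMemoryBytes)

so the high memory threshold would fall below the heap hard limit, and the GC (and ArrayPool trimming) would get a chance to react before the limit is hit.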

That would resolve the issue; however, I think it is indicative of a bigger problem: memory_load is used throughout the file to determine what percentage of memory is in use, but it appears to be based purely on total_physical_mem, not heap_hard_limit.

I wonder if

void gc_heap::get_memory_info (uint32_t* memory_load,
                               uint64_t* available_physical,
                               uint64_t* available_page_file)
{
    GCToOSInterface::GetMemoryStatus(is_restricted_physical_mem ? total_physical_mem  : 0,  memory_load, available_physical, available_page_file);
}

could be updated to GCToOSInterface::GetMemoryStatus(heap_hard_limit ? heap_hard_limit : 0, memory_load, available_physical, available_page_file); or something along those lines, though I suspect that would have wide-reaching consequences that would require a very thorough audit.

@saber-wang your issue seems to be unrelated to this one. Could you please create a separate bug with a repro or dumps? Usually OOMs are not GC related, but are caused by the application not freeing memory appropriately.

Also, for this particular issue, this looks to be something to be fixed in ArrayPool?

@janvorli hah good catch, yeah I can’t account for the managed and native runtime parts. Glad to hear the repro is working!

@tmds yes, I'm aware of the implications of the await Task.Delay(3) - in fact, I purposefully placed it there because I wanted to draw attention to why my production app had unexpected allocations. Below it, I have a commented-out line for GC.AllocateUninitializedArray<byte>(allocationSize); which also reproduced the issue.

I think the issue is a bit more severe than saying it uses extra memory. While that is technically correct, it consistently uses extra memory to the point where the pooling provides almost no benefit at all, and I would argue that most developers would not expect that behavior from ArrayPool.Shared. Since the number of buffers per pool size is limited, and each thread has its own pool, it appears that any async usage / thread jumps between renting and returning can easily lead to one thread exhausting its available buffers while another thread discards returned buffers due to being at the limit (see the sketch at the end of this comment).

There are 2 problems with this.

  1. The documentation for ArrayPool<T>.Shared makes no mention of this behavior. I feel it should caution against renting and returning on different threads. (It would also be wonderful if a separate default pool could be offered that forgoes the per-core CPU advantages in favor of reliably pooling memory in an async setting.)

  2. Look at the usage examples for System.IO.Pipelines, which are the first search result. The recommended pattern is to concurrently read and process on 2 separate threads. In my production app this led to repeated buffer allocations, which is one of the things that - as per the docs - pipelines should prevent.

It's of course possible that I misinterpreted the behavior, but between #27748 and the behavior observed in both my production app and the reproduction sample, it definitely seems easy to run into unexpected allocations.
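For reference, the pattern I'm describing is roughly the following sketch - not the actual reproduction app; the allocationSize and iteration count are illustrative only:

using System;
using System.Buffers;
using System.Threading.Tasks;

const int allocationSize = 64 * 1024; // illustrative; not the settings from the actual repro

for (int i = 0; i < 10_000; i++)
{
    byte[] buffer = ArrayPool<byte>.Shared.Rent(allocationSize);
    int rentedOn = Environment.CurrentManagedThreadId;

    // The await lets the continuation resume on a different thread-pool thread,
    // so Rent and Return frequently happen on different threads.
    await Task.Delay(3);

    buffer.AsSpan().Fill(0xFF); // touch the buffer so it is actually committed

    if (Environment.CurrentManagedThreadId != rentedOn)
        Console.WriteLine($"Iteration {i}: rented on thread {rentedOn}, returned on thread {Environment.CurrentManagedThreadId}");

    ArrayPool<byte>.Shared.Return(buffer);
}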

That extra memory needed as the app runs could be enough for the OOM.

The absolute memory requirements of the reproduction do not exceed the available memory, so it runs without issues with workstation GC. Furthermore, with server GC it runs for a variable amount of time, so some number of iterations pass. It just seems like the GC tries to squeeze in a bit more than is available and fails without running a GC cycle that could have prevented it.