runtime: Multiple 'System.OutOfMemoryException' errors in .NET 7

I’m seeing an issue very similar to this one when running a memory-heavy app in a Linux container with a memory limit of more than 128 GB of RAM.

The app has been throwing random OutOfMemoryExceptions in many unexpected places since we migrated to .NET 7, even though it is under no memory pressure (usually with more than 30% of memory free).

I can see the original issue was closed, but I’m not sure whether it was fixed in the final .NET 7 release or whether the suggestion to set COMPlus_GCRegionRange=10700000000 is the expected workaround.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 5
  • Comments: 68 (35 by maintainers)

Most upvoted comments

This will be fixed in 7.0.3.

Can you try whether setting COMPlus_GCName=clrgc.dll (Windows) or COMPlus_GCName=libclrgc.so (Linux) makes the OOMs go away? We are working on a fix, but we are hoping this could serve as a temporary workaround. Thx.

@marcovr, it most likely would be, but it would be good to validate your scenario on the latest release just in case. Btw, we merged the fix into the .NET 7 servicing branch yesterday, so it should be available with the February servicing release. Thx.

Setting COMPlus_GCName=libclrgc.so resolves the issue for our setup on .NET 7.
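
A minimal sketch (not from the thread) of applying that workaround in a Linux container; the Dockerfile base image and the Kubernetes fragment are illustrative assumptions, only the COMPlus_GCName setting itself comes from the comments above.

    # Dockerfile: point .NET 7 at the standalone "Segments" GC shipped next to the runtime
    FROM mcr.microsoft.com/dotnet/aspnet:7.0
    ENV COMPlus_GCName=libclrgc.so

    # Kubernetes equivalent (container spec fragment):
    env:
      - name: COMPlus_GCName
        value: "libclrgc.so"

On .NET 7 the DOTNET_ prefix (DOTNET_GCName) works as well; COMPlus_ is kept here because that is the spelling used in the thread.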

OK, good to know. Yeah, as Maoni suggests, getting a dump or trace can help confirm whether it's the same issue. We hope to get it fixed in an upcoming servicing release.

Sorry, I was not aware of this till now… I apologize for the delay, and thank you so much for the repro, @arian2ashk.

I took a quick look and can definitely repro. This is because we are retaining WAY more memory with the default implementation (which we shouldn’t). The actual heap doesn’t differ that much, but after a run with k6 the default implementation retains a lot of memory in free_regions[1] and a lot in global_regions_to_decommit, e.g. it retains ~5 GB in decommit -

    0:077> ?? coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions
    unsigned int64 0x00000001`27127000
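
(For reference: 0x1`27127000 is 4,950,487,040 bytes, roughly 4.6 GiB, which lines up with the ~5 GB figure above.)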

The app does almost exclusively BGCs. @PeterSolMS, could you please take a look?

In .NET 7 we have enabled the new Regions functionality within the GC. Here are the details: https://github.com/dotnet/runtime/issues/43844. Since this was a foundational change, we also shipped a separate GC that keeps the previous “Segments” functionality, in case there are issues like this one. Going forward, we plan to use a similar mechanism to release newer GC changes, and we could have multiple GC implementations at some point in the future.
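
A hedged aside (not from the thread): one way to confirm which GC flavor a running process actually picked up is to check its mapped modules; the pidof call and the single-process assumption below are illustrative.

    PID=$(pidof dotnet)                     # assumes a single dotnet process in the container
    if grep -q libclrgc /proc/$PID/maps; then
        echo "standalone (Segments) GC libclrgc.so is loaded"
    else
        echo "default (Regions) GC inside libcoreclr.so is in use"
    fi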

OK, thanks for trying it out. We will do additional validation and add it to a .NET 7 servicing release (due to the holidays it might be in February).

@Maoni0 @hoyosjs good news: we found the issue and it was not related to the .NET runtime. The memory allocator used by RocksDB by default on Linux can severely leak memory, and switching to jemalloc fixed the issue on the server where we were observing the problem. Thanks again for the support; we can close the issue now!

The original issue we reported above, of OutOfMemoryExceptions at low memory bounds, has gone away (with COMPlus_GCName=libclrgc.so; we’ve not removed that yet). But we might be experiencing something similar to the above now. I'm not sure how to get coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions, but we have a few core dumps with a lot of native resident memory looking like this: [screenshot of the native memory breakdown]. There is also a warning from dotMemory that frequent GCs are happening (taking 85% of the time, it thinks).
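
A hedged aside (not part of the original comment): a low-friction way to watch GC frequency and heap size while the process is live, before going to full core dumps, is dotnet-counters; <pid> is a placeholder.

    # one-time install (assumes an environment where dotnet global tools can run)
    dotnet tool install --global dotnet-counters
    # stream the built-in System.Runtime counters, which include
    # "% Time in GC since last GC", "GC Heap Size" and per-generation sizes
    dotnet-counters monitor --process-id <pid> System.Runtime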

Edit: It’s also interesting that I noticed tighter GC behaviour on one of our processes by turning off server GC. It was shooting up to about 2 GiB in a container limited to 4 GiB, but with <ServerGarbageCollection>false</ServerGarbageCollection> it restricted itself much better, to around 0.6 GiB, with no noticeable performance impact. Since you mention background GC…
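
For reference (a sketch, not quoted from the thread): the same switch can be made either at build time via the MSBuild property mentioned above or at deploy time via an environment variable; both are standard .NET GC configuration knobs.

    <!-- .csproj -->
    <PropertyGroup>
      <ServerGarbageCollection>false</ServerGarbageCollection>
    </PropertyGroup>

    # container/environment equivalent (0 = workstation GC, 1 = server GC)
    DOTNET_gcServer=0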

But I'm not sure if we’ve got other problems kicking around: we run multiple processes in a container and are seeing some OOM-killer hits on our processes. I'm not sure how the GC accounts for HeapHardLimitPercent, which will default to 75% of the cgroup limit for each of the two processes in terms of the amount of allocated memory. I don’t think there is a way for it to tell whether the allocated memory reported by the system belongs to other processes within the running container, so I’d expect problems in this multiprocess scenario anyway…
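
A hedged sketch (not from the thread) of working around that in a multiprocess container: give each process an explicit heap hard limit so the two 75% defaults don't add up past the cgroup limit. The 35%/30% split is purely illustrative; the environment variable takes a hex percentage.

    # process A: cap its GC heap at 35% of the container limit (hex 23 = 35)
    DOTNET_GCHeapHardLimitPercent=23
    # process B: cap its GC heap at 30% of the container limit (hex 1E = 30)
    DOTNET_GCHeapHardLimitPercent=1E

Note this only bounds each process's managed heap; native allocations (such as the RocksDB allocator mentioned elsewhere in the thread) are not covered by it.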

I had another look at the free_regions[1] issue, and the amount of memory there appears to be normal for the repro scenario. The amount of memory in the LOH fluctuates a lot, and so we retain some memory rather than incurring the overhead of decommitting/recommitting the memory.

So that leaves the issue that we are not decommitting the memory we were planning to decommit - I will work out a fix for this.

I was able to repro this as well. Debugging it, I found there is indeed a flaw in our logic that causes us to stop decommitting regions if we do BGCs almost exclusively, as in the test case. The fix for this issue should be fairly simple.

It is not clear yet why so many regions end up in free_regions[1]; perhaps there is a second issue. I will investigate.

OK, yeah, trying with libclrgc should help narrow it down. Let us know how it goes. Thx.

Correct, it shouldn't be related to the WKS/SVR config. @qwertoyo, it might be worthwhile to try the private build shared above. We are hoping to release the fix in the next month. Thx!

I hit this issue with server GC, so I don’t think it will improve it.

@mangod9 @Maoni0 I just got the chance to test the library you sent today, and after a day of usage under load there have been no issues so far!

I have copied a private libcoreclr.so to https://1drv.ms/u/s!AtaveiZOervriJhkWC64gVEV8dAHug?e=IyBaP3, if you want to give that a try. You will want to remove the COMPlus_GCName config.
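
A sketch (with assumed paths) of how such a private libcoreclr.so could be dropped into a container image; the base image, the 7.0.2 version directory, and the /usr/share/dotnet install location are assumptions that must match the runtime actually in the image.

    # Dockerfile fragment (hypothetical version/path)
    FROM mcr.microsoft.com/dotnet/aspnet:7.0
    COPY libcoreclr.so /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.2/libcoreclr.so
    # and, as noted above, make sure COMPlus_GCName is no longer set for the app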

Would it be possible to try out a private fix? We could deliver a libclrgc.so to you, and you could use it the same way you used the shipped version. That would be really helpful.

Thanks for reporting this issue. This looks like it's separate from the original issue – we are investigating something similar with another customer. Would it be possible to share a dump when the OOM happens?
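
A hedged sketch of two standard ways to capture such a dump on Linux; the pid and output path are placeholders.

    # on demand, while the OOMs are being observed
    dotnet tool install --global dotnet-dump
    dotnet-dump collect --process-id <pid> --type Full

    # or let the runtime write a dump automatically if the OutOfMemoryException
    # goes unhandled and crashes the process (example path)
    export DOTNET_DbgEnableMiniDump=1
    export DOTNET_DbgMiniDumpType=4        # 4 = full dump
    export DOTNET_DbgMiniDumpName=/tmp/coredump.full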