runtime: .NET 7 application hangs during GC

Description

We host our application in Azure App Service in the host dependent mode.

Each instance of our application can end up stuck in a busy loop where it consumes 100% of the CPU and no other HTTP requests can reach the app. This only started happening after we upgraded to .NET 7.

It looks like this is due to one thread being stuck performing a GC, so all the other managed threads are paused.

Our application has health checks that run against it. One of those checks runs roughly once per second. The check queries a PostgreSQL database and pulls down records (tens to thousands). This is where we have seen the hang occur, although the same code can be called by other parts of the system; most likely the health check is simply the trigger because it is called much more frequently.

Here are the unmanaged portions of the GC allocation stacks from two separate dumps taken during the high-CPU periods (this is me doing dump debugging in Visual Studio 2022):

[screenshots: unmanaged GC allocation stacks from the two dumps]

I believe it is running the Workstation GC, based on the WKS:: namespace (and because the other threads are frozen waiting on the GC, and the GC is being performed directly on the user thread). I believe this is happening because the App Service plan instance shows up as having one processor: we specify Server GC in our configuration, but I believe the runtime falls back to the workstation GC when only one processor is detected.
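For reference, the documented way to request Server GC looks roughly like the following (a sketch of a typical runtimeconfig.json, not our exact configuration; the equivalent project-file switch is <ServerGarbageCollection>true</ServerGarbageCollection>):

    {
      "runtimeOptions": {
        "configProperties": {
          "System.GC.Server": true
        }
      }
    }

If that fallback behavior is right, this setting would not help on a one-processor instance.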

In one case I generated three dumps 5 minutes apart. In all three cases the managed code had not moved and was in the exact same place. The second unmanaged stack above is the top of the stack in those cases (although the topmost function changes in each, which is why I thought allocate_in_condemned_generations was the culprit). Here is the managed portion of the stack from those dumps:

[screenshot: managed portion of the stack from the three dumps]

Here are some outputs from the CrashHangAnalysis MHT which AppService also kindly created (this is from one of the three dumps mentioned):

[screenshot: CrashHangAnalysis thread summary]

As you can see from the thread times, almost all of the CPU time was consumed by Thread 48. In each successive 5-minute dump you can see it has accumulated 5 more minutes of CPU time:

[screenshot: per-thread CPU times across the successive dumps]

Here is the full call stack for Thread 48 reported by the CrashHangAnalysis:

[screenshot: full call stack for Thread 48 from the CrashHangAnalysis]

There are also multiple CLR runtimes loaded in the process. I believe that App Service itself loads .NET Framework 4.8; we are using .NET 7.0.0:

[screenshot: loaded CLR runtime modules]

I attempted to find this issue reported already but my searching turned up nothing 😦 Apologies if I missed it.

Reproduction Steps

At this point I don't have a way to reproduce it reliably. One of our systems seems to be falling into this particular hole once a day or so. The SQL query on that system is a bit larger than other systems, but not massively so.

Expected behavior

To just have the GC complete its work and return to our application.

Actual behavior

It appears the GC gets stuck in WKS::gc_heap::allocate_in_condemned_generations. I see a retry label in there; perhaps it is stuck in an infinite loop?

Regression?

I believe so. We never saw this on .NET 6 (several sub-versions) or .NET 5. We ran similar code on earlier versions of .NET Core and never saw any issues either.

Known Workarounds

Perhaps downgrading to .NET 6? It is uncertain whether this would also happen with .NET 7.0.1; I have not had a chance to test that yet. It is also unclear whether this is a workstation GC issue only; my guess is yes.

Configuration

.NET 7.0.0 (in the dumps this shows up as CLR version v7.0.22)
Azure App Service Platform version: 99.0.7.620
Windows Server 2016 - 14393
Number of Processors - 1
Architecture: x64

Other information

It seems like some sort of infinite loop issue in the gc_heap::allocate_in_condemned_generations function. But that is me using my "Jump to Conclusions" mat.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 95 (40 by maintainers)

Most upvoted comments

No worries! I am just glad we are able to help at all.

We all really dig .NET and appreciate all the great stuff you guys do.

"Inadvertent stress test" sounds like a decent band name!

Just let me know when there is a fixed clrgc DLL and we can install it.

@KenKozman-GE I also agree. I think there should be three albums: 'gen0', 'gen1' & 'gen2', then leave a gap to do a best of + unreleased materials called 'Unmanaged Heap'.

On a side note, mine is still running fine with the first patch. It's approaching 24 hours, but that is still within the window in which I've seen the issue occur. I might give it another 24 hours before I switch to try the new DLL.

TIL: You can find the modules in use via the Kudu process explorer! Thanks @Simon-Gregory-LG!

Haha, @KenKozman-GE thanks for your pointers by the way. Helped me get there quickly.

TIL: 'TIL'!

This fix should be available in 7.0.7 released last week: https://devblogs.microsoft.com/dotnet/june-2023-updates/

@manandre the reason I haven't checked in is that I've been running more stress tests and hit multiple problems with that - I hit some 8.0 problems (not in GC) when I tried to run stress on 8.0, so I ran this on 7.0 and hit another issue in the vicinity, so I'd like to make a fix for that as well. I should have it this week.

We will backport all fixes to 7.0 (the fix for the infinite loop in allocate_in_condemned_generations that both Ken and Simon hit, and the other 2 that showed up in verify_regions and only affected Ken).

@Maoni0 The next release for version 7.0 is planned on March 14th. Any chance these fixes will be part of it?

@rbouallou, absolutely. I just put libclrgc.so along with src and symbols at https://github.com/Maoni0/clrgc/tree/main/issues/80073/demotion_fix/v2/linux. Please let me know how this works out for you.

We've been running with the patched libclrgc.so for just over a week now and it's still running with no issues - the process would hang every ~2 days with the default runtime.

@Maoni0 You mentioned fixes will be backported to 7.0. Do you have any idea when the release will come out? Many thanks for the support.

So it's still going strong with 7,0,323,56101 and no unresponsiveness or exceptions (see below), whereas our instances without the patch have gone down several times in this period.

We restarted it once to deploy the second patch, but we're again at 3+ days of stability with the new patch, so this looks very promising - amazing job @Maoni0 😃!

I take it that this hotfix is going to take a while before it makes it into the main .NET 7.0 runtimes and gets deployed to all the App Service instances? In the interim, should we consider this patch suitable for production deployment, or should we consider rolling back to .NET 6.0 for the time being?


Health Check Metrics

Red - .NET 7.0 version deployed
Green - clrgc.dll patch 7,0,323,56100 deployed
Dark Green - clrgc.dll patch 7,0,323,56101 deployed

[chart: health check metrics]

Wait… they give you guys weekends? I got to talk to my agent.

I just looked at our guinea pig instance over here and it is still rumbling along, no crashes or infinite loops since the latest install. But I don't think our replication cycle was as reliable as @Simon-Gregory-LG's. Maybe it will hit something this weekend.

@Maoni0 no problem!

Lots going on today, so I forgot to post the confirmation; here it is, all up and running and ready to enjoy the weekend:

[screenshot: deployment confirmation]

I believe I know what the bug is. The problem is in process_remaining_regions: when we go through the remaining pinned plugs, for the region that contains the very last pinned plug, we just set its plan_gen_num to 0 (meaning we put this region in gen0, if we are doing a compacting GC). But this should have called decide_on_demotion_pin_surv instead (like the previous remaining regions with pinned plugs), which decides whether we should put this region in gen0 based on how big the pinned survival is. Normally this would not cause a functional problem (it could cause a perf problem), but because there was that giant pinned plug it makes it not fit when it's unpinned and we call allocate_in_condemned_generations on it. So to make this bug show up you need the following conditions to all be true -

  • We did a gen2 compacting GC that generated a pinned plug in a gen2 region almost as big as the whole region. My guess for why there's this giant pinned plug is that the gen2 region was already really compact, so when we called allocate_in_condemned_generations on the non-pinned plugs that are next to some pinned plugs in it, we discovered we couldn't move the non-pinned plugs anyway, so we artificially pinned them and formed a giant pinned plug. During this GC those objects were no longer pinned, so we have one giant non-pinned plug.
  • This gen2 region needs to be the last region with pinned plugs;
  • This gen2 region hasn't been consumed by allocate_in_condemned_generations yet, so it was processed by process_remaining_regions.

So that's very hard to hit. I'm making a fix. I need to test it before giving it to you, so ETA tomorrow.
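To make the difference concrete, here is a small self-contained toy model of the policy described above (this is not runtime source; the struct, the threshold, and main are illustrative assumptions, and only the names decide_on_demotion_pin_surv and plan_gen_num come from the explanation):

    // Toy model (not runtime source) of the demotion policy described above.
    #include <cstddef>
    #include <cstdio>

    struct region_info
    {
        size_t size;          // total size of the region
        size_t pinned_surv;   // bytes surviving in pinned plugs
        int    plan_gen_num;  // generation the region is planned into
    };

    // Mirrors what decide_on_demotion_pin_surv is described as doing:
    // demote the region to gen0 only if its pinned survival is small enough.
    void decide_on_demotion_pin_surv (region_info& r)
    {
        const size_t demotion_threshold = r.size / 4;   // illustrative threshold
        r.plan_gen_num = (r.pinned_surv <= demotion_threshold) ? 0 : 2;
    }

    int main ()
    {
        // A gen2 region that is almost entirely one giant pinned plug.
        region_info last_region { 0x400000, 0x3fffe0, 2 };

        // Buggy path described above: the region containing the very last
        // pinned plug was planned into gen0 unconditionally.
        last_region.plan_gen_num = 0;

        // Fixed path: apply the same demotion policy as the other remaining
        // regions, which keeps this nearly-full region out of gen0.
        decide_on_demotion_pin_surv (last_region);
        std::printf ("planned generation: %d\n", last_region.plan_gen_num);
        return 0;
    }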

No worries. I know I have baked my noodle on various post-mortem analyses and it can be sometimes fun and sometimes frustrating (I like to think that may be what being a detective is like). I've wasted a not-insignificant portion of my life chasing down various threading bugs that have been gross (although perhaps not anywhere near as complex as the GC stuff). Nothing quite like the feeling when you finally get it all figured out though 😃

No worries on the details, I can "sort of" follow the gist of it (and certainly as much as I need to). It all does sound quite interesting though 😃

I'll try to get the clrgc.dll added in whenever you get it over to us, no rush.

Thank you! I'm downloading the dump now and will take a look tonight.

Okay, I have made additional changes based on @cshung's recommendation to try to return this to how it was when it was happening more reliably.

Still no reproductions, but hopefully we will catch one soon.

Oh yeah we could do that I think.

I only mention the clrgc.dll bit above to help me remember and in case others see this type of issue in the future.
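(For anyone who finds this later: as I understand it, a standalone GC build like clrgc.dll sits next to the runtime's coreclr.dll and is selected with the documented GCName switch; the exact App Service wiring is not shown here, this is just a sketch.)

    # environment variable form (e.g. an App Service app setting)
    DOTNET_GCName=clrgc.dll

    # or the equivalent runtimeconfig.json property under configProperties
    "System.GC.Name": "clrgc.dll"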

Anyway, just let me know and we can try to help however.

I took a quick look at the dump and hypothesize that it may be an infinite loop because we forgot about the possibility of pinning.

Roughly speaking, the allocate_in_condemned_generations method does something along these lines:

allocate_in_condemned_generations(...size...)
    retry:
    if we hit a pin or run out of space
        if we hit a pin
            deal with it
            goto retry
        if we run out of space
            deal with it
        goto retry
    ...

The key idea is that the "deal with it" step should resolve the situation so that we should no longer be hitting a pin or running out of space; as such, the loop is meant to run for at most 3 iterations (one for a pin, one for space, and the last one to succeed).

There are no synchronization primitives involved in this loop, so this is not a thread waiting on something else; it is pretty much a plain infinite loop.

There are no more pins in the pin queue (i.e. gc_heap::mark_stack_bos == gc_heap::mark_stack_tos (c.f. gc_heap::pinned_plug_que_empty_p)), therefore, the problem has something to do with space.
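Spelled out, the emptiness test described here is just an index comparison (a sketch of the check as described, not the verbatim source):

    // The pin queue is empty when the bottom-of-stack index has caught up
    // with the top-of-stack index.
    inline bool pinned_plug_que_empty_p (size_t mark_stack_bos, size_t mark_stack_tos)
    {
        return (mark_stack_bos == mark_stack_tos);
    }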

Looking at the implementation of gc_heap::size_fit_p, I understand why we get stuck running out of space.

We have a just-fit allocation context:

0:046> ?? gen->allocation_context.alloc_ptr
unsigned char * 0x000001d2`72c00020

0:046> ?? gen->allocation_context.alloc_limit
unsigned char * 0x000001d2`73000000

0:046> ?? size
unsigned int64 0x3fffe0

Note that 0x000001d272c00020 + 0x3fffe0 == 0x000001d273000000, so the allocation context should fit the allocation exactly - probably prepared that way by the deal-with-it code for space, which is why it is so exact.

But then old_loc != nullptr, so the code in size_fit_p insists on a bit more memory (i.e. size + Align(min_object_size)), so that check will fail.

We are probably stuck in that situation, and that is why the loop isn't breaking. What baffles me is that old_loc is already considered in the deal-with-it section for running out of space by being passed as an argument to grow_heap_segment; I am not sure why we aren't processing that correctly.
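To put numbers on the failing check, here is a tiny standalone illustration using the values from the dump above (the extra 24 bytes standing in for Align(min_object_size) is an assumption; everything else comes from the debugger output):

    // Why the just-fit allocation context never satisfies the fit check when
    // old_loc != nullptr, per the description above.
    #include <cstdint>
    #include <cstdio>

    int main ()
    {
        const uint64_t alloc_ptr   = 0x000001d272c00020;  // from the dump
        const uint64_t alloc_limit = 0x000001d273000000;  // from the dump
        const uint64_t size        = 0x3fffe0;            // from the dump
        const uint64_t extra       = 24;                  // assumed Align(min_object_size)

        const bool old_loc_is_null = false;               // old_loc != nullptr here

        // What size_fit_p is described as requiring in this case: the plug
        // plus room for one extra minimal object when old_loc is non-null.
        const uint64_t required   = size + (old_loc_is_null ? 0 : extra);
        const uint64_t free_space = alloc_limit - alloc_ptr;

        std::printf ("free: 0x%llx  required: 0x%llx  fits: %s\n",
                     (unsigned long long)free_space,
                     (unsigned long long)required,
                     (free_space >= required) ? "yes" : "no");
        return 0;
    }

The free space equals the plug size exactly, so the check fails on every retry and the loop never exits.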