runtime: .NET 7 application hangs during GC
Description
We host our application in Azure App Service in the host dependent mode.
Each instance of our application can end up getting stuck in a busy loop where it eats 100% of the CPU and no other HTTP calls are able to make it into the app. This only started happening when we upgraded to .NET 7.
It looks like this is due to one thread being stuck performing a GC, so all the other managed threads are paused.
Our application has health checks that run against it. One of those checks runs roughly once per second. This check ends up querying a pgsql database and pulling down records (10s-1000s). This is where we have seen the hang occur, although the same code can be called by other parts of the system. Most likely the health check is just the trigger because it is called so much more frequently.
Here are a couple of the unmanaged portions of the GC allocation call stacks from two separate dumps taken during the high-CPU periods (this is me doing dump debugging in Visual Studio 2022):

I believe it is running the Workstation GC, based on the WKS:: namespace (and because the other threads are frozen waiting on the GC, and because the GC is being performed directly on the user thread). I believe this is happening because the App Service plan instance shows up as having 1 processor. We specify the Server GC in our configuration, but I believe the runtime falls back to the Workstation GC if only one processor is detected.
In one case I generated three dumps 5 minutes apart. In all three cases the managed code had not moved and was in the exact same place. The second unmanaged stack above is the top of the stack in those cases (although the very topmost function changes in each, which is why I thought allocate_in_condemned_generations was the culprit). Here is the managed portion of the stack from those dumps:

Here are some outputs from the CrashHangAnalysis MHT which AppService also kindly created (this is from one of the three dumps mentioned):

As you can see in the thread times, almost all of the time was eaten up by Thread 48. In each successive 5-minute dump you can see it has eaten up 5 more minutes of CPU time:

Here is the full call stack for Thread 48 reported by the CrashHangAnalysis:

There are also multiple CLR runtimes loaded in the process. I believe that App Service itself loads .NET Framework 4.8; we are using .NET 7.0.0:

I attempted to find this issue reported already but my searching turned up nothing. Apologies if I missed it.
Reproduction Steps
At this point I don't have a way to reproduce it reliably. One of our systems seems to be falling into this particular hole once a day or so. The SQL query on that system is a bit larger than on other systems, but not massively so.
Expected behavior
To just have the GC complete its work and return to our application.
Actual behavior
It appears the GC gets stuck in WKS::gc_heap::allocate_in_condemned_generations. I see a retry label in there, perhaps it is stuck in an infinite loop?
Regression?
I believe so. We never saw this on .NET 6 (several sub-versions) and .NET 5. We ran similar code with earlier versions of .NET Core and never saw any issues either.
Known Workarounds
Perhaps downgrading to .NET 6? It is uncertain whether this would happen with .NET 7.0.1; I have not gotten to test that yet. It is also unclear whether this is a workstation GC issue only; my guess is yes.
Configuration
.NET 7.0.0 (in the dumps this shows up as CLR version v7.0.22)
Azure App Service Platform version: 99.0.7.620
Windows Server 2016 - 14393
Number of Processors - 1
Architecture: x64
Other information
It seems like some sort of infinite loop issue in the gc_heap::allocate_in_condemned_generations function. But that is me using my "Jump to Conclusions" mat.
About this issue
- State: closed
- Created 2 years ago
- Comments: 95 (40 by maintainers)
No worries! I am just glad we are able to help at all.
We all really dig .Net and appreciate all the great stuff you guys do.
@KenKozman-GE I also agree. I think there should be three albums: "gen0", "gen1" & "gen2", then leave a gap to do a best of + unreleased materials called "Unmanaged Heap".
On a side note mine is still running fine with the first patch. Approaching 24 hours, but it's still within the window I've seen the issue occur. I might give it another 24 hours before I switch to try the new dll.
"Inadvertent stress test" sounds like a decent band name!
Just let me know when there is a fixed clrgc DLL and we can install it.
Haha, @KenKozman-GE thanks for your pointers by the way. Helped me get there quickly.
TIL:
'TIL'! This fix should be available in 7.0.7, released last week: https://devblogs.microsoft.com/dotnet/june-2023-updates/
@manandre the reason why I haven't checked in is because I've been running more stress tests and hit multiple problems with that - I hit some 8.0 problems (not in GC) when I tried to run stress on 8.0, so I ran this on 7.0 and I did hit another issue in the vicinity, so I'd like to make a fix for that as well. Should have it this week.
@Maoni0 The next release for version 7.0 is planned on March 14th. Any chance these fixes will be part of it?
We've been running with the patched libclrgc.so for just over a week now and it's still running with no issues - the process would hang every ~2 days with the default runtime.
@Maoni0 You mentioned fixes will be backported to 7.0. Do you have any idea when the release will come out? Many thanks for the support.
So it's still going strong with 7.0.323.56101 and no unresponsiveness or exceptions (see below), whereas our instances without the patch have gone down several times in this period.
We restarted it once to deploy the second patch, but we're again at 3+ days of stability with the new patch, so this looks very promising - amazing job @Maoni0!
I take it that this hotfix is going to take a while before it makes the main .NET 7.0 runtimes and gets deployed to all the App Services? In the interim should we consider this patch suitable for production deployment, or should we consider rolling back to .NET 6.0 for the time being?
Health Check Metrics
Red - .NET 7.0 version deployed
Green - clrgc.dll patch 7.0.323.56100 deployed
Dark Green - clrgc.dll patch 7.0.323.56101 deployed
Wait… they give you guys weekends? I've got to talk to my agent.
I just looked at our guinea pig instance over here and it is still rumbling along, no crashes or infinite loops since the latest install. But I don't think our replication cycle was as reliable as @Simon-Gregory-LG's. Maybe it will hit something this weekend.
@Maoni0 no problem!
Lots going on today so I forgot to post the confirmation, so here it is - all up and running and ready to enjoy the weekend!
I believe I know what the bug is. The problem is in process_remaining_regions: when we go through the remaining pinned plugs, for the region that contains the very last pinned plug we just set its plan_gen_num to 0 (meaning we put this region in gen0 if we are doing a compacting GC). But this should have called decide_on_demotion_pin_surv instead (like the previous remaining regions with pinned plugs), which decides whether we should put the region in gen0 based on how big its pinned survival is. Normally this would not cause a functional problem (it could cause a perf problem), but because there was that giant pinned plug, it does not fit when it is unpinned and we call allocate_in_condemned_generations on it. So to make this bug show up you need the following conditions to all be true:
- during a previous GC, when we called allocate_in_condemned_generations on the non pinned plugs that are next to some pinned plugs, we discovered we couldn't move the non pinned plugs anyway, so we artificially pinned them and formed a giant pinned plug; during this GC those objects were no longer pinned, so we have one giant non pinned plug;
- the region containing that plug had not been processed by allocate_in_condemned_generations yet, so it was processed by process_remaining_regions.
So that's very hard to hit. I'm making a fix. I need to test it before giving it to you, so ETA tomorrow.
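For readers who want a concrete picture, here is a heavily simplified sketch of the difference being described. The types, the threshold, and the helper bodies are invented for illustration; this is not the actual gc.cpp source.

```cpp
#include <cstddef>

// Invented, minimal stand-in for a heap region during the plan phase.
struct region_sketch
{
    int    plan_gen_num;  // generation this region is planned into
    size_t pinned_surv;   // bytes surviving in pinned plugs in this region
};

// Stand-in for decide_on_demotion_pin_surv: demote the region to gen0 only
// when its pinned survival is large enough to justify it.
void decide_on_demotion_pin_surv_sketch(region_sketch& r, size_t threshold)
{
    r.plan_gen_num = (r.pinned_surv >= threshold) ? 0 : 2;
}

// The region holding the very last pinned plug was effectively handled like
// the "original" branch below; the fix routes it through the same heuristic
// that the other remaining regions with pinned plugs already use.
void plan_last_pinned_region(region_sketch& r, bool fixed, size_t threshold)
{
    if (fixed)
        decide_on_demotion_pin_surv_sketch(r, threshold);
    else
        r.plan_gen_num = 0;  // unconditionally put the region in gen0
}
```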
No worries. I know I have baked my noodle on various post-mortem analyses and it can be sometimes fun and sometimes frustrating (I like to think that may be what being a detective is like). I've wasted a not insignificant portion of my life chasing down various threading bugs that have been gross (although perhaps not anywhere near as complex as the GC stuff). Nothing quite like the feeling when you finally get it all figured out though!
No worries on the details, I can "sort of" follow the gist of it (and certainly as much as I need to). It all does sound quite interesting though!
I'll try to get the clrgc.dll added in whenever you get it over to us, no rush.
thank you! I'm downloading the dump now and will take a look tonight.
Okay I have made additional changes based on @cshung's recommendation to try to return this to how it was when it was happening more reliably.
Still no reproductions, but hopefully we will catch one soon.
Oh yeah we could do that I think.
I only mention the clrgc.dll bit above to help me remember and in case others see this type of issue in the future.
Anyway, just let me know and we can try to help however.
I took a quick look at the dump and hypothesize that it may be an infinite loop because we forgot about the possibility of pinning.
Roughly speaking, the allocate_in_condemned_generations method does something along this line:
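Very roughly, with invented names and drastically simplified state rather than the actual gc.cpp code, the shape being described is something like this:

```cpp
#include <cstddef>
#include <cstdint>

// Minimal state for the sketch: an allocation window plus a flag saying
// whether a pinned plug is currently in the way.
struct ctx_sketch
{
    uintptr_t alloc_ptr;
    uintptr_t alloc_limit;
    bool      pin_in_the_way;
};

// "deal with it" for a pin: consume the pinned plug so it no longer blocks us.
void deal_with_pin(ctx_sketch& c)                { c.pin_in_the_way = false; }

// "deal with it" for space: set up a window that is supposed to fit `size`.
void deal_with_space(ctx_sketch& c, size_t size) { c.alloc_limit = c.alloc_ptr + size; }

bool fits(const ctx_sketch& c, size_t size)      { return c.alloc_limit - c.alloc_ptr >= size; }

// Each failure branch is supposed to fix the reason for the failure and retry,
// so the loop is expected to finish within ~3 iterations (one retry for a pin,
// one for space, then success). Nothing else can break out of the loop.
uintptr_t allocate_in_condemned_generations_sketch(ctx_sketch& c, size_t size)
{
    for (;;)  // corresponds to the "retry" label in the real code
    {
        if (c.pin_in_the_way) { deal_with_pin(c);         continue; }
        if (!fits(c, size))   { deal_with_space(c, size); continue; }
        uintptr_t result = c.alloc_ptr;
        c.alloc_ptr += size;
        return result;
    }
}
```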
The key idea is that each "deal with it" step should deal with the situation so that we do not keep running into a pin or keep running out of space; as such, the loop is meant to run for at most 3 iterations (1 for a pin, 1 for space, the last to succeed). There are no synchronization primitives involved in this loop, so this is pretty much a case of an infinite loop.
There are no more pins in the pin queue (i.e. gc_heap::mark_stack_bos == gc_heap::mark_stack_tos, cf. gc_heap::pinned_plug_que_empty_p), therefore the problem has something to do with space.
Looking at the implementation of gc_heap::size_fit_p, I understand why we get stuck running out of space. We have a just-fit allocation context.
Note that 0x000001d272c00020 + 0x3fffe0 == 0x000001d273000000, so the allocation context should fit the allocation exactly - probably prepared that way by the "deal with it" code for space, and that is why it is so exact.
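A quick sanity check on that arithmetic, using the values quoted from the dump:

```cpp
// alloc_ptr + size lands exactly on alloc_limit: the allocation context fits
// the allocation with zero bytes to spare.
static_assert(0x000001d272c00020ULL + 0x3fffe0ULL == 0x000001d273000000ULL,
              "just-fit allocation context: alloc_ptr + size == alloc_limit");
```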
old_loc != nullptr, so the code insize_fit_pinsists on requiring a bit more memory (i.e.+ Align(min_object_size)), so that check will fail.We are probably stuck in that situation and that is why the loop isnāt breaking. What baffles me is that the
old_locis already considered in thedeal with itsection for running out of space by being passed as an argument togrow_heap_segment, not sure why we arenāt processing that correctly.