runtime: Access violations around unmanaged LoaderAllocator when using dynamic (emitted with AssemblyBuilder) collectible assemblies
After migrating from .NET Core 2.2 to 3.1 we started hitting a .NET runtime crash during our test runs:
The active test run was aborted. Reason: Test host process crashed : Fatal error. Internal CLR error. (0x80131506)
The problem also occurs with .NET Core 3.0, both on Windows and Linux.
The crash is not 100% reproducible (the tests have to be rerun multiple times to hit it), and unfortunately we were not able to narrow the problem down to a small repro. However, we have several crash dumps showing different aspects of the problem.
We’re using AssemblyBuilder to emit collectible (AssemblyBuilderAccess.RunAndCollect) assemblies dynamically. Under some circumstances, an attempt to use a type or object from such an assembly produces an access violation (reading memory near zero). With AssemblyBuilderAccess.Run instead of RunAndCollect the issue does not reproduce, so it looks related to the collectible-assembly support introduced in .NET Core 3.0.
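For reference, here is a minimal sketch of the emit pattern we use (the assembly, module, and type names below are illustrative, not taken from our actual code):

using System;
using System.Reflection;
using System.Reflection.Emit;

class CollectibleEmitSketch
{
    static Type EmitType(AssemblyBuilderAccess access)
    {
        var asmName = new AssemblyName("DynamicFormatters");
        // RunAndCollect makes the emitted assembly collectible; with
        // AssemblyBuilderAccess.Run the crash does not reproduce.
        AssemblyBuilder asmBuilder = AssemblyBuilder.DefineDynamicAssembly(asmName, access);
        ModuleBuilder modBuilder = asmBuilder.DefineDynamicModule("MainModule");
        TypeBuilder typeBuilder = modBuilder.DefineType("GeneratedFormatter", TypeAttributes.Public);
        return typeBuilder.CreateTypeInfo();
    }

    static void Main()
    {
        Type t = EmitType(AssemblyBuilderAccess.RunAndCollect);
        // Reflecting over the emitted type (e.g. looking up its methods) is the
        // kind of call that intermittently crashed with an access violation.
        Console.WriteLine(t.GetMethods(BindingFlags.Public | BindingFlags.Instance).Length);
    }
}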
There are several crash dumps collected on Windows that we can share. Below are some thoughts from the analysis performed so far.
1. coreclr!LoaderAllocator::GetHandleValueFastPhase2
Dump: https://drive.google.com/file/d/1bZ80x7fQ_kkc3mus9KI1BT8Qi1fiNLpq/view?usp=sharing
Managed stack trace (only the .NET part is copied):
WinDbg output
0000001D68C3E1A8 00007fff9e7d6ab5 [HelperMethodFrame_PROTECTOBJ: 0000001d68c3e1a8] System.Signature.GetSignature(Void*, Int32, System.RuntimeFieldHandleInternal, System.IRuntimeMethodInfo, System.RuntimeType)
0000001D68C3E2E0 00007fff400c727e System.Reflection.RuntimeMethodInfo.FetchNonReturnParameters() [/_/src/System.Private.CoreLib/src/System/Reflection/RuntimeMethodInfo.cs @ 115]
0000001D68C3E330 00007fff4007be10 System.Reflection.RuntimeMethodInfo.GetParametersNoCopy() [/_/src/System.Private.CoreLib/src/System/Reflection/RuntimeMethodInfo.cs @ 357]
0000001D68C3E360 00007fff40079b5e System.RuntimeType.FilterApplyMethodBase(System.Reflection.MethodBase, System.Reflection.BindingFlags, System.Reflection.BindingFlags, System.Reflection.CallingConventions, System.Type[]) [/_/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 2236]
0000001D68C3E3C0 00007fff400c7e4b System.RuntimeType.GetMethodCandidates(System.String, Int32, System.Reflection.BindingFlags, System.Reflection.CallingConventions, System.Type[], Boolean) [/_/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 2455]
0000001D68C3E470 00007fff400ca4b2 System.RuntimeType.GetMethodImplCommon(System.String, Int32, System.Reflection.BindingFlags, System.Reflection.Binder, System.Reflection.CallingConventions, System.Type[], System.Reflection.ParameterModifier[]) [/_/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 2750]
0000001D68C3E500 00007fff4007b488 System.RuntimeType.GetMethodImpl(System.String, System.Reflection.BindingFlags, System.Reflection.Binder, System.Reflection.CallingConventions, System.Type[], System.Reflection.ParameterModifier[]) [/_/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 2736]
On the unmanaged side this crash was produced by the following line (handleTable is null): https://github.com/dotnet/runtime/blob/6d39d5df3cc94cc7204ac187bc84da992dbe9d77/src/coreclr/src/vm/loaderallocator.inl#L135
By redoing some address arithmetic starting from mov rax, qword ptr [rdi+18h], it can be determined that the LoaderAllocatorObject is at 000001d8`1b940568:
WinDbg output
00007fff`9e7d6a86 488b4718 mov rax, qword ptr [rdi+18h]
00007fff`9e7d6a8a 488b4830 mov rcx, qword ptr [rax+30h]
00007fff`9e7d6a8e 488b4140 mov rax, qword ptr [rcx+40h]
00007fff`9e7d6a92 4885d2 test rdx, rdx
00007fff`9e7d6a95 0f846c24ecff je coreclr!TypeHandle::GetManagedClassObject+0x3f (00007fff`9e698f07)
00007fff`9e7d6a9b 488b8040020000 mov rax, qword ptr [rax+240h]
00007fff`9e7d6aa2 488b00 mov rax, qword ptr [rax]
00007fff`9e7d6aa5 4885c0 test rax, rax
00007fff`9e7d6aa8 0f845924ecff je coreclr!TypeHandle::GetManagedClassObject+0x3f (00007fff`9e698f07)
00007fff`9e7d6aae 488b4010 mov rax, qword ptr [rax+10h]
00007fff`9e7d6ab2 48d1ea shr rdx, 1
ACCESS VIOLATION -> 00007fff`9e7d6ab5 488b5cd008 mov rbx, qword ptr [rax+rdx*8+8] ds:00000000`00000230=????????????????
0:159> dt coreclr!LoaderAllocatorObject 000001d8`1b940568
+0x000 m_pMethTab : 0x00007fff`3ed1dc60 MethodTable
+0x008 m_pLoaderAllocatorScout : 0x000001d8`1b19f198 LoaderAllocatorScoutObject
+0x010 m_pSlots : (null)
+0x018 m_slotsUsed : 0n0
+0x020 m_methodInstantiationsTable : (null)
Going further into m_pLoaderAllocatorScout:
0:159> dt coreclr!LoaderAllocatorScoutObject 0x1d81b19f198
+0x000 m_pMethTab : 0x00007fff`3ec144c0 MethodTable
+0x008 m_nativeLoaderAllocator : 0x000001d8`1afa2c08 LoaderAllocator
and then into m_nativeLoaderAllocator, where the memory looks a bit strange:
WinDbg output
0:159> dt coreclr!LoaderAllocator 0x000001d8`1afa2c08
. . .
+0x288 m_fGCPressure : 58 -- boolean?
+0x289 m_fUnloaded : 15 -- boolean?
+0x28a m_fTerminated : 74 -- boolean?
+0x28b m_fMarked : 16 -- boolean?
+0x28c m_nGCCount : 0n472
+0x290 m_IsCollectible : 0
. . .
Judging from the CLR stack, this dynamic assembly does not look like one that should have been unloaded. Here is the AssemblyBuilder that owns the type on which the crash occurred:
0:159> !gcroot 000001d81afa46c8
Thread 1748:
0000001D68C3E5E0 00007FFF3F3D3452 Nezaboodka.ClientTypeBinder..ctor(Nezaboodka.ClientTypeSystem, System.Collections.Generic.IEnumerable`1<System.Type>, System.Collections.Generic.IEnumerable`1<System.Type>, Nezaboodka.Nzon.NzonSettings)
rdi:
-> 000001D81AFA46A8 Nezaboodka.Nzon.FormattersGenerator
-> 000001D81AFA46C8 System.Reflection.Emit.AssemblyBuilder
2. coreclr!MethodTable::GetLoaderAllocatorObjectForGC
Dump: https://drive.google.com/file/d/1RSjdjAJ0TTjpH5QmcmGIheoHCk-il2al/view?usp=sharing
The line that produced the crash (it looks like it tries to dereference an invalid handle returned from GetLoaderAllocatorObjectHandle()): https://github.com/dotnet/runtime/blob/6d39d5df3cc94cc7204ac187bc84da992dbe9d77/src/coreclr/src/vm/methodtable.cpp#L9889
There is an interesting unmanaged stack trace here showing GC and JIT parts:
WinDbg output
00 0000002d`379bccf8 00007ff9`19ba335c coreclr!MethodTable::GetLoaderAllocatorObjectForGC+0x1e [f:\workspace\_work\1\s\src\vm\methodtable.cpp @ 10161]
01 (Inline Function) --------`-------- coreclr!GCToEEInterface::GetLoaderAllocatorObjectForGC+0xc [f:\workspace\_work\1\s\src\vm\gcenv.ee.cpp @ 359]
02 0000002d`379bcd00 00007ff9`19a4760e coreclr!WKS::gc_heap::mark_through_cards_for_segments+0x1628ec [f:\workspace\_work\1\s\src\gc\gc.cpp @ 29019]
03 0000002d`379bce50 00007ff9`19a43402 coreclr!WKS::gc_heap::mark_phase+0x9be [f:\workspace\_work\1\s\src\gc\gc.cpp @ 20795]
04 0000002d`379be1c0 00007ff9`19a44a5c coreclr!WKS::gc_heap::gc1+0xb2 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 16346]
05 (Inline Function) --------`-------- coreclr!GCToOSInterface::GetLowPrecisionTimeStamp+0x5 [f:\workspace\_work\1\s\src\vm\gcenv.os.cpp @ 946]
06 0000002d`379be290 00007ff9`19a65dfc coreclr!WKS::gc_heap::garbage_collect+0x25c [f:\workspace\_work\1\s\src\gc\gc.cpp @ 17976]
07 0000002d`379be320 00007ff9`19ad41fe coreclr!WKS::GCHeap::GarbageCollectGeneration+0x1fc [f:\workspace\_work\1\s\src\gc\gc.cpp @ 36597]
08 (Inline Function) --------`-------- coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x12 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13832]
09 0000002d`379be370 00007ff9`19ad6d58 coreclr!WKS::gc_heap::try_allocate_more_space+0x24e [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13934]
0a (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0x11 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14369]
0b (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x58 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14400]
0c 0000002d`379be3f0 00007ff9`19a7c46e coreclr!WKS::GCHeap::Alloc+0x88 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 35827]
0d (Inline Function) --------`-------- coreclr!Alloc+0x18b [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 240]
0e (Inline Function) --------`-------- coreclr!AllocateObject+0x22d [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 1209]
0f 0000002d`379be420 00007ff8`baf6819f coreclr!JIT_New+0x31e [f:\workspace\_work\1\s\src\vm\jithelpers.cpp @ 2724]
The LoaderAllocator address seems to be 00007ff8`ba020a88 (and its boolean flags section looks suspicious too). m_hLoaderAllocatorObjectHandle has the value 0x00000000`00000080, and dereferencing it produces the crash:
WinDbg output
00007ff9`19cae933 488b4830 mov rcx, qword ptr [rax+30h]
00007ff9`19cae937 488b4140 mov rax, qword ptr [rcx+40h]
00007ff9`19cae93b 488b8040020000 mov rax, qword ptr [rax+240h]
ACCESS VIOLATION -> 00007ff9`19cae942 488b00 mov rax, qword ptr [rax] ds:00000000`00000080=????????????????
0:165> dt coreclr!LoaderAllocator 00007ff8`ba020a88
. . .
+0x240 m_hLoaderAllocatorObjectHandle : 0x00000000`00000080 OBJECTHANDLE__
. . .
+0x288 m_fGCPressure : 0
+0x289 m_fUnloaded : 0
+0x28a m_fTerminated : 16
+0x28b m_fMarked : 1
+0x28c m_nGCCount : 0n1
+0x290 m_IsCollectible : ffffffffffffffe8
. . .
3. coreclr!LoaderAllocatorObject::GetHandleTable
Dump: https://drive.google.com/file/d/1n8p-JeOCrUS5qjBkOyiuQmP16kJYfAR4/view?usp=sharing
The crash is produced by the following line (loaderAllocator, of type LoaderAllocatorObject, is null): https://github.com/dotnet/runtime/blob/6d39d5df3cc94cc7204ac187bc84da992dbe9d77/src/coreclr/src/vm/loaderallocator.cpp#L936
WinDbg output
00007ffc`54d2b8e8 488b8640020000 mov rax, qword ptr [rsi+240h]
00007ffc`54d2b8ef 488b08 mov rcx, qword ptr [rax]
00007ffc`54d2b8f2 488b4110 mov rax, qword ptr [rcx+10h] ds:00000000`00000010=????????????????
0:020> ? rsi
Evaluate expression: 1846619047024 = 000001ad`f3128470
0:020> dt coreclr!LoaderAllocator 000001ad`f3128470
. . .
+0x240 m_hLoaderAllocatorObjectHandle : 0x000001ad`ee457a10 OBJECTHANDLE__
. . .
+0x288 m_fGCPressure : 1
+0x289 m_fUnloaded : 0
+0x28a m_fTerminated : 0
+0x28b m_fMarked : 0
+0x28c m_nGCCount : 0n32763
+0x290 m_IsCollectible : 1
. . .
The strangest thing here is that this allocator is still referenced from a managed LoaderAllocatorScout. By dumping all LoaderAllocatorScout objects from the managed heap and searching their m_nativeLoaderAllocator fields for the address 000001ad`f3128470 (the unmanaged LoaderAllocator), it can be determined that the LoaderAllocatorScout is present in the managed heap and has not been collected. However, the handle from unmanaged to managed (m_hLoaderAllocatorObjectHandle) is wrong.
WinDbg output
0:020> .foreach ($obj {!dumpheap -type LoaderAllocatorScout -short}){.echo $obj; !do $obj}
. . .
000001ad847a5dc8
Name: System.Reflection.LoaderAllocatorScout
MethodTable: 00007ffbf59dfb10
EEClass: 00007ffbf59e39a8
Size: 24(0x18) bytes
File: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.1.1\System.Private.CoreLib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ffbf516ecc0 4000ec9 8 System.IntPtr 1 instance 000001ADF3128470 m_nativeLoaderAllocator
. . .
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 42 (30 by maintainers)
Commits related to this issue
- Fix allocation of RuntimeTypeCache GC handle When there is a race calling RuntimeType.InitializeCache, each of the racing threads creates a new GC handle using new RuntimeTypeHandle(this).GetGCHandle... — committed to janvorli/runtime by janvorli 4 years ago
- Fix allocation of RuntimeTypeCache GC handle (#33243) When there is a race calling RuntimeType.InitializeCache, each of the racing threads creates a new GC handle using new RuntimeTypeHandle(this).... — committed to dotnet/runtime by janvorli 4 years ago
- Port to 3.1: Fix allocation of RuntimeTypeCache GC handle When there is a race calling RuntimeType.InitializeCache, each of the racing threads creates a new GC handle using new RuntimeTypeHandle(this... — committed to janvorli/coreclr by janvorli 4 years ago
- Port to 3.1: Fix allocation of RuntimeTypeCache GC handle (#28025) * Port to 3.1: Fix allocation of RuntimeTypeCache GC handle When there is a race calling RuntimeType.InitializeCache, each of the... — committed to dotnet/coreclr by janvorli 4 years ago
@vanashimko it got approved for the 3.1.4 release.
I also wanted to thank you a lot for your cooperation in investigating this nasty issue. I merged the proper fix to the master branch yesterday (PR #33243) and I am going to create a PR to port it to 3.1 today, and I hope to get it approved. I am not sure when the next release slot for 3.1 is, though; I’ll let you know once I have more details.
So thanks to the last dump, I’ve found what causes the double registration of the handle. RuntimeType.InitializeCache uses the Interlocked.CompareExchange pattern to ensure the cache handle gets created just once even if multiple threads are racing for it. It calls RuntimeTypeHandle.GetGCHandle (which registers the handle) and then tries to set m_cache to that value using Interlocked.CompareExchange. If m_cache was already set, it destroys the just-created handle again, as that means some other thread won the initialization race. And the destruction is the problem: it only destroys the handle, but does not remove it from the cleanup list on the AssemblyLoaderAllocator. Some later call to RuntimeTypeHandle.GetGCHandle gets the same handle and adds it to the cleanup list again. And when AssemblyLoaderAllocator::CleanupHandles() is called at loader allocator destruction time, we call DestroyHandle on it twice, as it appears on that list twice.
Here is the piece of code I am talking about: https://github.com/dotnet/coreclr/blob/v3.1.1/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs#L2413-L2421
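To make the pattern concrete, here is a small standalone sketch of the same CompareExchange-based one-time handle initialization. Plain GCHandle.Alloc/Free is used only to keep it self-contained; the real code goes through RuntimeTypeHandle.GetGCHandle, which also puts the handle on the AssemblyLoaderAllocator cleanup list.

using System;
using System.Runtime.InteropServices;
using System.Threading;

// Standalone illustration of the racy one-time initialization pattern used by
// RuntimeType.InitializeCache. Not the actual CoreCLR source.
class CachedHandleHolder
{
    private IntPtr _cache; // IntPtr.Zero until the first successful initialization

    public IntPtr GetOrCreateHandle(object target)
    {
        if (_cache == IntPtr.Zero)
        {
            IntPtr newHandle = GCHandle.ToIntPtr(
                GCHandle.Alloc(target, GCHandleType.WeakTrackResurrection));
            IntPtr existing = Interlocked.CompareExchange(ref _cache, newHandle, IntPtr.Zero);
            if (existing != IntPtr.Zero)
            {
                // This thread lost the race, so it frees the handle it just created.
                // In CoreCLR this free (GCHandle.InternalFree) destroys the handle
                // but does NOT remove it from the loader allocator's cleanup list,
                // which is what leads to the later double DestroyHandle.
                GCHandle.FromIntPtr(newHandle).Free();
            }
        }
        return _cache;
    }
}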
Now I need to figure out how to fix that properly. It seems that we need to make sure that handles obtained by RuntimeTypeHandle.GetGCHandle are never freed by GCHandle.InternalFree, which doesn’t know anything about the loader allocator. I am not sure yet what the right fix for that would be. Adding a RuntimeTypeHandle.FreeGCHandle and a method on AssemblyLoaderAllocator to remove an entry from the cleanup list would be one way.
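Just to sketch the idea (this is only an illustration; a managed HashSet stands in for the native cleanup list, and none of these names are an actual CoreCLR API):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Illustration of the fix idea: freeing a handle must also unregister it from
// the cleanup list, so the list can never destroy the same handle twice.
class HandleCleanupList
{
    private readonly HashSet<IntPtr> _pending = new HashSet<IntPtr>();
    private readonly object _lock = new object();

    // Stand-in for RuntimeTypeHandle.GetGCHandle: allocate and register.
    public IntPtr AllocRegistered(object target)
    {
        IntPtr handle = GCHandle.ToIntPtr(GCHandle.Alloc(target, GCHandleType.WeakTrackResurrection));
        lock (_lock) { _pending.Add(handle); }
        return handle;
    }

    // Stand-in for the proposed RuntimeTypeHandle.FreeGCHandle:
    // destroy the handle AND drop it from the cleanup list.
    public void FreeRegistered(IntPtr handle)
    {
        lock (_lock) { _pending.Remove(handle); }
        GCHandle.FromIntPtr(handle).Free();
    }

    // Stand-in for AssemblyLoaderAllocator::CleanupHandles().
    public void CleanupAll()
    {
        lock (_lock)
        {
            foreach (IntPtr handle in _pending)
                GCHandle.FromIntPtr(handle).Free();
            _pending.Clear();
        }
    }
}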
@vanashimko to verify that this fixes the issue you are hitting completely, I’ll create a patched version of System.Private.CoreLib.dll with the call to GCHandle.InternalFree removed from the InitializeCache method. It would just waste one or a few GC handles, but the problem should be gone, as the handles on the cleanup list would never be freed anywhere other than from the cleanup list.
Great, that moved my understanding of the problem further. What happens is that the LoaderAllocator ends up being on the cleanup list twice. This is the relevant part of the stress log that shows that:
I am going to add some more instrumentation to catch the spot when the double-adding happens.
Sure, you can take a look at it here. The changes are just adding more stress log events and a hashtable to keep track of allocated handles. They are meant only to debug the issue and are a bit hacky. https://github.com/janvorli/coreclr/pull/new/loader-allocator-handle-debugging
This is the real stack trace:
Unfortunately it shows a path where the cleanup is expected to happen, so the problem seems to be that something else destroyed the handle incorrectly before that. Another possibility could be a race between multiple threads running AssemblyLoaderAllocator::CleanupHandles and popping the same handle from the cleanup list twice, but I don’t think that is the case, given that in the stress log the calls to DestroyHandle were quite far apart. I’ll need to figure out how to add some new stress log events / instrumentation to the runtime to find out more details. I’ll share another patched version of coreclr with you once I have it.
Most of the method names you are seeing on the stack trace are not correct, as I have not shared the PDB file for coreclr.dll. Let me open your dump locally and see.
Ok, great, thank you! I’ll try to get you something on Monday.
@davidwrighton has also told me today that he has recently seen a similar bug caused by double free of a GC handle that later results in double allocation of the same handle. So I’ve looked at the stress log again and I can see exactly that. Extending the list of relevant items from the stress log that I’ve shown before:
As you can see, at 19.530733500 and 20.132816500 the same handle was destroyed twice. The handle was pointing to a LoaderAllocator at the first call to destroy and to nothing at the second call, so it seems we have an issue with double-freeing the LoaderAllocator handle in some cases in the runtime.
I’ll keep digging into it to pinpoint the case when it happens.
@vanashimko I’ll be OOF next week, so I won’t be able to look into it more before I get back. If you get more dumps with the stress log enabled and stack traces similar to cases 2 and 3, please share them so that we can get more data.
Ok, thank you for your answer, I’ll keep digging in the dumps.
The interesting thing is that it is not a LoaderAllocator (called LoaderAllocatorObject on the native side), so what you considered to be a LoaderAllocatorScout was something else too, and the same was true for the m_nativeLoaderAllocator:
But the MethodTable at frame 3 points to that address as if it were a LoaderAllocator instance (you can find the MethodTable address from frame 4: this->m_asMT):
My guess is that the real LoaderAllocator was somehow collected and a new object of type System.RuntimeType+RuntimeTypeCache got allocated at the same address. Obviously, it should not have been collected, as the MethodTable of an existing object still references it. But getting all objects from the heap with this MethodTable shows:
So the object of that type is still referenced by a register in one of the stack frames of the current thread (frame 0b), yet the LoaderAllocator was unloaded as if there were no more references to it. That seems to indicate that during a single GC collection the object was not found to be referenced anywhere and later resurfaced in the rbx register in that frame, or that the GC somehow didn’t call MethodTable::GetLoaderAllocatorObjectForGC for this object. I keep looking; the other failure modes are also quite strange.