runtime: Unexpected SegFault after running for a long time

My application hits a strange segfault without any managed exception. It always occurs after several hours of running. I suspect the issue lies in the CLR.

Analyzing the core dump with dotnet-dump shows:

Loading core dump: CoreDump ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
> clrstack
Failed to request Module data from assembly.
OS Thread Id: 0x18f2 (0)
        Child SP               IP Call Site
00007F41D1FF64B8 00007f42d7889b05 [InlinedCallFrame: 00007f41d1ff64b8]
> pe
There is no current managed exception on this thread

Please help. The core dump file has been uploaded to http://104.207.146.131:8000/crash/log/CoreDump

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 44 (44 by maintainers)

Most upvoted comments

Sorry this took a while to get to. I see this is running on top of this version of the runtime: @(#)Version 4.700.19.46205 @Commit: 922429db0144dd6f3b4324805464dae82857512a. This is 3.0.0. Is it possible to test the behavior under a 3.1 runtime (say, 3.1.3)? 3.0 is no longer supported, and any fix that needs to be done on our side would not make it into 3.0. A lot of fixes have also gone in since then.
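The version marker quoted above appears to be a plain string embedded in the runtime binary, so it can be recovered from the dump (or from libcoreclr.so, for example with `strings`). As a minimal sketch, assuming the text has already been extracted and keeps exactly the `@(#)Version ... @Commit: ...` layout shown here, the two fields can be pulled out as follows. `RuntimeBuildInfo` and `parse_build_marker` are hypothetical names, and mapping the file version to the product version (3.0.0) still has to be done separately, e.g. from the commit.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: fields extracted from an "@(#)Version ... @Commit: ..." marker.
struct RuntimeBuildInfo {
    std::string file_version;  // e.g. "4.700.19.46205"
    std::string commit;        // full commit hash
};

// ASSUMPTION: the marker has exactly the layout quoted in this thread.
std::optional<RuntimeBuildInfo> parse_build_marker(std::string_view text) {
    constexpr std::string_view kVersionTag = "@(#)Version ";
    constexpr std::string_view kCommitTag = " @Commit: ";

    const auto v = text.find(kVersionTag);
    if (v == std::string_view::npos) return std::nullopt;
    const auto version_begin = v + kVersionTag.size();

    const auto c = text.find(kCommitTag, version_begin);
    if (c == std::string_view::npos) return std::nullopt;
    const auto commit_begin = c + kCommitTag.size();

    // The commit hash runs until the next whitespace character or the end of the text.
    auto commit_end = text.find_first_of(" \t\r\n", commit_begin);
    if (commit_end == std::string_view::npos) commit_end = text.size();

    return RuntimeBuildInfo{
        std::string(text.substr(version_begin, c - version_begin)),
        std::string(text.substr(commit_begin, commit_end - commit_begin))};
}
```

Fed the marker quoted above, this yields the file version `4.700.19.46205` and the commit `922429db0144dd6f3b4324805464dae82857512a`; text without a marker yields `std::nullopt`.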

Back to the dump: what I can see so far is that something within BitcoinAdv.Arbitrage.dll calls one of the TryStartNoGCRegion overloads, which ends up calling StartNoGCRegion in the runtime. The call stack then looks something like this before the SIGSEGV handler is invoked and the dump is created:

Relevant native callstack
libpthread.so.0`___lldb_unnamed_symbol1$$libpthread.so.0 + 1
libcoreclr.so`WKS::gc_heap::mark_object_simple1(unsigned char*, unsigned char*) [inlined] WKS::CObjectHeader::GetMethodTable() const at gc.cpp:4033:65
libcoreclr.so`WKS::gc_heap::mark_object_simple1(unsigned char*, unsigned char*) [inlined] WKS::my_get_size(Object*) at gc.cpp:9490
libcoreclr.so`WKS::gc_heap::mark_object_simple1(oo=<unavailable>, start=<unavailable>) at gc.cpp:18754
libcoreclr.so`WKS::gc_heap::mark_object_simple(po=<unavailable>) at gc.cpp:19297:17
libcoreclr.so`WKS::GCHeap::Promote(ppObject=0x00007ffe53d15010, sc=<unavailable>, flags=0) at gc.cpp:35237:9
libcoreclr.so`GcInfoDecoder::EnumerateLiveSlots(this=<unavailable>, pRD=<unavailable>, reportScratchSlots=<unavailable>, inputFlags=<unavailable>, pCallBack=<unavailable>, hCallBack=<unavailable>)(void*, Object**, unsigned int), void*) at gcinfodecoder.cpp:947:21
libcoreclr.so`EECodeManager::EnumGcRefs(this=<unavailable>, pRD=0x00007f461b7f7fb0, pCodeInfo=0x00007f461b7f7e38, flags=<unavailable>, pCallBack=(libcoreclr.so`GcEnumObject(void*, Object**, unsigned int) at gcenv.ee.common.cpp:148), hCallBack=0x00007f461b7f90b8, relOffsetOverride=<unavailable>)(void*, Object**, unsigned int), void*, unsigned int) at eetwain.cpp:5140:24
libcoreclr.so`GcStackCrawlCallBack(pCF=0x00007f461b7f7c00, pData=0x00007f461b7f90b8) at gcenv.ee.common.cpp:283:18
libcoreclr.so`Thread::MakeStackwalkerCallback(this=<unavailable>, pCF=0x00007f461b7f7c00, pCallback=(libcoreclr.so`GcStackCrawlCallBack(CrawlFrame*, void*) at gcenv.ee.common.cpp:201), pData=0x00007f461b7f90b8)(CrawlFrame*, void*), void*) at stackwalk.cpp:880:27
libcoreclr.so`Thread::StackWalkFramesEx(this=<unavailable>, pRD=<unavailable>, pCallback=<unavailable>, pData=<unavailable>, flags=34048, pStartFrame=0x0000000000000000)(CrawlFrame*, void*), void*, unsigned int, Frame*) at stackwalk.cpp:960:26
libcoreclr.so`Thread::StackWalkFrames(this=0x0000000001b8a850, pCallback=(libcoreclr.so`GcStackCrawlCallBack(CrawlFrame*, void*) at gcenv.ee.common.cpp:201), pData=0x00007f461b7f90b8, flags=34048, pStartFrame=0x0000000000000000)(CrawlFrame*, void*), void*, unsigned int, Frame*) at stackwalk.cpp:1043:12
libcoreclr.so`ScanStackRoots(pThread=0x0000000001b8a850, fn=(libcoreclr.so`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:35173), sc=<unavailable>)(Object**, ScanContext*, unsigned int), ScanContext*) at gcenv.ee.cpp:148:18
libcoreclr.so`GCToEEInterface::GcScanRoots(fn=<unavailable>, condemned=<unavailable>, max_gen=<unavailable>, sc=<unavailable>)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) at gcenv.ee.cpp:177:13
libcoreclr.so`WKS::gc_heap::mark_phase(condemned_gen_number=2, mark_only_p=NO) at gc.cpp:20698:9
libcoreclr.so`WKS::gc_heap::gc1() at gc.cpp:16325:13
libcoreclr.so`WKS::gc_heap::garbage_collect(n=<unavailable>) at gc.cpp:17954:9
libcoreclr.so`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=2, reason=reason_induced) at gc.cpp:36577:13
libcoreclr.so`WKS::GCHeap::GarbageCollect(int, bool, int) [inlined] WKS::GCHeap::GarbageCollectTry(generation=<unavailable>) at gc.cpp:36083:12
libcoreclr.so`WKS::GCHeap::GarbageCollect(this=<unavailable>, generation=<unavailable>, low_memory_p=<unavailable>, mode=<unavailable>) at gc.cpp:36017
libcoreclr.so`WKS::GCHeap::StartNoGCRegion(this=0x0000000001b26130, totalSize=<unavailable>, lohSizeKnown=<unavailable>, lohSize=<unavailable>, disallowFullBlockingGC=false) at gc.cpp:36958:9
libcoreclr.so`GCInterface::StartNoGCRegion(totalSize=104857600, lohSizeKnown=NO, lohSize=0, disallowFullBlockingGC=NO) at comutilnative.cpp:1067:44
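Read bottom to top, the frames above form a simple chain: in this dump, entering the no-GC region first ran an induced full (gen 2) collection, and the crash happens inside that collection's mark phase, not in the no-GC region itself. The toy model below only mirrors that order; its function names echo the frames but are hypothetical, and it is not the runtime's code. It also checks that `totalSize=104857600` is exactly 100 MiB.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy model of the call order in the native stack above.
// Names echo the frames; this is NOT the runtime's implementation.
constexpr int kMaxGeneration = 2;  // gen=2 / condemned_gen_number=2 in the frames
constexpr std::uint64_t kMiB = 1024ull * 1024ull;
static_assert(104857600ull == 100 * kMiB, "totalSize in the frame is exactly 100 MiB");

struct CallTrace {
    std::vector<std::string> frames;  // outermost call first
};

void mark_phase(CallTrace& t, int condemned_gen) {
    // In the runtime, roots are scanned here (GcScanRoots -> ScanStackRoots -> ... -> Promote)
    // and every object found is marked; this is where the crash happens.
    t.frames.push_back("mark_phase(gen " + std::to_string(condemned_gen) + ")");
}

void garbage_collect_generation(CallTrace& t, int gen, const std::string& reason) {
    t.frames.push_back("GarbageCollectGeneration(gen " + std::to_string(gen) + ", " + reason + ")");
    mark_phase(t, gen);
}

// Mirrors only what this particular stack shows: StartNoGCRegion ran an induced
// full collection before the no-GC region was established.
void start_no_gc_region(CallTrace& t, std::uint64_t total_size) {
    assert(total_size > 0);
    t.frames.push_back("StartNoGCRegion(" + std::to_string(total_size / kMiB) + " MiB)");
    garbage_collect_generation(t, kMaxGeneration, "reason_induced");
}
```

Calling `start_no_gc_region(trace, 104857600)` records the three steps in the same order the native frames show them.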

The object that was getting promoted

The object's type token matches `AsyncStateMachineBox` in `AsyncTaskMethodBuilder`.
dumpobj 0x00007f467c16fc98

Name:        <error>
MethodTable: 00007f46a51b9058
EEClass: 00007f46a51a5ac0
Size:        200(0xc8) bytes
File:        /usr/share/dotnet/shared/Microsoft.NETCore.App/3.0.0/System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007f46a1dda0e8  400084e       30         System.Int32  1 instance                0 m_taskId
00007f46a1dde5f0  400084f        8      System.Delegate  0 instance 0000000000000000 m_action
00007f46a135c798  4000850       10        System.Object  0 instance 0000000000000000 m_stateObject
00007f46a433cec0  4000851       18 ...sks.TaskScheduler  0 instance 0000000000000000 m_taskScheduler
00007f46a1dda0e8  4000852       34         System.Int32  1 instance         33555456 m_stateFlags
00007f46a135c798  4000853       20        System.Object  0 instance 00007f467c16fd60 m_continuationObject
00007f46a22f8e98  4000857       28 ...tingentProperties  0 instance 0000000000000000 m_contingentProperties
00007f46a1dda0e8  400084d      81c         System.Int32  1   static                0 s_taskIdCounter
00007f46a135c798  4000854      5d0        System.Object  0   static 00007f467c06ac70 s_taskCompletionSentinel
00007f46a1dd60c8  4000855      820       System.Boolean  1   static                0 s_asyncDebuggingEnabled
0000000000000000  4000856      5d8                       0   static 0000000000000000 s_currentActiveTasks
00007f46a433c740  4000858      5e0 ...Tasks.TaskFactory  0   static 00007f467c06ac88 <Factory>k__BackingField
00007f46a1e70d30  4000859      5e8 ...eading.Tasks.Task  0   static 00007f467c06acb0 <CompletedTask>k__BackingField
00007f46a433b9b8  400085a      5f0 ...g.ContextCallback  0   static 00007f467c06acf0 s_ecCallback
00007f46a1e70d30  400084c       28 ...eading.Tasks.Task  0 TLstatic  t_currentTask
 >> Thread:Value 2903:0000000000000000 291a:00007f467c170880 2920:00007f467c33b4a0 2b40:00007f467c7cd130 38a6:00007f467c856680 3f03:00007f467c89f720 47d5:0000000000000000 47d6:0000000000000000 47e7:0000000000000000 47f6:0000000000000000 47fd:0000000000000000 4812:0000000000000000 4832:0000000000000000 4836:0000000000000000 483b:0000000000000000 4840:0000000000000000 4844:0000000000000000 484b:0000000000000000 4855:0000000000000000 4860:0000000000000000 <<
00007f46a1e71ee8  4000809       38 ...ks.VoidTaskResult  1 instance 00007f467c16fcd0 m_result
00007f46a51c0e30  400080a        8 ...Private.CoreLib]]  0   static dynamic statics NYI                 s_Factory
00007f46a433e028  4000c87       40        System.Action  0 instance 0000000000000000 _moveNextAction
00007f46a1e72db0  4000c88       50      _moveNextAction  1 instance 00007f467c16fce8 StateMachine                                                                                                                                                                                
00007f46a1e73880  4000c89       48 ....ExecutionContext  0 instance 00007f467c169b48 Context                                                                                                                                                                                     
00007f46a433b9b8  4000c86        8 ...g.ContextCallback  0   static dynamic statics NYI                 s_callback
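The `m_stateFlags` value 33555456 is 0x2000400 in hex, which is easier to read as bit flags. The sketch below decodes it; the flag names and values are my assumption, recalled from the `TASK_STATE_*` constants in Task.cs and from `InternalTaskOptions` in System.Private.CoreLib, and should be checked against the 3.0 sources. Under that assumption, the box is a promise-style task that is still waiting for activation, i.e. an async method that had not completed yet.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// ASSUMPTION: names/values recalled from Task.cs (TASK_STATE_*) and
// InternalTaskOptions in System.Private.CoreLib; verify against the 3.0 sources.
struct FlagName {
    std::uint32_t bit;
    const char* name;
};

constexpr FlagName kTaskFlags[] = {
    {0x00000400, "PromiseTask"},           // InternalTaskOptions.PromiseTask
    {0x00010000, "Started"},               // TASK_STATE_STARTED
    {0x00200000, "Faulted"},               // TASK_STATE_FAULTED
    {0x00400000, "Canceled"},              // TASK_STATE_CANCELED
    {0x01000000, "RanToCompletion"},       // TASK_STATE_RAN_TO_COMPLETION
    {0x02000000, "WaitingForActivation"},  // TASK_STATE_WAITINGFORACTIVATION
};

// m_stateFlags from the dumpobj output above, written in hex.
static_assert(33555456u == 0x02000400u, "m_stateFlags in hex");

// Returns the names of the known flags set in m_stateFlags, lowest bit first.
std::vector<std::string> describe_state_flags(std::uint32_t flags) {
    std::vector<std::string> names;
    for (const FlagName& f : kTaskFlags) {
        if (flags & f.bit) names.emplace_back(f.name);
    }
    return names;
}
```

For 33555456 this reports `PromiseTask` and `WaitingForActivation`, and none of the completion bits (`RanToCompletion`, `Faulted`, `Canceled`).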

We start marking the object. Looking at the Object* directly, I see that the method table pointer stored in it is 0x00007f46a51b9059, which differs by one from the value SOS reports (00007f46a51b9058). Running dumpmt on either value gives the same output:

> dumpmt 00007f46a51b9059
EEClass: 00007F46A51A5AC0
Module: 00007F46A1354020
Name: <error>
mdToken: 00000000020003F5
File: /usr/share/dotnet/shared/Microsoft.NETCore.App/3.0.0/System.Private.CoreLib.dll
BaseSize: 0xc8
ComponentSize: 0x0
Slots in VTable: 21
Number of IFaces in IFaceMap: 3
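The off-by-one method table is most likely the GC mark bit. As I read gc.cpp (an assumption, not verified against the 3.0 build), the GC marks an object by setting bit 0 of its MethodTable pointer, and `GetMethodTable()` masks that bit off, which would explain why the raw object holds 0x00007f46a51b9059 while SOS reports 00007f46a51b9058. The same dumpmt output also gives mdToken 0x020003F5; per ECMA-335, the high byte of a metadata token selects the metadata table (0x02 = TypeDef) and the low 24 bits are the row number. A small sketch of both decodings, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION (my reading of gc.cpp): while marking, the GC sets bit 0 of an
// object's MethodTable pointer, and GetMethodTable() masks it off before use.
constexpr std::uint64_t kGcMarkedBit = 0x1;

constexpr bool is_marked(std::uint64_t raw_mt) { return (raw_mt & kGcMarkedBit) != 0; }
constexpr std::uint64_t strip_mark_bit(std::uint64_t raw_mt) { return raw_mt & ~kGcMarkedBit; }

// ECMA-335 metadata token layout: high byte = table, low 24 bits = row id (RID).
constexpr std::uint32_t kTypeDefTable = 0x02;
constexpr std::uint32_t token_table(std::uint32_t token) { return token >> 24; }
constexpr std::uint32_t token_rid(std::uint32_t token) { return token & 0x00FFFFFFu; }

// Values copied from the dumpobj/dumpmt output above.
static_assert(strip_mark_bit(0x00007f46a51b9059ull) == 0x00007f46a51b9058ull, "raw MT vs SOS MT");
static_assert(token_table(0x020003F5u) == kTypeDefTable, "mdToken is a TypeDef");
```

Decoded this way, the token is TypeDef row 0x3F5 (1013) in System.Private.CoreLib, and a set low bit on the raw method table pointer is consistent with the object having already been marked by this GC.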

After that, there’s a frame in libpthread (libpthread.so.0`___lldb_unnamed_symbol1$$libpthread.so.0 + 1) that can’t be read. The next frame is the SIGSEGV handler, whose context points at:

[Screenshot of the faulting context; image not preserved]

It looks like all that inlining was tricking the debugger. The faulting call comes from code inlined here: https://github.com/dotnet/coreclr/blob/4b5ae70e341bad3c9f25d33cfee58d2bb93d3db7/src/gc/gc.cpp#L18770, which reads the method table’s flags to check whether the object contains pointers. I can’t see how this could fail unless the MethodTable or Object pointers I’m looking at are simply bogus. Performing this check manually on the faulting thread’s object reports that it has no pointers.
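To make the failure mode concrete: for every object it reaches, the marker has to dereference that object’s method table to decide whether there are references to follow, so a bogus MethodTable or Object pointer faults right at that load. The toy below is a hypothetical, simplified stand-in for mark_object_simple1 (an explicit mark stack, the mark bit kept in the low bit of the method table pointer, and a `contains_pointers` field in place of the real flags check); it is not the runtime’s implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a MethodTable: only the "does this type contain references?" bit.
// alignas(8) keeps bit 0 of its address free for use as the mark bit.
struct alignas(8) ToyMethodTable {
    bool contains_pointers;
};

struct ToyObject {
    std::uintptr_t raw_mt;           // ToyMethodTable address; bit 0 doubles as the mark bit
    std::vector<ToyObject*> fields;  // outgoing references (nullptr allowed)
};

constexpr std::uintptr_t kMarked = 0x1;

const ToyMethodTable* method_table(const ToyObject* o) {
    // Like GetMethodTable(): mask off the mark bit before dereferencing.
    return reinterpret_cast<const ToyMethodTable*>(o->raw_mt & ~kMarked);
}

// Hypothetical, simplified mark loop using an explicit mark stack instead of
// recursion. Returns how many objects it newly marked.
std::size_t mark_from(ToyObject* root) {
    std::size_t newly_marked = 0;
    std::vector<ToyObject*> mark_stack{root};
    while (!mark_stack.empty()) {
        ToyObject* o = mark_stack.back();
        mark_stack.pop_back();
        if (o == nullptr || (o->raw_mt & kMarked) != 0) continue;  // null or already marked
        o->raw_mt |= kMarked;
        ++newly_marked;
        // The kind of load that faulted in the dump: reading flags through the
        // method table. With a bogus raw_mt this dereferences garbage.
        if (!method_table(o)->contains_pointers) continue;
        for (ToyObject* child : o->fields) mark_stack.push_back(child);
    }
    return newly_marked;
}
```

With valid method tables, everything reachable is marked exactly once, and objects whose type reports no pointers are marked without their fields being scanned. Pointing `raw_mt` at garbage would make the `contains_pointers` read undefined behavior in this toy, which corresponds to the SIGSEGV seen in the runtime.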