runtime: segfault error 4 in libcoreclr.so

Description

Process crashes on a regular basis (few times a week).

Raven.Server[19552]: segfault at 7fb366943030 ip 00007fb6d0863a7a sp 00007fb6cdd93ea0 error 4 in libcoreclr.so[7fb6d0794000+2f0000]

We were able to take a coredump and this is what we got from lldb-3.9 on thread 19552:

(lldb) register read
General Purpose Registers:
       rax = 0x00000000012afc00
       rbx = 0x0000000001222da0
       rcx = 0x00007f8a77555720
       rdx = 0x0000000000000001
       rdi = 0x0000000001222da0
       rsi = 0x00007fb6cdd93f28
       rbp = 0x00007fb6cdd93f00
       rsp = 0x00007fb6cdd93ea0
        r8 = 0x0000000000000015
        r9 = 0x00007fb660206e48
       r10 = 0x0000060c183060c1
       r11 = 0x0000000000000000
       r12 = 0x0000000001222da0
       r13 = 0x0000000000000000
       r14 = 0x00007fb366943030
       r15 = 0x0000000000000001
       rip = 0x00007fb6d0863a7a  libcoreclr.so`SVR::gc_heap::mark_object_simple(unsigned char**, int) + 26
    rflags = 0x0000000000010206
        cs = 0x0000000000000033
        fs = 0x0000000000000000
        gs = 0x0000000000000000
        ss = 0x000000000000002b
        ds = 0x0000000000000000
        es = 0x0000000000000000

(lldb) bt
* thread #1: tid = 19552, 0x00007fb6d0863a7a libcoreclr.so`SVR::gc_heap::mark_object_simple(unsigned char**, int) + 26, name = 'Raven.Server', stop reason = signal SIGSEGV
  * frame #0: 0x00007fb6d0863a7a libcoreclr.so`SVR::gc_heap::mark_object_simple(unsigned char**, int) + 26
    frame #1: 0x00007fb6d0866d73 libcoreclr.so`SVR::GCHeap::Promote(Object**, ScanContext*, unsigned int) + 355
    frame #2: 0x00007fb6d078d21e libcoreclr.so`GcInfoDecoder::ReportUntrackedSlots(GcSlotDecoder&, REGDISPLAY*, unsigned int, void (*)(void*, Object**, unsigned int), void*) + 238
    frame #3: 0x00007fb6d078b816 libcoreclr.so`GcInfoDecoder::EnumerateLiveSlots(REGDISPLAY*, bool, unsigned int, void (*)(void*, Object**, unsigned int), void*) + 2598
    frame #4: 0x00007fb6d0618c43 libcoreclr.so`EECodeManager::EnumGcRefs(REGDISPLAY*, EECodeInfo*, unsigned int, void (*)(void*, Object**, unsigned int), void*, unsigned int) + 291
    frame #5: 0x00007fb6d06f43ce libcoreclr.so`GcStackCrawlCallBack(CrawlFrame*, void*) + 590
    frame #6: 0x00007fb6d06835ed libcoreclr.so`Thread::MakeStackwalkerCallback(CrawlFrame*, StackWalkAction (*)(CrawlFrame*, void*), void*) + 157
    frame #7: 0x00007fb6d0683821 libcoreclr.so`Thread::StackWalkFramesEx(REGDISPLAY*, StackWalkAction (*)(CrawlFrame*, void*), void*, unsigned int, Frame*) + 433
    frame #8: 0x00007fb6d0683bfc libcoreclr.so`Thread::StackWalkFrames(StackWalkAction (*)(CrawlFrame*, void*), void*, unsigned int, Frame*) + 204
    frame #9: 0x00007fb6d08b32cc libcoreclr.so`ScanStackRoots(Thread*, void (*)(Object**, ScanContext*, unsigned int), ScanContext*) + 364
    frame #10: 0x00007fb6d08b30f5 libcoreclr.so`GCToEEInterface::GcScanRoots(void (*)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) + 261
    frame #11: 0x00007fb6d085b46f libcoreclr.so`SVR::gc_heap::mark_phase(int, int) + 943
    frame #12: 0x00007fb6d085881d libcoreclr.so`SVR::gc_heap::gc1() + 461
    frame #13: 0x00007fb6d084af43 libcoreclr.so`SVR::gc_heap::garbage_collect(int) + 2659
    frame #14: 0x00007fb6d0849fe2 libcoreclr.so`SVR::gc_heap::gc_thread_function() + 674
    frame #15: 0x00007fb6d0849d36 libcoreclr.so`SVR::gc_heap::gc_thread_stub(void*) + 54
    frame #16: 0x00007fb6d08b5698 libcoreclr.so`(anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char16_t const*)::$_1::__invoke(void*) + 168
    frame #17: 0x00007fb6d09c771d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
    frame #18: 0x00007fb6d20c26db libpthread.so.0`start_thread + 219
    frame #19: 0x00007fb6d12a8a3f libc.so.6`clone + 63
(lldb) setthread 1
(lldb) t 1
(lldb) * thread #1: tid = 19552, 0x00007fb6d0863a7a libcoreclr.so`SVR::gc_heap::mark_object_simple(unsigned char**, int) + 26, name = 'Raven.Server', stop reason = signal SIGSEGV
    frame #0: 0x00007fb6d0863a7a libcoreclr.so`SVR::gc_heap::mark_object_simple(unsigned char**, int) + 26
libcoreclr.so`SVR::gc_heap::mark_object_simple:
->  0x7fb6d0863a7a <+26>: movq   (%r14), %rcx
    0x7fb6d0863a7d <+29>: movq   %rcx, %rax
    0x7fb6d0863a80 <+32>: orq    $0x1, %rax
    0x7fb6d0863a84 <+36>: movq   %rax, (%r14)

We’ve run dotnet-dump analyze verifyheap command and it did not detect any corruption:

No heap corruption detected.
<END_COMMAND_OUTPUT>

Calculated the failed method offset 0x00007fb6d0863a7a - 0x7fb6d0794000 = 0xCFA7A and did addr2line:

$ addr2line -e libcoreclr.so -fCi 0xCFA7A
NibbleWriter::WriteEncodedU32(unsigned int)
??:?

Configuration

Which version of .NET is the code running on? 3.1.7
What OS and version, and what distro if applicable? Ubuntu 18.04
What is the architecture (x64, x86, ARM, ARM64)? x64
Do you know whether it is specific to that configuration? no

Other information

About this issue

Original URL
State: closed
Created 4 years ago
Reactions: 5
Comments: 93 (66 by maintainers)

Most upvoted comments

I now believe this is really what has happened. I’ve created a simple test app with a method with a large frame and stepped through the code on Linux x64, stopping in the middle of the probing code. The clrstack sos command wasn’t able to unwind the stack correctly. Since the same code that clrstack uses is used for all kinds of unwinding of managed code on Unix, this confirms that it is a problem. In the current issue, I believe what happens is that the thread is suspended for hijacking when it is executing the probing code… We find that we are not at a GC safe location, so we need to hijack the return address. To do that, we unwind to the caller frame to find the return address location. The unwinder thinks that the SP points to the last pushed register (see rbx in my previous comment), so it derives the return address slot from that. However, the SP is already 0x4000 bytes lower due to the probing and so instead of patching the real return address slot, we patch a location in the middle of locals of that method. To do that, we save the value at that location (it would be the real return address without this issue) and put an address of OnHijackTripThread there. The thread suspension then waits for a while for the thread to reach that function, but it doesn’t happen since we haven’t really put it into the return address slot. And the current method code executes for longer than the time the hijacking waits for the thread to reach the OnHijackTripThread. So the thread is unhijacked, the original value being written back into the slot that we’ve thought is a return address slot. And that causes the corruption. The value we have read from there in the probing loop was a random garbage from memory, as the locals area was not cleared at that point yet. The thread later suspends itself when it ends up executing the gc_heap::try_allocate_more_space, the runtime suspension completes and GC starts walking threads. When it walks the thread with our problematic method, it fetches the garbage value from a location in the method locals, tries to report it as a GC reference and crashes.

janvorli on Nov 4, 2020

@gregolsky the fix was approved for the 3.1.12 release that should be released in about a month.

janvorli on Jan 14, 2021

Could you send us this file please?

Of course, I am sorry I haven’t shared it before. Here is the link to get it: https://1drv.ms/u/s!AkLV4wRkyHYhyiaEm1tZKVuZyBDd?e=rd5T4n

You’ll need to ungzip it.

janvorli on Nov 26, 2020

OK, thank you. I’ll build and share the binary today.

janvorli on Nov 10, 2020

Given that the issue is now known. Are there any work arounds? If we’ll upgrade to 5.0? Can we expect this to go away?

ayende on Nov 6, 2020

I’ve just got another idea on the possible cause of the problem. I have noticed that the method was tier-jitted. I wonder if the return address location was somehow obtained for the original non-optimized version and then used for the tier-jitted one. That could explain why it would point into the middle of a method’s locals. Can you please get me the output of the following commands? This will be huge, so it would be best to share it as a gist.

clru 0x00007f980e1d4d80

I also wonder - is it possible that you have COMPlus_TC_OnStackReplacement env variable set?

janvorli on Nov 3, 2020