runtime: Segmentation fault on Linux using Release configuration started after upgrading to .NET Core 5

Description

I’m investigating an issue that started in a proprietary application after upgrading to .NET Core 5. The issue only occurs on Linux (linux-x64) with the Release configuration; on Windows, both Debug and Release work without issue, and Debug appears to work without issue on Linux as well. I’ve traced the issue back to a method similar to the following:

public Result MyMethod(IItem item, Options options) {
    // Options is a fairly complex struct: it has constants, fields, properties, and methods.
    if (item != null) {
        // Doing anything with item here causes the segmentation fault.
        // However, the fault is intermittent: it does not happen every time this method is called.
    }
}

Strangely, when Options is changed from a struct to a class, the issue is resolved. The nature of this issue makes me think it is a .NET Core 5 bug (rather than a bug in the code), but I am not certain how to investigate it further. I’ve tried recreating the issue in a separate project but haven’t had any luck yet. If anyone has advice on how to get to the root cause, please let me know. It almost seems like there is some underlying memory issue (as if options is somehow stomping on item), but I’m not familiar with debugging such issues in .NET.

Configuration

The Linux environment I am using to investigate this is Ubuntu 20.04 on WSL2 (note that the issue was originally seen in Docker containers in another environment). This is the output of dotnet --info on Ubuntu:

.NET SDK (reflecting any global.json):
 Version:   5.0.200
 Commit:    70b3e65d53

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  20.04
 OS Platform: Linux
 RID:         ubuntu.20.04-x64
 Base Path:   /usr/share/dotnet/sdk/5.0.200/

Host (useful for support):
  Version: 5.0.3
  Commit:  eae88cc11b

.NET SDKs installed:
  5.0.200 [/usr/share/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 5.0.3 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 5.0.3 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET runtimes or SDKs:
  https://aka.ms/dotnet-download

A Windows 10 (version 1909 build 18363.1440) machine is being used to run the dotnet publish command to create the output that is being tested on Linux. This is the dotnet --info output on Windows:

.NET SDK (reflecting any global.json):
 Version:   5.0.103
 Commit:    72dec52dbd

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.18363
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\5.0.103\

Host (useful for support):
  Version: 5.0.3
  Commit:  c636bbdc8a

.NET SDKs installed:
  2.2.104 [C:\Program Files\dotnet\sdk]
  3.1.300 [C:\Program Files\dotnet\sdk]
  5.0.103 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.All 2.1.25 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.25 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 5.0.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.25 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.2.2 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 5.0.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.4 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 3.1.12 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 5.0.3 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

To install additional .NET runtimes or SDKs:
  https://aka.ms/dotnet-download

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 23 (14 by maintainers)

Most upvoted comments

@jeffrimko after more analysis I think this is indeed an instance of #49078, which will be fixed in 5.0.6, which should be out in a month or so.

I sent you an email about a possible workaround in the meantime.

@CallumDev I trimmed down the source code of LibreLancer.Fx.FxBasicAppearance::Draw and created a simple repro in #49780 that hits assert(!foundDiff) in LSRA that I believe reflects an issue you are seeing (although I can’t say this for sure since the checked JIT asserts earlier than where it was segfaulting with release runtime).

I opened a separate issue #49780 and assigned it to @sandreenko since we believe it’s related to HFA handling on arm64 and Sergey has more expertise in the area.

@jeffrimko I also don’t think the issue you are seeing is the same as the one that @CallumDev reported since the latter is arm64 only and not intermittent.

@dotnet/jit-contrib Can someone take a look at the failure on linux-x64?

@CallumDev I was able to identify the method where the crash occurs LibreLancer.Fx.FxBasicAppearance::Draw.

With the release runtime, when the method is PMI-ed, the JIT crashes with SIGSEGV, with the same symptoms as you reported:

(lldb) run
Prepone for LibreLancer method 2894
PREPONE type# 460 method# 2894 LibreLancer.Fx.FxBasicAppearance::Draw
Process 16484 stopped
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV: invalid address (fault address: 0x29)
    frame #0: 0x0000007faf98a66c libclrjit.so`LinearScan::processBlockStartLocations(this=0x00000055557bc6d8, currentBlock=0x000000555578f4d0) at lsra.cpp:5092:60

Process 16484 launched: '/home/robox/echesako/Runtime_49489/dotnet-sdk-5.0.103-linux-arm64/dotnet' (aarch64)
(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV: invalid address (fault address: 0x29)
  * frame #0: 0x0000007faf98a66c libclrjit.so`LinearScan::processBlockStartLocations(this=0x00000055557bc6d8, currentBlock=0x000000555578f4d0) at lsra.cpp:5092:60
    frame #1: 0x0000007faf984ca0 libclrjit.so`LinearScan::allocateRegisters() [inlined] LinearScan::processBlockEndAllocation(this=0x00000055557bc6d8, currentBlock=<unavailable>) at lsra.cpp:4622:9
    frame #2: 0x0000007faf984c58 libclrjit.so`LinearScan::allocateRegisters(this=0x00000055557bc6d8) at lsra.cpp:5504
    frame #3: 0x0000007faf984514 libclrjit.so`LinearScan::doLinearScan(this=0x00000055557bc6d8) at lsra.cpp:1277:5
    frame #4: 0x0000007faf8ef550 libclrjit.so`ActionPhase<Compiler::compCompile(void**, unsigned int*, JitFlags*)::$_11>::DoPhase() [inlined] Compiler::compCompile(void**, unsigned int*, JitFlags*)::$_11::operator()() const at compiler.cpp:4930:54
    frame #5: 0x0000007faf8ef544 libclrjit.so`ActionPhase<Compiler::compCompile(void**, unsigned int*, JitFlags*)::$_11>::DoPhase(this=<unavailable>) at phase.h:64
    frame #6: 0x0000007faf9c3438 libclrjit.so`Phase::Run(this=0x0000007fffffcfe0) at phase.cpp:61:26
    frame #7: 0x0000007faf8ed160 libclrjit.so`Compiler::compCompile(void**, unsigned int*, JitFlags*) [inlined] void DoPhase<Compiler::compCompile(void**, unsigned int*, JitFlags*)::$_11>(_compiler=<unavailable>, _phase=<unavailable>, _action=<unavailable>)::$_11) at phase.h:78:11
    frame #8: 0x0000007faf8ed138 libclrjit.so`Compiler::compCompile(this=0x000000555578c6a8, methodCodePtr=0x0000007fffffd468, methodCodeSize=0x0000007fffffd64c, compileFlags=<unavailable>) at compiler.cpp:4931
    frame #9: 0x0000007faf8edf28 libclrjit.so`Compiler::compCompileHelper(this=0x000000555578c6a8, classPtr=<unavailable>, compHnd=<unavailable>, methodInfo=0x0000007fffffd688, methodCodePtr=0x0000007fffffd468, methodCodeSize=0x0000007fffffd64c, compileFlags=0x0000007fffffd480) at compiler.cpp:6128:5
    frame #10: 0x0000007faf8ed970 libclrjit.so`Compiler::compCompile(CORINFO_MODULE_STRUCT_*, void**, unsigned int*, JitFlags*) at compiler.cpp:5467:28
    frame #11: 0x0000007faf8ed958 libclrjit.so`Compiler::compCompile(this=0x000000555578c6a8, classPtr=0x0000007f3e8ed768, methodCodePtr=0x0000007fffffd468, methodCodeSize=0x0000007fffffd64c, compileFlags=0x0000007fffffd480) at compiler.cpp:5486
    frame #12: 0x0000007faf8ee8ac libclrjit.so`jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, JitFlags*, void*) at compiler.cpp:6770:45
    frame #13: 0x0000007faf8ee7f8 libclrjit.so`jitNativeCode(CORINFO_METHOD_STRUCT_*, CORINFO_MODULE_STRUCT_*, ICorJitInfo*, CORINFO_METHOD_INFO*, void**, unsigned int*, JitFlags*, void*) at compiler.cpp:6795
    frame #14: 0x0000007faf8ee7f4 libclrjit.so`jitNativeCode(methodHnd=0x0000007f3eaf7cf0, classPtr=0x0000007f3e8ed768, compHnd=0x0000007fffffd7c8, methodInfo=0x0000007fffffd688, methodCodePtr=0x0000007fffffd468, methodCodeSize=0x0000007fffffd64c, compileFlags=0x0000007fffffd480, inlineInfoPtr=0x0000000000000000) at compiler.cpp:6797
    frame #15: 0x0000007faf8f27e4 libclrjit.so`CILJit::compileMethod(this=<unavailable>, compHnd=0x0000007fffffd7c8, methodInfo=0x0000007fffffd688, flags=<unavailable>, entryAddress=0x0000007fffffd798, nativeSizeOfCode=<unavailable>) at ee_il_dll.cpp:273:14

With the checked runtime, it aborts earlier with an assertion at https://github.com/dotnet/runtime/blob/c636bbdc8a2d393d07c0e9407a4f8923ba1a21cb/src/coreclr/src/jit/lsra.cpp#L2262

Prepone for LibreLancer method 2894
PREPONE type# 460 method# 2894 LibreLancer.Fx.FxBasicAppearance::Draw

Assert failure(PID 16138 [0x00003f0a], Thread: 16138 [0x3f0a]): Assertion failed '!foundDiff' in 'LibreLancer.Fx.FxBasicAppearance:Draw(byref,int,float,float,LibreLancer.Fx.NodeReference,LibreLancer.ResourceManager,LibreLancer.Fx.ParticleEffectInstance,byref,float):this' during 'Linear scan register alloc' (IL size 276)

    File: /opt/code/src/coreclr/src/jit/lsra.cpp Line: 2262
    Image: /home/robox/echesako/Runtime_49489/artifacts/bin/testhost/net5.0-Linux-Release-arm64/dotnet

Process 16138 stopped
* thread #1, name = 'dotnet', stop reason = signal SIGTRAP
    frame #0: 0x0000007faf6998dc libclrjit.so`DBG_DebugBreak at debugbreak.S:7

Process 16138 launched: '/home/robox/echesako/Runtime_49489/artifacts/bin/testhost/net5.0-Linux-Release-arm64/dotnet' (aarch64)

The assertion is the same as in https://github.com/dotnet/runtime/issues/38772 but that one was fixed by https://github.com/dotnet/runtime/pull/39452.

I will take a look at the JIT dump and see what is going on.

@CallumDev Based on your call stack and the unique sequence of instructions where SIGSEGV was raised, I believe I was able to identify the corresponding location in the JIT source:

src/coreclr/src/jit/lsra.cpp:5087
                assignPhysReg(targetRegRecord, interval);
   d1618:	mov	x0, x19
   d161c:	str	x3, [sp, #24]
   d1620:	str	x2, [sp, #8]
   d1624:	bl	cfd14 <LinearScan::assignPhysReg(RegRecord*, Interval*)>
   d1628:	adrp	x15, 22d000 <typeinfo for CorUnix::CSynchStateController+0x20>
   d162c:	ldp	x2, x14, [sp, #8]
   d1630:	ldr	x3, [sp, #24]
   d1634:	ldr	x15, [x15, #3496]
   d1638:	ldr	x12, [sp]
   d163c:	mov	w18, #0x42                  	// #66
   d1640:	mov	w17, #0x30                  	// #48
   d1644:	mov	w16, #0x41                  	// #65
   d1648:	mov	w0, #0xc                   	// #12
src/coreclr/src/jit/lsra.cpp:5089
            if (interval->recentRefPosition != nullptr && !interval->recentRefPosition->copyReg &&
   d164c:	ldr	x8, [x2, #8]
   d1650:	cbz	x8, d148c <LinearScan::processBlockStartLocations(BasicBlock*)+0x2f4>
   d1654:	ldrb	w9, [x8, #41]
   d1658:	tbnz	w9, #7, d148c <LinearScan::processBlockStartLocations(BasicBlock*)+0x2f4>
src/coreclr/src/jit/lsra.cpp:5090
                interval->recentRefPosition->registerAssignment != genRegMask(targetReg))
   d165c:	ldr	x9, [x8, #32]
src/coreclr/src/jit/lsra.cpp:5089
            if (interval->recentRefPosition != nullptr && !interval->recentRefPosition->copyReg &&
   d1660:	cmp	x9, x12
   d1664:	b.eq	d148c <LinearScan::processBlockStartLocations(BasicBlock*)+0x2f4>  // b.none
   d1668:	ldr	x8, [x8, #8]
src/coreclr/src/jit/lsra.cpp:5092
                interval->getNextRefPosition()->outOfOrder = true;
-> d166c:	ldurh	w9, [x8, #41]
   d1670:	orr	w9, w9, #0x2000
   d1674:	sturh	w9, [x8, #41]
   d1678:	cbnz	x21, d1454 <LinearScan::processBlockStartLocations(BasicBlock*)+0x2bc>
   d167c:	b	d1490 <LinearScan::processBlockStartLocations(BasicBlock*)+0x2f8>

This is inside LinearScan::processBlockStartLocations(BasicBlock*):

https://github.com/dotnet/runtime/blob/c636bbdc8a2d393d07c0e9407a4f8923ba1a21cb/src/coreclr/src/jit/lsra.cpp#L5087-L5093
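As a reading aid for the three instructions at d166c through d1674 in the listing: they implement the single source statement at lsra.cpp:5092 as a read-modify-write of a 16-bit value at offset 41 from x8. The C++ sketch below models that sequence on a plain byte buffer. The buffer type, the name setBit13AtOffset41, and the explicit little-endian byte order are assumptions made for illustration only; this is not the JIT's RefPosition layout or code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the memory x8 points at; NOT the real RefPosition layout.
using Bytes = std::array<std::uint8_t, 64>;

constexpr std::size_t   kFlagsOffset = 41;     // from: ldurh w9, [x8, #41]
constexpr std::uint16_t kOrMask      = 0x2000; // from: orr   w9, w9, #0x2000

// Models ldurh / orr / sturh, assuming little-endian byte order.
// Hypothetical helper for illustration; not part of the JIT.
void setBit13AtOffset41(Bytes& mem) {
    assert(kFlagsOffset + 1 < mem.size());

    // ldurh: load an unaligned 16-bit value, low byte first.
    std::uint16_t w9 = static_cast<std::uint16_t>(
        mem[kFlagsOffset] | (mem[kFlagsOffset + 1] << 8));

    // orr: set bit 13 of the halfword.
    w9 = static_cast<std::uint16_t>(w9 | kOrMask);

    // sturh: store the halfword back, low byte first.
    mem[kFlagsOffset]     = static_cast<std::uint8_t>(w9 & 0xFF);
    mem[kFlagsOffset + 1] = static_cast<std::uint8_t>(w9 >> 8);
}
```

Under these assumptions, byte 41 is left unchanged and bit 5 of byte 42 is set, because bit 13 of the halfword falls in its high byte. The load at d166c is the first access through x8 after it is reloaded at d1668, which is why the fault is reported on that instruction.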

Based on the symptoms, it seems that interval->getNextRefPosition() returns nullptr.
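The reported fault address is consistent with that reading: the faulting instruction is ldurh w9, [x8, #41], and a null x8 plus offset 41 gives 0x29, the address lldb reported. A small C++ sketch of this arithmetic follows; effectiveAddress, impliedBase, and checkNullHypothesis are hypothetical names used only for illustration and are not part of the JIT.

```cpp
#include <cassert>
#include <cstdint>

// Offset used by the faulting instruction: ldurh w9, [x8, #41]
constexpr std::uint64_t kFaultingLoadOffset = 41;

// Fault address reported by lldb for the release-runtime crash.
constexpr std::uint64_t kReportedFaultAddress = 0x29;

// Effective address of a load of the form [base, #offset].
constexpr std::uint64_t effectiveAddress(std::uint64_t base, std::uint64_t offset) {
    return base + offset;
}

// Base register value implied by a fault address and a known offset.
constexpr std::uint64_t impliedBase(std::uint64_t faultAddress, std::uint64_t offset) {
    return faultAddress - offset;
}

// Checks the hypothesis that x8 (the value loaded for the
// getNextRefPosition() result) was null when the load faulted.
inline void checkNullHypothesis() {
    assert(impliedBase(kReportedFaultAddress, kFaultingLoadOffset) == 0);
    assert(effectiveAddress(0, kFaultingLoadOffset) == kReportedFaultAddress);
}
```

Both checks pass, since 0x29 is 41 in decimal. This only shows that the register held zero at the faulting load; it does not explain why the next RefPosition was missing.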

Let me think about how I can debug this further.

cc @dotnet/jit-contrib