runtime: Segmentation fault on arm32 (raspberry-pi3)

From @SteveL-MSFT on August 29, 2017 22:25

After building PowerShell with the linux-arm runtime, it runs until it hits a second ManualResetEvent::WaitOne() call and then crashes with a segmentation fault. Stack trace from gdb:

Thread 23 "powershell" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x694e1450 (LWP 11108)]
0x76692ecc in VirtualCallStubManager::predictStubKind(unsigned int) () from /home/pi/powershell/libcoreclr.so
(gdb) backtrace
#0  0x76692ecc in VirtualCallStubManager::predictStubKind(unsigned int) () from /home/pi/powershell/libcoreclr.so
#1  0x766981d6 in VirtualCallStubManager::getStubKind(unsigned int) () from /home/pi/powershell/libcoreclr.so
#2  0x766951b4 in VirtualCallStubManager::FindStubManager(unsigned int, VirtualCallStubManager::StubKind*) ()
   from /home/pi/powershell/libcoreclr.so
#3  0x7669698e in VSD_ResolveWorker () from /home/pi/powershell/libcoreclr.so
#4  0x7673cb30 in ResolveWorkerAsmStub () from /home/pi/powershell/libcoreclr.so
#5  0x687ca346 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

Copied from original issue: dotnet/corefx#23660

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 18 (15 by maintainers)

Most upvoted comments

Fixed by dotnet/coreclr#13922

I’ve debugged the issue and it is a codegen issue. ResolveWorkerAsmStub expects register R4 to hold the indirection cell address combined with two flag bits, but it gets the address of an argument shuffling thunk instead. The managed frame (frame #5 in the stack trace in the issue description above) belongs to the following function:

DomainNeutralILStubClass.IL_STUB_SecureDelegate_Invoke(System.__Canon, System.__Canon, System.__Canon, System.__Canon, System.__Canon)
=> 0xa87e9a24:  push    {r2, r3, r4, lr}
   0xa87e9a26:  ldr.w   lr, [sp, #16]
   0xa87e9a2a:  str.w   lr, [sp]
   0xa87e9a2e:  ldr.w   lr, [sp, #20]
   0xa87e9a32:  str.w   lr, [sp, #4]
   0xa87e9a36:  ldr     r0, [r0, #20]
   0xa87e9a38:  add.w   r4, r0, #16
   0xa87e9a3c:  ldr     r4, [r0, #12]
   0xa87e9a3e:  ldr     r0, [r0, #4]
   0xa87e9a40:  blx     r4
   0xa87e9a42:  pop     {r2, r3, r4, pc}

This function calls an argument shuffling thunk via the blx r4. The thunk’s code is below:

=> 0xb5b062b0:  push    {r4, r5, r6, lr}
   0xb5b062b2:  ldr.w   r12, [r0, #16]
   0xb5b062b6:  addw    r4, sp, #16
   0xb5b062ba:  addw    r5, sp, #16
   0xb5b062be:  mov     r0, r1
   0xb5b062c0:  mov     r1, r2
   0xb5b062c2:  mov     r2, r3
   0xb5b062c4:  ldr.w   r3, [r4], #4
   0xb5b062c8:  ldr.w   r6, [r4], #4
   0xb5b062cc:  str.w   r6, [r5], #4
   0xb5b062d0:  str.w   r12, [sp, #12]
   0xb5b062d4:  pop     {r4, r5, r6, pc}

This thunk replaces the LR saved by the initial push with the value loaded from [R0+16], so the pop at the end jumps to the following piece of code:

=> 0xb59b9f10:  ldr.w   r12, [pc, #8]   ; 0xb59b9f1c
   0xb59b9f14:  ldr.w   pc, [pc]        ; 0xb59b9f18

The values at the two literal addresses, 0xb59b9f18 and 0xb59b9f1c, are as follows:

(gdb) x/2dx 0xb59b9f18
0xb59b9f18:     0xb66f2ced      0x0000000c
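
The two literal addresses in the disassembly comments can be checked by hand. The sketch below assumes the usual Thumb rule that a PC-relative load reads PC as the instruction address plus 4, aligned down to a word boundary. The helper name `literal_address` is made up for illustration and is not part of any tool.

```python
def literal_address(instr_addr: int, imm: int) -> int:
    """Address read by a Thumb `ldr Rt, [pc, #imm]` located at instr_addr."""
    pc = instr_addr + 4        # in Thumb state, PC reads as instruction address + 4
    return (pc & ~0x3) + imm   # the base is PC aligned down to 4 bytes


# ldr.w r12, [pc, #8] at 0xb59b9f10
r12_src = literal_address(0xB59B9F10, 8)
# ldr.w pc, [pc] at 0xb59b9f14
pc_src = literal_address(0xB59B9F14, 0)

print(hex(r12_src), hex(pc_src))
```

Under that assumption the loads read 0xb59b9f1c (into R12) and 0xb59b9f18 (into PC), which matches the addresses gdb printed next to the instructions.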

So this piece of code jumps to 0xb66f2ced, which is the ResolveWorkerAsmStub asm helper. And now we are coming to the culprit. As I’ve already said, this asm helper expects R4 to contain the indirection cell address. But as you can see, the argument shuffling thunk didn’t touch R4 and so we get the R4 that came from the DomainNeutralILStubClass.IL_STUB_SecureDelegate_Invoke. And as you can see, R4 was used to jump to the argument shuffling thunk so it contains its address.
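
To see why R4 survives the thunk unchanged, here is a rough model of just its push, LR overwrite, and pop. It is a simplified sketch, not an ARM emulator: the argument moves are left out, the Thumb bit is ignored, and the object address 0x1000 and the function name `run_shuffle_thunk` are hypothetical. The code addresses are the ones from the listings above.

```python
def run_shuffle_thunk(regs, mem):
    """Model the stack handling of the shuffle thunk (argument moves omitted)."""
    regs, mem = dict(regs), dict(mem)
    # push {r4, r5, r6, lr}
    regs["sp"] -= 16
    for i, name in enumerate(("r4", "r5", "r6", "lr")):
        mem[regs["sp"] + 4 * i] = regs[name]
    # ldr.w r12, [r0, #16]
    regs["r12"] = mem[regs["r0"] + 16]
    # addw r4, sp, #16  (R4 is only a scratch pointer inside the thunk)
    regs["r4"] = regs["sp"] + 16
    # str.w r12, [sp, #12]  (overwrites the saved LR slot)
    mem[regs["sp"] + 12] = regs["r12"]
    # pop {r4, r5, r6, pc}
    for i, name in enumerate(("r4", "r5", "r6", "pc")):
        regs[name] = mem[regs["sp"] + 4 * i]
    regs["sp"] += 16
    return regs


THUNK = 0xB5B062B0   # thunk entry, left in R4 by `blx r4`
STUB = 0xB59B9F10    # code that the final pop jumps to
entry = {"r0": 0x1000, "r4": THUNK, "r5": 0, "r6": 0,
         "lr": 0xA87E9A42, "sp": 0x8000, "r12": 0, "pc": THUNK}
memory = {0x1000 + 16: STUB}   # hypothetical object whose field at +16 holds STUB

out = run_shuffle_thunk(entry, memory)
print(hex(out["pc"]), hex(out["r4"]))
```

In this model the pop sends PC to the stub while restoring R4 to whatever the caller left there, which is the thunk's own address rather than an indirection cell.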

So I believe this is a JIT codegen bug. In the generated code of DomainNeutralILStubClass.IL_STUB_SecureDelegate_Invoke, the indirection cell address is loaded into R4 at 0xa87e9a38, but the very next instruction overwrites it with the address that the blx branches to a bit later.
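
The clobber can be replayed with a few lines of Python. This is only an illustration of the four instructions from 0xa87e9a36 to 0xa87e9a3e: the object addresses and memory contents are hypothetical, and `run_il_stub_tail` is a made-up name.

```python
def run_il_stub_tail(r0, mem):
    """Replay the four instructions before `blx r4` in the IL stub."""
    r0 = mem[r0 + 20]      # ldr   r0, [r0, #20]
    r4 = r0 + 16           # add.w r4, r0, #16   (indirection cell address)
    cell = r4              # remember what R4 held at this point
    r4 = mem[r0 + 12]      # ldr   r4, [r0, #12] (call target, clobbers R4)
    r0 = mem[r0 + 4]       # ldr   r0, [r0, #4]
    return cell, r4, r0


THUNK = 0xB5B062B0                 # thunk address from the listing above
OUTER, INNER, TARGET = 0x2000, 0x3000, 0x4000   # hypothetical objects
memory = {
    OUTER + 20: INNER,             # hypothetical field values
    INNER + 12: THUNK,
    INNER + 4: TARGET,
}

cell, r4_at_call, r0_at_call = run_il_stub_tail(OUTER, memory)
print(hex(cell), hex(r4_at_call))
```

With this layout, R4 holds the thunk address at the `blx`, and the cell address computed one instruction earlier (0x3010 here) is lost before anything can use it.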

Thanks for opening this in the right repo 😃

@mi-hol I am just building coreclr with a fix so that I can test it with PowerShell on my RPi3. I will probably send out a PR with the fix later today.

Also, R4 is loaded as EA_PTRSIZE in the line above. Instead, it should be loaded as EA_BYREF.

Got @janvorli working