runtime: [ARM] Intermittent segfaults in JIT/Methodical/cctor/misc/threads1_cs_r
One of the possible stack traces that gdb shows after loading a core dump:
$ gdb clr-debug/corerun core
Reading symbols from clr-debug/corerun...done.
[New LWP 22222]
[New LWP 22214]
[New LWP 22216]
[New LWP 22218]
[New LWP 22220]
[New LWP 22215]
[New LWP 22219]
[New LWP 22223]
[New LWP 22221]
[New LWP 22217]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `clr-debug/corerun tests-release/JIT/Methodical/cctor/misc/threads1_cs_r/threads'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000 in ?? ()
[Current thread is 1 (Thread 0xadbff450 (LWP 22222))]
(gdb) bt
#0 0x00000000 in ?? ()
#1 0xb650a34c in DomainLocalModule::GetPrecomputedNonGCStaticsBasePointer (this=0xb1c2a594) at /home/mskvortsov/git/coreclr/src/vm/appdomain.hpp:234
#2 0xb648c48c in CallDescrWorker (pCallDescrData=0xadbfe4c0) at /home/mskvortsov/git/coreclr/src/vm/callhelpers.cpp:135
#3 0xb648c32e in CallDescrWorkerWithHandler (pCallDescrData=0xadbfe4c0, fCriticalCall=0) at /home/mskvortsov/git/coreclr/src/vm/callhelpers.cpp:78
#4 0xb648d374 in MethodDescCallSite::CallTargetWorker (this=0xadbfe62c, pArguments=0xadbfe6a0, pReturnValue=0x0, cbReturnValue=0) at /home/mskvortsov/git/coreclr/src/vm/callhelpers.cpp:645
#5 0xb637b4ee in MethodDescCallSite::Call (this=0xadbfe62c, pArguments=0xadbfe6a0) at /home/mskvortsov/git/coreclr/src/vm/callhelpers.h:433
#6 0xb649a438 in ThreadNative::KickOffThread_Worker (ptr=0xadbfeab8) at /home/mskvortsov/git/coreclr/src/vm/comsynchronizable.cpp:257
#7 0xb644b07c in ManagedThreadBase_DispatchInner (pCallState=0xadbfe93c) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9187
#8 0xb644e8a8 in ManagedThreadBase_DispatchMiddle (pCallState=0xadbfe93c) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9238
#9 0xb644e760 in ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::{lambda(Param*)#1}::operator()(Param*) const (this=0xadbfe884, pParam=0xadbfe8e8) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9476
#10 0xb644e620 in ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const (this=0xadbfe8d0, pArgs=0xadbfe8d8) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9478
#11 0xb644adb0 in ManagedThreadBase_DispatchOuter (pCallState=0xadbfe93c) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9515
#12 0xb644aee6 in ManagedThreadBase_FullTransitionWithAD (pAppDomain=..., pTarget=0xb649a231 <ThreadNative::KickOffThread_Worker(void*)>, args=0xadbfeab8, filterType=ManagedThread) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9536
#13 0xb644ae6e in ManagedThreadBase::KickOff (pAppDomain=..., pTarget=0xb649a231 <ThreadNative::KickOffThread_Worker(void*)>, args=0xadbfeab8) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:9571
#14 0xb649a998 in ThreadNative::KickOffThread (pass=0x1063f0) at /home/mskvortsov/git/coreclr/src/vm/comsynchronizable.cpp:376
#15 0xb6441e1c in Thread::intermediateThreadProc (arg=0xd4ec0) at /home/mskvortsov/git/coreclr/src/vm/threads.cpp:2584
#16 0xb69a2f02 in CorUnix::CPalThread::ThreadEntry (pvParam=0x10c048) at /home/mskvortsov/git/coreclr/src/pal/src/thread/thread.cpp:1749
#17 0xb6f295b4 in start_thread (arg=0x0) at pthread_create.c:335
#18 0xb6d1caac in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 24 (24 by maintainers)
This issue is caused by the kernel problem described in https://www.kayaksoft.com/blog/2016/05/11/random-sigill-on-arm-board-odroid-ux4-with-gdbgdbserver/. The article also describes a kernel patch that mitigates it. I finally got around to patching my Odroid XU4 kernel (the XU4 is based on the Exynos 5422) so that it returns a cache line size of 32 unconditionally for both little and big cores, and I can confirm this fixes the problem. With that patch applied, I could build the managed parts of the coreclr repo and run all the thousands of coreclr managed tests just fine. Without it, I couldn’t even build System.Private.CoreLib.dll.
It seems there is no way to fix this programmatically: on ARM the cache can only be flushed by the kernel, and the bug is in the kernel code itself.
By the way, the ARM documentation has some details on differing cache line sizes in heterogeneous systems: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438e/BABHAEIF.html
It seems the Linux kernel has a fix for the same kind of issue on ARM64 (https://osdn.net/projects/android-x86/scm/git/kernel/commits/116c81f427ff6c5380850963e3fb8798cc821d2b), but not for 32-bit ARM.
With the introduction of the tiered JIT this issue occurs much more frequently and makes CoreCLR almost unusable on ARM32 CPUs with the big.LITTLE architecture (fortunately they are not very popular). Unfortunately, we cannot solve this issue from user space (the way Mono did for ARM64), so we are going to prepare a fix for the Linux kernel similar to the one made for ARM64 (https://github.com/torvalds/linux/commit/116c81f427ff6c5380850963e3fb8798cc821d2b).
The patch from the article helps:
But I’m not sure it will be accepted upstream =)