scripthookvdotnet: Usage of fibers causes random .NET runtime crashes

Version: All

Description

In its main loop (ScriptMain), SHVDN uses fibers and also calls into unmanaged code that is implemented using fibers. .NET does not support this, and it will inevitably lead to random crashes. Most notably, it makes the .NET exception handler in CLRVectoredExceptionHandler assume that there is no stack space left. The Microsoft docs state: “The .NET threading model does not support fibers. You should not call into any unmanaged function that is implemented by using fibers. Such calls may result in a crash of the .NET runtime.” With my limited knowledge of the project, it is unclear to me why a fiber-based approach was chosen here. It looks like a grave design mistake, but I did not study the entire project.

Crash Details

A fiber is not its own thread, but it behaves like one in certain ways: in particular, it has its own stack space. This means that a single thread will have different lower and upper stack bounds depending on whether a fiber is currently running on it. Whenever an exception occurs in .NET, the crash handler in CLRVectoredExceptionHandler gets called. (I am referencing the initial coreclr commit here, as it is closer to what .NET Framework uses than the current .NET Core implementation, which does not use stack probing at all and hence is not affected.) The call to Thread::IsStackSpaceAvailable returns false when running on a fiber stack, because its internal call to GetLastNormalStackAddress uses a cached stack limit. Naturally, this check should use the fiber’s stack bounds, but due to the caching it does not. While .NET Framework definitely does not handle this ideally (and it is fixed in .NET Core), the issue remains that fibers are not supported.

Unfortunately, this means that any exception, whether it is a C++ exception as mentioned in #936 or a .NET exception, will cause the runtime to panic and call DontCallDirectlyForceStackOverflow, subsequently terminating the process. Please note that this crash does not occur on every machine and appears to happen at random. Since I had access to a user machine where it happened on every single exception, it was very easy to debug and pinpoint.
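To make the mechanism above concrete, here is a simplified, self-contained C++ model. It is not the CLR source. `StackBounds`, `CachedStackChecker`, and all addresses are hypothetical, and the only assumption carried over from the description is that the check compares the current stack pointer against a lower limit that was cached for the thread's original stack.

```cpp
#include <cstdint>

// Hypothetical description of a stack: it grows downward from `base` to `limit`.
struct StackBounds {
    std::uint64_t limit; // lowest usable address
    std::uint64_t base;  // highest address (unused by the check, kept for clarity)
};

// Hypothetical stand-in for a runtime that captures the stack bounds once
// and never refreshes them when execution moves to a fiber stack.
class CachedStackChecker {
public:
    explicit CachedStackChecker(StackBounds cached) : cached_(cached) {}

    // Returns true if at least `needed` bytes lie between the stack pointer
    // `sp` and the cached lower limit. A stack pointer below the cached limit
    // is treated as "no space left", which is what goes wrong for a fiber
    // stack located below the original thread stack.
    bool is_stack_space_available(std::uint64_t sp, std::uint64_t needed) const {
        return sp >= cached_.limit && sp - cached_.limit >= needed;
    }

private:
    StackBounds cached_;
};
```

With a thread stack cached at 0x70000000 and a fiber stack at 0x10000000, a stack pointer in the middle of the fiber stack fails the check even though plenty of fiber stack remains. The same check passes once the checker is built from the fiber's own bounds.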

Example stack trace where the offending line is a NullReferenceException in .NET wrapped by try-catch (the catch block is not hit, for the reasons outlined above):

0:000> !dumpstack
OS Thread Id: 0x882c (0)
Current frame: clr!DontCallDirectlyForceStackOverflow+0x10
Child-SP RetAddr Caller, Callee
00000011620f31f0 00007ffc7bbe9a11 clr!CLRVectoredExceptionHandler+0xa8, calling clr!DontCallDirectlyForceStackOverflow
00000011620f3220 00007ffc7ba2f96e clr!SaveCurrentExceptionInfo+0x72, calling clr!ClrFlsSetValue
00000011620f3250 00007ffc7ba2fdcf clr!CLRVectoredExceptionHandlerShim+0xa3, calling clr!CLRVectoredExceptionHandler
00000011620f3280 00007ffc97e883dc ntdll!RtlpCallVectoredHandlers+0x108, calling ntdll!guard_dispatch_icall_nop
00000011620f3320 00007ffc97e5b406 ntdll!RtlDispatchException+0x66, calling ntdll!RtlpCallVectoredHandlers
00000011620f3350 00007ffc7b8d882a clr!invokeCompileMethod+0x97, calling clr!invokeCompileMethodHelper
00000011620f33c0 00007ffc7b8d875e clr!CallCompileMethodWithSEHWrapper+0xe5
00000011620f33f0 00007ffc97e25d21 ntdll!RtlFreeHeap+0x51, calling ntdll!RtlpFreeHeapInternal
00000011620f3430 00007ffc7b8c5809 clr!EEHeapFreeInProcessHeap+0x45, calling KERNEL32!HeapFreeStub
00000011620f3460 00007ffc7b8d864d clr!UnsafeJitFunction+0x81b, calling clr!_security_check_cookie
00000011620f3530 00007ffc97eafe3e ntdll!KiUserExceptionDispatch+0x2e, calling ntdll!RtlDispatchException
00000011620f4330 00007ffc1ca6dc66 (MethodDesc 00007ffc1c87e010 +0x26 Rage.Attributes.PluginAttribute.get_Name()) ====> Exception Code c0000005 cxr@00000011620f3540 exr@00000011620f3a30

Resolution

I was able to fix this issue by removing the fiber logic from ScriptMain and by no longer relying on SHV’s scriptRegister. Unfortunately, removing the fiber logic from SHVDN alone is not enough: because SHV itself uses fibers, you will have to provide your own script VM tick hook. For a simple PoC, I hooked a native and ticked SHVDN from there, and it worked fine: no more fiber-related crashes! According to my own testing, you should still be able to use SHV to receive keyboard callbacks.
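As a rough illustration of what ticking without fibers can look like, here is a minimal, portable C++ sketch. It is not the PoC described above and not SHVDN's code. `TickedScript`, `TickDispatcher`, and the callback signature are hypothetical, and the sketch assumes each script can be written as a callback that returns how long it wants to wait, instead of yielding from a fiber.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical script entry: a callback plus the earliest time it may run again.
struct TickedScript {
    std::function<std::uint32_t()> on_tick; // returns the requested delay in ms
    std::uint64_t next_run_ms;
};

// Hypothetical dispatcher that the host would call once per frame from its own
// tick hook. Everything runs on the caller's stack, so no fiber switch occurs.
class TickDispatcher {
public:
    void add(std::function<std::uint32_t()> on_tick) {
        scripts_.push_back(TickedScript{std::move(on_tick), 0});
    }

    // Runs every script whose wait has elapsed and returns how many ran.
    int tick(std::uint64_t now_ms) {
        int ran = 0;
        for (auto& script : scripts_) {
            if (now_ms < script.next_run_ms) {
                continue; // still waiting, skip this frame
            }
            script.next_run_ms = now_ms + script.on_tick();
            ++ran;
        }
        return ran;
    }

private:
    std::vector<TickedScript> scripts_;
};
```

A script that asks for a 100 ms wait would run on the tick at 0 ms, be skipped at 50 ms, and run again at 100 ms. The trade-off of this design is that scripts cannot block in the middle of a function; any wait has to be expressed as a return value or explicit state.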

If you have any questions or feedback as to why fibers were used (or must be used), please let me know.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 34 (25 by maintainers)

Most upvoted comments

The reason RPH reimplements the main script loop is to allow certain RPH functionality, such as console commands (which may run natives), to execute even when the game is paused or is not ticking script threads for some other reason. At its heart, the tick for plugins is still handled via a normal scrThread in the list, and you probably want to do the same.