runtime: ThreadAbortException not re-thrown by the runtime
It looks like the ThreadAbortException is not always re-thrown by the runtime. See below for a sample program that reproduces the issue, causing an infinite thread loop.
I have reproduced this on two 64-bit systems running .NET Framework 4.6.1 and 4.7.1 respectively. I can not reproduce on .NET Framework 4.5.2.
The issue goes away if forcing usage of the legacy jit compiler with the following config:
<runtime>
<useLegacyJit enabled="1" />
</runtime>
I have reproduced with the following program built using Visual Studio Enterprise 2017, Version 15.5.2, targeting .NET Framework 4, Any CPU, Release build:
using System;
using System.Threading;
namespace ConsoleApp {
class Program {
static bool started = false;
static void Main(string[] args) {
// Start thread
var thread = new Thread(ThreadLoop);
thread.Start();
while (!started);
// Abort thread
Console.WriteLine($"Aborting thread");
thread.Abort();
}
private static void ThreadLoop() {
Console.WriteLine($"Thread started");
int cnt = 0;
while (true) {
try {
if (!started) {
Console.WriteLine($"Thread loop");
started = true;
}
}
catch (ThreadAbortException) {
Console.WriteLine($"ThreadAbortException #{cnt++}");
}
}
}
}
}
The output when reproducing:
Thread started
Thread loop
Aborting thread
ThreadAbortException #0
ThreadAbortException dotnet/coreclr#1
ThreadAbortException dotnet/coreclr#2
ThreadAbortException dotnet/coreclr#3
ThreadAbortException dotnet/coreclr#4
ThreadAbortException dotnet/coreclr#5
ThreadAbortException dotnet/coreclr#6
ThreadAbortException dotnet/coreclr#7
ThreadAbortException dotnet/coreclr#8
ThreadAbortException dotnet/coreclr#9
ThreadAbortException dotnet/coreclr#10
ThreadAbortException dotnet/coreclr#11
ThreadAbortException dotnet/coreclr#12
ThreadAbortException dotnet/coreclr#13
ThreadAbortException dotnet/coreclr#14
ThreadAbortException dotnet/coreclr#15
ThreadAbortException dotnet/coreclr#16
ThreadAbortException dotnet/coreclr#17
ThreadAbortException dotnet/coreclr#18
ThreadAbortException dotnet/coreclr#19
(continuing forever)
I am aware of issue https://github.com/dotnet/coreclr/issues/3155, but it does not sound like the exact same issue.
Any thoughts on this?
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 15 (13 by maintainers)
Commits related to this issue
- workaround for https://github.com/dotnet/coreclr/issues/16122 on net471 — committed to skbkontur/cassandra-thrift-client by AndrewKostousov 6 years ago
- workaround for https://github.com/dotnet/coreclr/issues/16122 on net471 — committed to skbkontur/cassandra-distributed-lock by AndrewKostousov 6 years ago
- workaround for https://github.com/dotnet/coreclr/issues/16122 on net471 — committed to skbkontur/cassandra-distributed-lock by AndrewKostousov 6 years ago
- workaround for https://github.com/dotnet/coreclr/issues/16122 on net471 — committed to skbkontur/cassandra-primitives by AndrewKostousov 6 years ago
This bug is occurring due to an unfortunate interaction between a workaround for an ancient bug, and code generated by the current C# and RyuJIT compilers.
When the VM goes to raise a Thread Abort, it checks to make sure that a thread is in a “good” place to do so. It doesn’t want to raise a Thread Abort if the thread is currently executing a “finally” (or “fault”) exception handling clause. So,
ReadyForAbort()does a stack walk, andTAStackCrawlCallBackWorkerlooks to see if the current frame is within a finally or fault. I’m not sure what happens if it is not in a “good” location; perhaps wait and try again later?However,
TAStackCrawlCallBackWorkeron amd64 also checks for a peculiar case of NOPs preceding atry, described later.Long ago, before .NET 4, the C# compiler, when generating a
lockconstruct, would generate something like this:On AMD64, there is a rule (see here) that a
callinstruction immediately preceding atryregion must have aNOPinserted, such that the return address of thecallis in the same EH region as thecallitself. So when the above generatedlockconstruct was translated to native code on AMD64, it looks like this:Now, if a Thread Abort happens asynchronously, and is raised when this thread’s instruction pointer is at the NOP, then the lock has already been taken, but the NOP isn’t protected by the following
try, so the exception handler will not invoke thefinally. Thus, the lock can “leak” – it won’t be unlocked.To work around that, code was added in
TAStackCrawlCallBackWorkerto check if a single NOP precedes atryregion, and if so, don’t allow a Thread Abort to be raised when the IP is at that NOP. Instead, let execution continue and try again later. This check isn’t very precise, e.g.: (1) it doesn’t know if the NOP follows a call to Monitor.Enter(), (2) it doesn’t check that the EH region is a try/finally (or fault), (3) it doesn’t know if there’s a Monitor.Exit() in the finally.Another thing to consider is that for a
try/catch, where thecatchcatches aThreadAbortException, but doesn’t suppress it, the VM will automatically re-raise that exception after thecatchhas executed. To do so, the VM needs to choose an address to use as the “re-raise address” – the address in the managed code where the exception will be considered to originate. This is important for determining whichtryregions get to handle the re-raised exception. The address used for this purpose is the “continuation” or “return” address from thecatchclause. Thus, it is important that the exception region where this address lives be correct. Much more detail is given here.With the failing test case here, there is a very tight loop. The
catchclause continuation address ends up being either the first instruction of thetryregion, for the older C# compiler, or a NOP immediately preceding thetryregion, for the current Roslyn C# compiler.For the older C# compiler, the continuation address is within the
try. You would think that would cause the raised exception to be caught again by thecatch. However, there is logic in the VM to try and make forward progress on Thread Abort reraises (inExceptionTracker::ProcessManagedCallFrame). So, it notices that this EH clause has already been processed, and skips it. In the CLR log:In the Roslyn case, the continuation address is a generated NOP before the
try. The special NOP logic inTAStackCrawlCallBackWorkerkicks in (unintentionally), and says not to reraise. It appears at this point that the thread abort is attempted again but at offset 0x28, which is the 2nd instruction within thetry. It also appears that the CLR has “forgotten” that it is attempting to do a reraise. Thus, the make “forward progress” logic doesn’t kick in, and we go ahead and reraise within thetry. This leads to an infinite loop.Note that the “lock” construct is not generated this way anymore. Currently, the generated IL looks something like:
@RussKeldorph, @briansull: Is anyone looking into this, or did you mean that I should post this elsewhere?
I believe this is a major issue. It very often causes infinite loops, e.g. during app domain recycling when running under the IIS. The following is a very common pattern for long running background threads.
If the thread abort exception is not re-thrown automatically, this will cause an infinite loop when the app domain is meant to be shut down. We have had to go through all code and explicitly add the following to work around the issue. The alternative would be to force usage of the old jit compiler.