runtime: ThreadAbortException not re-thrown by the runtime

It looks like the ThreadAbortException is not always re-thrown by the runtime. See below for a sample program that reproduces the issue, causing an infinite thread loop.

I have reproduced this on two 64-bit systems running .NET Framework 4.6.1 and 4.7.1 respectively. I can not reproduce on .NET Framework 4.5.2.

The issue goes away if forcing usage of the legacy jit compiler with the following config:

  <runtime>
    <useLegacyJit enabled="1" />
  </runtime>

I have reproduced with the following program built using Visual Studio Enterprise 2017, Version 15.5.2, targeting .NET Framework 4, Any CPU, Release build:

using System;
using System.Threading;

namespace ConsoleApp {
    class Program {

        static bool started = false;

        static void Main(string[] args) {
            // Start thread
            var thread = new Thread(ThreadLoop);
            thread.Start();
            while (!started);

            // Abort thread
            Console.WriteLine($"Aborting thread");
            thread.Abort();
        }

        private static void ThreadLoop() {
            Console.WriteLine($"Thread started");
            int cnt = 0;
            while (true) {
                try {
                    if (!started) {
                        Console.WriteLine($"Thread loop");
                        started = true;
                    }
                }
                catch (ThreadAbortException) {
                    Console.WriteLine($"ThreadAbortException #{cnt++}");
                }
            }
        }
    }
}

The output when reproducing:

Thread started
Thread loop
Aborting thread
ThreadAbortException #0
ThreadAbortException dotnet/coreclr#1
ThreadAbortException dotnet/coreclr#2
ThreadAbortException dotnet/coreclr#3
ThreadAbortException dotnet/coreclr#4
ThreadAbortException dotnet/coreclr#5
ThreadAbortException dotnet/coreclr#6
ThreadAbortException dotnet/coreclr#7
ThreadAbortException dotnet/coreclr#8
ThreadAbortException dotnet/coreclr#9
ThreadAbortException dotnet/coreclr#10
ThreadAbortException dotnet/coreclr#11
ThreadAbortException dotnet/coreclr#12
ThreadAbortException dotnet/coreclr#13
ThreadAbortException dotnet/coreclr#14
ThreadAbortException dotnet/coreclr#15
ThreadAbortException dotnet/coreclr#16
ThreadAbortException dotnet/coreclr#17
ThreadAbortException dotnet/coreclr#18
ThreadAbortException dotnet/coreclr#19

(continuing forever)

I am aware of issue https://github.com/dotnet/coreclr/issues/3155, but it does not sound like the exact same issue.

Any thoughts on this?

About this issue

Original URL
State: closed
Created 6 years ago
Reactions: 2
Comments: 15 (13 by maintainers)

Commits related to this issue

workaround for https://github.com/dotnet/coreclr/issues/16122 on net471 — committed to skbkontur/cassandra-thrift-client by AndrewKostousov 6 years ago
workaround for https://github.com/dotnet/coreclr/issues/16122 on net471 — committed to skbkontur/cassandra-distributed-lock by AndrewKostousov 6 years ago
workaround for https://github.com/dotnet/coreclr/issues/16122 on net471 — committed to skbkontur/cassandra-distributed-lock by AndrewKostousov 6 years ago
workaround for https://github.com/dotnet/coreclr/issues/16122 on net471 — committed to skbkontur/cassandra-primitives by AndrewKostousov 6 years ago

Most upvoted comments

This bug is occurring due to an unfortunate interaction between a workaround for an ancient bug, and code generated by the current C# and RyuJIT compilers.

When the VM goes to raise a Thread Abort, it checks to make sure that a thread is in a “good” place to do so. It doesn’t want to raise a Thread Abort if the thread is currently executing a “finally” (or “fault”) exception handling clause. So, ReadyForAbort() does a stack walk, and TAStackCrawlCallBackWorker looks to see if the current frame is within a finally or fault. I’m not sure what happens if it is not in a “good” location; perhaps wait and try again later?

However, TAStackCrawlCallBackWorker on amd64 also checks for a peculiar case of NOPs preceding a try, described later.

Long ago, before .NET 4, the C# compiler, when generating a lock construct, would generate something like this:

lock (thisLock)
{
    ... work ...
}

=>

Monitor.Enter(thisLock);
try
{
    ... work ...
}
finally
{
   Monitor.Exit(thisLock);
}

On AMD64, there is a rule (see here) that a call instruction immediately preceding a try region must have a NOP inserted, such that the return address of the call is in the same EH region as the call itself. So when the above generated lock construct was translated to native code on AMD64, it looks like this:

Monitor.Enter(thisLock);
NOP  // inserted by JIT to enforce above rule
try
{
    ... work ...
}
finally
{
   Monitor.Exit(thisLock);
}

Now, if a Thread Abort happens asynchronously, and is raised when this thread’s instruction pointer is at the NOP, then the lock has already been taken, but the NOP isn’t protected by the following try, so the exception handler will not invoke the finally. Thus, the lock can “leak” – it won’t be unlocked.

To work around that, code was added in TAStackCrawlCallBackWorker to check if a single NOP precedes a try region, and if so, don’t allow a Thread Abort to be raised when the IP is at that NOP. Instead, let execution continue and try again later. This check isn’t very precise, e.g.: (1) it doesn’t know if the NOP follows a call to Monitor.Enter(), (2) it doesn’t check that the EH region is a try/finally (or fault), (3) it doesn’t know if there’s a Monitor.Exit() in the finally.

Another thing to consider is that for a try/catch, where the catch catches a ThreadAbortException, but doesn’t suppress it, the VM will automatically re-raise that exception after the catch has executed. To do so, the VM needs to choose an address to use as the “re-raise address” – the address in the managed code where the exception will be considered to originate. This is important for determining which try regions get to handle the re-raised exception. The address used for this purpose is the “continuation” or “return” address from the catch clause. Thus, it is important that the exception region where this address lives be correct. Much more detail is given here.

With the failing test case here, there is a very tight loop. The catch clause continuation address ends up being either the first instruction of the try region, for the older C# compiler, or a NOP immediately preceding the try region, for the current Roslyn C# compiler.

For the older C# compiler, the continuation address is within the try. You would think that would cause the raised exception to be caught again by the catch. However, there is logic in the VM to try and make forward progress on Thread Abort reraises (in ExceptionTracker::ProcessManagedCallFrame). So, it notices that this EH clause has already been processed, and skips it. In the CLR log:

TID 14d0: ..  considering typed clause [20,3f], ControlPc is inside clause (offset 20)
TID 14d0: ..  skipping the evaluation of EH clause (index=0) since we cannot process an exception in a handler
TID 14d0: ..  that exists prior to the one (index=1) after which ThreadAbort was [re]raised.

In the Roslyn case, the continuation address is a generated NOP before the try. The special NOP logic in TAStackCrawlCallBackWorker kicks in (unintentionally), and says not to reraise. It appears at this point that the thread abort is attempted again but at offset 0x28, which is the 2nd instruction within the try. It also appears that the CLR has “forgotten” that it is attempting to do a reraise. Thus, the make “forward progress” logic doesn’t kick in, and we go ahead and reraise within the try. This leads to an infinite loop.

Note that the “lock” construct is not generated this way anymore. Currently, the generated IL looks something like:

bool lockTaken = false;
try
{
   Monitor.Enter(thisLock, ref lockTaken);
   ... work ...
}
finally
{
   if (lockTaken) {
      Monitor.Exit(thisLock);
   }
}

BruceForstall on Feb 16, 2018

@RussKeldorph, @briansull: Is anyone looking into this, or did you mean that I should post this elsewhere?

I believe this is a major issue. It very often causes infinite loops, e.g. during app domain recycling when running under the IIS. The following is a very common pattern for long running background threads.

// Thread loop that should never exit until app domain/process dies
while (true) {
    try {
        // Do important work in a long running thread
    }
    catch (Exception) {
        // Log error
    }
}

If the thread abort exception is not re-thrown automatically, this will cause an infinite loop when the app domain is meant to be shut down. We have had to go through all code and explicitly add the following to work around the issue. The alternative would be to force usage of the old jit compiler.

catch (ThreadAbortException) {
    throw;
}

hmanner on Feb 7, 2018