AsyncEx: Another deadlock - or at least a very slow lock

I added this to #57 but thought that since it’s closed, the repro for a bug there may not have been noticed.

Here’s the last post from that thread… Basically it’s an issue which v3 doesn’t exhibit but v4 and v5 do. I think I may have hit this in production, just after I finally moved to v4.x after staying on a patched v3.x for many months.

It’s intermittent but definitely does reproduce for me after a few executions. Hoping you can shed some light on the issue 😃 Thanks for the library and the general community work!

Post from #57 starts now…

The repro provided by @Allon-Guralnek will cause the issue for me.

In LINQPad I’ve got this:

AsyncLock l = new AsyncLock();

void Main()
{
	Task.Run(() =>
	{

		Console.WriteLine("Starting tasks...");
		var tasks = Enumerable.Range(0, 100).Select(i => Task.Run(() => Do1(i))).ToArray();

		Console.WriteLine("Tasks started. Waiting for all to complete...");
		Task.WaitAll(tasks);

		Console.WriteLine("All done!");
	}).Wait();
}

void Do1(int i)
{
	if (i == 0)
		Thread.Sleep(100);

	using (l.Lock())
	{
		// mutate shared state atomically
	}

	Console.WriteLine("Done: " + i);
}

// Monitor-based variant of Do1, kept for comparison; not called by Main() above.
void Do2(int i)
{
	if (i == 0)
		Thread.Sleep(100);

	lock (l)
	{
		// mutate shared state atomically
	}

//	Console.WriteLine("Done: " + i);
}

You’ll note that I’ve got a Task.Run() in the Main() method to get it started.

Sometimes it runs through just fine. Other times it’ll stop.

Example output

Starting tasks...
Done: 1
Tasks started. Waiting for all to complete...
Done: 2
Done: 99

and 1 minute 13 seconds later I’m still nowhere 😦. After 1 minute 28 seconds it suddenly finished:

Starting tasks...
Done: 1
Tasks started. Waiting for all to complete...
Done: 2
Done: 99
Done: 4
Done: 3
Done: 6
Done: 7
Done: 8
Done: 5
Done: 9
Done: 98
Done: 0
Done: 10
Done: 11
Done: 12
Done: 13
Done: 14
[snip - all the rest were in order]

But I can hit stop & start and get it working again. 9 times out of 10 it works properly. Removing the Console.WriteLine inside the loop seems to make it fail more often.

With LINQPad I can quickly swap between packages:

  • Nito.AsyncEx.Coordination NuGet package - this made the poor output above
  • Nito.AsyncEx NuGet package v3.x - works
  • Nito.AsyncEx NuGet package v4.x (the strong-named one was convenient to use) - failed on the first go with no output at all 😦
  • Nito.AsyncEx v3.x with the one-line patch from earlier - works

So this appears to be a regression where v3.x - with the one-line patch from earlier in this issue - is good but AsyncEx v4.x and v5.x are both bad.

It’s a tricky one as it’s intermittent and does eventually come good.

About this issue

  • State: open
  • Created 7 years ago
  • Comments: 45 (17 by maintainers)

Most upvoted comments

I’ve been working on a few different experiments related to this issue. While travelling for OSCON, I thought it would be useful to write out the background and context in which I’m thinking about this problem. Doing this was a great exercise; it allowed me to step back and identify each of the issues at hand (and then choose a new approach). This comment is mostly a brain-dump for now, but some of it will be reworked into a CONTRIBUTING doc at some point.

Central Postulates:

  1. AsyncEx uses locks, not lockless code. Lockless code is difficult to understand and verify, and - even more dangerously - it’s very common to think one understands how to write lockless code when this is not the case, due to the incredible complexity of memory models. See Double-Checked Locking Is Broken for a good historical example from Java.

  2. The greatest cardinal sin of concurrent programming is invoking arbitrary code while holding a lock. In the general case, this includes completing a TCS (a TaskCompletionSource), since continuations may run synchronously.
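
To make the second postulate concrete, here is a minimal sketch (my own illustration, not AsyncEx code) of the dangerous pattern and the usual fix. With default TCS options, a continuation attached to the task may run synchronously on the thread that completes it:

using System.Threading.Tasks;

class SignalExample
{
    private readonly object _mutex = new object();
    private readonly TaskCompletionSource<object> _tcs =
        new TaskCompletionSource<object>();

    public Task WaitAsync() => _tcs.Task;

    public void SignalUnsafe()
    {
        lock (_mutex)
        {
            // DANGER: continuations attached to _tcs.Task may execute
            // synchronously right here, while _mutex is still held -
            // arbitrary end-user code running under our lock.
            _tcs.TrySetResult(null);
        }
    }

    public void SignalSafe()
    {
        lock (_mutex)
        {
            // Mutate internal state under the lock...
        }
        // ...but complete the TCS only after the lock is released.
        _tcs.TrySetResult(null);
    }
}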

AsyncEx Code

Most AsyncEx coordination primitive code is straightforward and doesn’t even use async/await. However:

  • Some methods do use async/await.
  • One interaction is particularly notable: the relationship between AsyncLock and AsyncConditionVariable. The ACV needs to be able to unlock and relock the AL without affecting the IDisposable “key” that the user already has.
  • Coordination primitives which queue are built on a wait queue abstraction, which permits users to do fancy things like create priority locks. This wait queue abstraction does force us to use sync-over-async in some places instead of using the (preferred) boolean argument hack; AsyncConditionVariable is a good example of this.
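
As context for that last point, the “boolean argument hack” generally looks something like the following sketch (illustrative names, not AsyncEx’s actual internals): one core method takes a flag deciding whether it must complete synchronously, so the sync and async public APIs share a single implementation rather than the sync one blocking on the async one (sync-over-async).

using System.Threading.Tasks;

class BooleanArgumentHackSketch
{
    // The flag promises that when sync == true the core method will
    // complete synchronously, so GetAwaiter().GetResult() below never
    // actually blocks a thread on an incomplete task.
    private async Task<int> DoCoreAsync(bool sync)
    {
        if (sync)
            return ComputeSynchronously();
        return await ComputeAsynchronously().ConfigureAwait(false);
    }

    public int Do() => DoCoreAsync(sync: true).GetAwaiter().GetResult();
    public Task<int> DoAsync() => DoCoreAsync(sync: false);

    private int ComputeSynchronously() => 42;
    private Task<int> ComputeAsynchronously() => Task.FromResult(42);
}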

So, a good cross-section for experimentation purposes is:

  • AsyncManualResetEvent (extremely simple, no queue)
  • AsyncSemaphore (simple with queue)
  • AsyncLock + AsyncConditionVariable (internal interaction, and code that requires sync-over-async).

When I run experiments like this, I focus on these four types. If it works for them, it should work anywhere.

AsyncEx v4

In v4, all TCS’s used synchronous continuations, just like “normal” async code. v4 avoided invoking end-user callbacks under lock by separating the TCS completion code from internal logic. This necessitated a bit of a “dance” between the wait queue and the coordination primitives, with the completion code represented by a disposable.

This worked well enough, but it always had one theoretical issue: stack overflow was a possibility. The TPL does have heuristics in place to prevent this, but they don’t always work. AFAIK, this was never actually observed in production.

AsyncEx v5

In v5, one of the main changes was to use asynchronous continuations for TCS’s. This simplified the wait queue code (and coordination primitive code) by removing the “dance” completely. Unfortunately, it introduced a dependency on the thread pool even for internal operations. And this issue is not theoretical; it is very rare but has certainly been observed.
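
In code, the v5 behaviour corresponds to constructing TCS’s with the force-asynchronous flag, roughly like this sketch (not the actual AsyncEx source):

using System;
using System.Threading;
using System.Threading.Tasks;

class V5StyleSketch
{
    static async Task Main()
    {
        // v5-style: continuations never run on the completing thread.
        var tcs = new TaskCompletionSource<object>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        var waiter = ResumeAndReport(tcs.Task);

        // TrySetResult returns immediately; the awaiter's continuation
        // is queued to the thread pool. If the pool is exhausted, that
        // "release" work waits for thread injection - this issue.
        tcs.TrySetResult(null);
        await waiter;
    }

    static async Task ResumeAndReport(Task t)
    {
        await t;
        Console.WriteLine($"Resumed on thread {Thread.CurrentThread.ManagedThreadId}");
    }
}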

Possible Solutions

Solutions must maintain the Central Postulates above, in addition to these considerations:

  • If we have a dedicated thread, then we want to keep end-user code from ever running on that thread. The thread must be dedicated to executing our own short, nonblocking internal code.
  • Our internal code must not depend on a thread pool thread. In the original example code for this problem, Lock returned an IDisposable whose Dispose implementation depended on a thread pool thread. However, it is permissible to depend on a thread pool thread to resume executing asynchronous end-user code.

The solutions that I’ve considered (some of these have experimental code in different branches):

  1. ForceInline:

    • Use the RunContinuationsAsynchronously | DenyChildAttach flags for all TCS tasks. These can be returned to the user or our own code.
    • Our code uses ForceInline to override the RunContinuationsAsynchronously flag and force our continuations to be synchronous.
    • Evaluation:
      👍 Similar to how the old code works.
      👎 No protection from stack exhaustion; the old code did have it (in theory).
      👎 Users can use ForceInline to override our own force-asynchronous behavior.

  2. Dedicated Thread:

    • Use the RunContinuationsAsynchronously | DenyChildAttach flags for all TCS tasks. These can be returned to the user or our own code.
    • Our code uses ResumeOnDedicatedThread to override the RunContinuationsAsynchronously flag and force the continuations to run asynchronously on our dedicated thread.
    • Our code does not need to complete TCS’s from the dedicated thread.
    • Evaluation:
      👍 Our code always runs on the dedicated thread.
      👎 Users can use ForceInline to override our own force-asynchronous behavior, and thus execute end-user code on our dedicated thread.

  3. TCS Pair with ConfigureAwait(false):

    • FirstTcs is a normal TCS.
    • SecondTcs is a normal TCS, linked from FirstTcs by giving FirstTcs a continuation that propagates its completion to SecondTcs, wrapped with an explicit Task.Run (see the sketch after this list).
    • Our code awaits FirstTcs using ConfigureAwait(false).
    • The user only sees SecondTcs, which is forced asynchronous (due to Task.Run).
    • Evaluation:
      👍 Similar to how the old code works, including (theoretical) protection from stack exhaustion.
      👍 Users cannot force synchronous behavior using ForceInline, due to the explicit Task.Run.
      👎 Our awaits are not guaranteed to be synchronous. If stack exhaustion is imminent, or if the user is (indirectly) completing a TCS that doesn’t want to run our continuation, then we will still require a thread pool thread. The old v4 code had this same problem.

  4. TCS Pair with ForceInline:

    • Just like above, but instead of awaiting FirstTcs.ConfigureAwait(false), our code awaits FirstTcs.ForceInline().
    • Evaluation:
      👍 Similar to how the old code works.
      👍 Users cannot force synchronous behavior using ForceInline, due to the explicit Task.Run.
      👎 No protection from stack exhaustion; the old code did have it (in theory).

  5. TCS Pair with ResumeOnDedicatedThread:

    • Just like above, but instead of awaiting FirstTcs.ConfigureAwait(false), our code awaits FirstTcs.ResumeOnDedicatedThread().
    • Evaluation:
      👍 Our code always runs on the dedicated thread.
      👍 Users cannot use ForceInline to override our own force-asynchronous behavior, due to the explicit Task.Run. This prevents users from executing end-user code on our dedicated thread.
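
To make the “TCS pair” idea from solutions 3-5 concrete, here is a hedged sketch of the linkage (illustrative names, not the actual code from the experimental branches):

using System.Threading.Tasks;

class TcsPairSketch
{
    // Completed by our internal code; our code awaits this one.
    private readonly TaskCompletionSource<object> _firstTcs =
        new TaskCompletionSource<object>();

    // The only task end users ever see.
    private readonly TaskCompletionSource<object> _secondTcs =
        new TaskCompletionSource<object>();

    public TcsPairSketch()
    {
        // Propagate FirstTcs's completion to SecondTcs, wrapped in an
        // explicit Task.Run so user-visible continuations are always
        // forced asynchronous - even if the user tries to inline them.
        // (Error/cancellation propagation omitted for brevity.)
        _firstTcs.Task.ContinueWith(
            t => Task.Run(() => _secondTcs.TrySetResult(t.Result)),
            TaskContinuationOptions.ExecuteSynchronously);
    }

    public Task UserTask => _secondTcs.Task;       // returned to end users
    internal Task InternalTask => _firstTcs.Task;  // awaited by our code

    internal void Complete() => _firstTcs.TrySetResult(null);
}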

Conclusion

After writing all these out, I believe the best solution going forward is the last one. My next step is to experiment with that solution on the experimental cross-section (AsyncManualResetEvent + AsyncSemaphore + AsyncLock + AsyncConditionVariable), and if that looks good, apply it through the rest of the code.

Yes, this is actually a known problem (though it didn’t have a GH issue before). In fact, this is exactly the reason why AsyncEx v5 is still in prerelease. One of my commercial clients first made me aware of it, and while they’re waiting for a fix they just cranked up the min-threads on the thread pool as a workaround.

This problem comes up most often when:

  • Many thread pool threads are synchronously waiting on an AsyncEx primitive.
  • The thread pool is exhausted.

What’s actually happening is that the first lock goes through just fine (synchronously). Then the thread pool is exhausted with a bunch of threads synchronously blocked on that lock. Then when the first thread unlocks, it releases the next waiter, but requires a thread pool thread to do so. Thus, it must wait until the thread pool gets around to running that “release” code. Meanwhile, the thread pool is injecting one new thread every 2 seconds or so, in no rush. So we end up with a lock convoy with huge wait times due to the thread pool injection rate.
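
The min-threads workaround mentioned above is essentially a one-liner; something like this sketch (the value 200 is arbitrary):

using System.Threading;

class MinThreadsWorkaround
{
    static void Apply()
    {
        // Raise the worker-thread floor so blocked waiters can be
        // released without waiting on the pool's slow injection rate.
        ThreadPool.GetMinThreads(out int workers, out int ioThreads);
        if (workers < 200)
            ThreadPool.SetMinThreads(200, ioThreads);
    }
}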

This is due to the core change of v4 -> v5: all continuations are asynchronous in the core of all the primitives (AsyncWaitQueue is the “core”). This simplifies the code considerably, but does assume that there is available room on the thread pool.

For a solution, I’m trying out variations of a custom “completer” thread. Something very similar to Ayende’s blog here.
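
The general shape of such a “completer” thread is a dedicated thread draining a queue of short completion actions; a sketch under my own assumptions (not the actual experimental code):

using System;
using System.Collections.Concurrent;
using System.Threading;

// Internal completion work is queued here instead of to the thread
// pool, so releasing a waiter never depends on an available pool thread.
static class Completer
{
    private static readonly BlockingCollection<Action> Queue =
        new BlockingCollection<Action>();

    static Completer()
    {
        var thread = new Thread(() =>
        {
            foreach (var action in Queue.GetConsumingEnumerable())
                action(); // must be short, nonblocking internal code
        })
        {
            IsBackground = true,
            Name = "Completer (sketch)"
        };
        thread.Start();
    }

    public static void Enqueue(Action completion) => Queue.Add(completion);
}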

@ygoe Yes, I have considered releasing a stable version with this as a known bug. The perfectionist in me doesn’t like that, but commercial software does it all the time.

I’ve done a ton of work in different branches, approaching the problem different ways, but hitting blocks each time. I did finally find a permanent solution, but haven’t had time to implement it (the final solution also requires careful documentation).

At the same time, I’m aware of the problems with keeping a library prerelease for so long. My current plan is to try to finish this issue once and for all over MVP summit (one time of the year when I have a lot more free time than normal). If I don’t finish it then, I promise I’ll release it as stable with a known issue.

That is very interesting, particularly since the thread pool has special dedicated I/O threads, so the limited thread-injection rate shouldn’t affect that kind of scenario. But in reality, much of the time the BCL code running on an I/O thread will post work to a worker thread, so execution “jumps” over to the more limited side before end-user code runs.

Concerning the use of SemaphoreSlim, I observe the same issue here, but only when async and sync usage is mixed.

Here is the code to reproduce this issue using SemaphoreSlim:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

namespace Example
{
  public class Program
  {
    static readonly SemaphoreSlim s = new SemaphoreSlim(1, 1);

    static void Main()
    {
      Task.Run(() =>
      {
        Console.WriteLine("Starting tasks...");
        var sw = new Stopwatch();
        sw.Start();

        var tasks = RunTasks(onlySync: true, onlyAsync: false);

        Console.WriteLine("Sync tasks started. Waiting for all to complete...");
        Task.WaitAll(tasks);

        sw.Stop();

        Console.WriteLine("All sync tasks done! Time elapsed: " + sw.Elapsed.TotalSeconds + " seconds");

        sw.Reset();
        sw.Start();

        tasks = RunTasks(onlySync: false, onlyAsync: true);

        Console.WriteLine("Async tasks started. Waiting for all to complete...");
        Task.WaitAll(tasks);

        sw.Stop();

        Console.WriteLine("All async tasks done! Time elapsed: " + sw.Elapsed.TotalSeconds + " seconds");

        sw.Reset();
        sw.Start();

        tasks = RunTasks(onlySync: false, onlyAsync: false);

        Console.WriteLine("Mixed tasks started. Waiting for all to complete...");
        Task.WaitAll(tasks);

        sw.Stop();

        Console.WriteLine("All mixed tasks done! Time elapsed: " + sw.Elapsed.TotalSeconds + " seconds");
      }).Wait();

      Console.ReadLine();
    }

    static Task[] RunTasks(bool onlySync, bool onlyAsync)
    {
      bool mixed = !onlySync && !onlyAsync;
      var tasks = new Task[30];

      for (int i = 0; i < 30; i++)
      {
        int i1 = i; // capture the loop variable for the closure
        // Mixed mode alternates: even indexes sync, odd indexes async.
        bool runSync = onlySync || (mixed && i % 2 == 0);

        tasks[i] = runSync
          ? Task.Run(() => DoWithSemaphore(i1))
          : Task.Run(async () => await DoWithSemaphoreAsync(i1));
      }

      return tasks;
    }

    static void DoWithSemaphore(int i)
    {
      s.Wait();

      try
      {
        // mutate shared state atomically
        Thread.Sleep(100);
      }
      finally
      {
        s.Release();
      }

      Console.WriteLine("Done: " + i);
    }

    static async Task DoWithSemaphoreAsync(int i)
    {
      await s.WaitAsync();

      try
      {
        // mutate shared state atomically
        await Task.Delay(100);
      }
      finally
      {
        s.Release();
      }

      Console.WriteLine("Done (async): " + i);
    }
  }
}

  • When running with only “sync” tasks, the program completes after about 3 seconds.

  • When running with only “async” tasks, the program completes after about 3 seconds.

  • When running with mixed tasks, the program takes much longer to terminate (it varies a bit).

When you take a look at the source code of SemaphoreSlim, you can see that they are using a TaskNode class that is just a derived Task<> constructed with parameter TaskCreationOptions.RunContinuationsAsynchronously.

As soon as async waiters are in the game (see line 358 of SemaphoreSlim), all follow-up sync waiters are also put in the queue of async waiters.

On the other side, if we only have sync waiters (and no async ones), a pulse-wait approach is used, which does not need a thread pool.

And if we only have async waiters, the problem does not come into play, since we do not block on a task; instead we await it, and while awaiting, the underlying thread goes back to the thread pool.
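
A rough sketch of the two waiter styles being contrasted here (illustrative, not SemaphoreSlim’s actual source): a sync waiter can be woken without the thread pool, while an async waiter needs somewhere to run its continuation.

using System.Threading;
using System.Threading.Tasks;

// A single waiter of each style is shown for brevity.
class TwoWaiterStyles
{
    private readonly object _gate = new object();

    // Sync waiter: Monitor-based wait/pulse. Waking it needs no thread
    // pool thread, because the blocked thread itself resumes.
    public void SyncWait()
    {
        lock (_gate) { Monitor.Wait(_gate); }
    }

    public void SyncRelease()
    {
        lock (_gate) { Monitor.Pulse(_gate); }
    }

    // Async waiter: task-based. Completing it needs somewhere for the
    // continuation to run; with RunContinuationsAsynchronously, that
    // "somewhere" is a thread pool thread.
    private readonly TaskCompletionSource<bool> _asyncWaiter =
        new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

    public Task AsyncWait() => _asyncWaiter.Task;

    public void AsyncRelease() => _asyncWaiter.TrySetResult(true);
}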

So, I don’t think that using SemaphoreSlim solves the issue.

@StephenCleary You mentioned that you have a final solution in mind. I’m looking forward to it and am a little bit curious what kind of solution it is.

Reading the VS release notes for 15.4.1 (out as of about 40 minutes ago) I came across this Roslyn thread https://github.com/dotnet/roslyn/issues/22650#issuecomment-335638013

It’s not quite the same issue but it reminded me of this discussion. A blocking synchronous resource - disk - coupled with lots of async calls piling up, then the thread pool starving (or growing very slowly). Their problem is that they also gobbled up RAM at the same time. The solution they seem to be discussing (and have released) is to have a single thread to which the other tasks post their requests for work.

Sorry if this is off-topic - I just found it interesting as another manifestation of a problem in the same ballpark as this thread’s issue.

Possibly complicated, but it’s entirely abstracted away by the library to the point where 99% of library consumers won’t even know it’s there.

Making continuations asynchronous avoids the “sometimes sync, sometimes async” behaviour that gave jQuery promises a bad name. I appreciate the environment is different - since JS is completely single-threaded - but the consistency argument still applies (see http://blog.izs.me/post/59142742143/designing-apis-for-asynchrony and https://medium.com/@bluepnume/intentionally-unleashing-zalgo-with-promises-ab3f63ead2fd for the JS side of the fence).

Making the continuations synchronous has its own problems. For a given TaskCompletionSource, its continuations can either be forced-asynchronous (used by v5) or unspecified (used by v4).

Synchronization primitives in v4 avoid this thread pool exhaustion issue by using synchronous continuations. However, since end-user code is also expressed as a continuation, we have to force those continuations to be asynchronous or else we’d end up with a callback-under-lock scenario, which very easily results in deadlock. So in v4, all continuations are actually semi-asynchronous; they’re synchronous enough to avoid the thread pool exhaustion but asynchronous enough to avoid the deadlock. As you can imagine, this complicates the code considerably.

I do much prefer the approach forcing asynchrony for end-user continuations, just from a code perspective (simpler, easier to prove correct, easier to maintain). Assuming I keep this, there are two options for avoiding the thread pool exhaustion issue: introducing a separate thread, and separating end-user continuations from internal continuations (requiring separate TCSs). I’ve played with both and am definitely leaning towards the thread solution.

Yes, the earlier code didn’t change that. All outside continuations still run asynchronously. WaitSync() just creates a temporary Task t2, forces it to run synchronously after the original task, and then Waits on the new task.

SemaphoreSlim does not guarantee the order in which the threads are released.

Having this guarantee is an important feature of AsyncLock IMO.

Historically, SemaphoreSlim did not support asynchronous operations at the time AsyncLock was written. I’m not sure if SemaphoreSlim supports them on all the AsyncEx platforms today. But even if it did, AsyncLock has some advanced usages (namely, integration with AsyncConditionVariable and the ability to use custom queues to implement alternative locking strategies such as prioritized locks) that can’t be supported with a SemaphoreSlim wrapper.

Under the hood, SemaphoreSlim also uses a queuing approach, but their implementation avoids the thread pool exhaustion issue here. They use an approach more similar to what I used in AsyncEx v4 and earlier.

Just to clarify, the synchronous vs asynchronous continuations that I’m talking about are only when there is a continuation, so from the end-user’s perspective this only applies when the await is asynchronous. When I’m talking about the continuation being synchronous or asynchronous, that’s from the perspective of the TCS-completing thread.

To use a lock example, LockAsync will behave synchronously if the lock is unlocked; that’s not going to change. The situation we’re fixing here is when the lock is already locked, and some code UseResource calls LockAsync (which behaves asynchronously since the lock is already locked) and awaits on the returned task. Then when some thread releases the lock, we need to complete that TCS, but we can’t do it while holding an internal lock. We need to push that completion outside of the lock, which is most easily done by forcing the continuation to be asynchronous.
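
For reference, the end-user pattern being described is the standard AsyncEx usage and stays the same either way:

using Nito.AsyncEx;
using System.Threading.Tasks;

class Resource
{
    private readonly AsyncLock _mutex = new AsyncLock();

    public async Task UseResourceAsync()
    {
        // If the lock is free, this completes synchronously; if it is
        // already held, the await behaves asynchronously and this
        // method resumes once the lock is released.
        using (await _mutex.LockAsync())
        {
            // mutate shared state atomically
        }
    }
}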

So there are still plenty of situations with AsyncEx where end-user code will await, and it will behave synchronously, just like other TAP code. This issue is purely about a careful internal design that will be maintainable, prevent deadlocks, and also continue to work when the thread pool is exhausted. It won’t have any effect on end-user semantics as far as which operations act synchronously and which ones are asynchronous.

This actually comes down to how await schedules continuations. Specifically, it always schedules them synchronously - so, after an await acts asynchronously, when its task is completed, the asynchronous method that did the await attempts to resume synchronously on the same thread and stack that is completing the task. When I discovered this, I found it very surprising and reported it as a bug (closed as “by design”). It can cause deadlocks (as described in my bug report) as well as stack overflows, unless end-user code takes steps to prevent it.
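
A small demonstration of that scheduling behaviour (my own illustration): with a default TCS, the awaiting method typically resumes synchronously on the thread that calls TrySetResult.

using System;
using System.Threading;
using System.Threading.Tasks;

class SyncContinuationDemo
{
    static readonly TaskCompletionSource<object> Tcs =
        new TaskCompletionSource<object>(); // note: default options

    static async Task Main()
    {
        var waiter = WaitAndReport();
        Console.WriteLine($"Completing on thread {Thread.CurrentThread.ManagedThreadId}");
        // The continuation in WaitAndReport typically runs right here,
        // synchronously (TPL heuristics can occasionally defer it).
        Tcs.TrySetResult(null);
        await waiter;
    }

    static async Task WaitAndReport()
    {
        await Tcs.Task;
        // With a default TCS this usually prints the completing thread's
        // id; TaskCreationOptions.RunContinuationsAsynchronously would
        // force it onto a thread pool thread instead.
        Console.WriteLine($"Resumed on thread {Thread.CurrentThread.ManagedThreadId}");
    }
}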

It’s my personal opinion that JavaScript’s promise callback and await semantics are much better than C#'s await semantics. I think Microsoft was too focused on micro-benchmarks and missed out on providing an easier-to-use abstraction for developers.

The beauty of asynchronous programming is that “there is no thread”. I learned it from Stephen, both from his book and articles.

If it’s easier to maintain, I understand of course, but I just find it a bit “sad” that we have to rely on spawning a manual Thread that isn’t controlled by the ThreadPool. It still feels complicated.