runtime: [Discussion] Fail Tasks stuck in Wait() on ThreadPool when starvation occurs?

ThreadPool starvation can be problematic, and the pool will never recover if the rate at which ThreadPool threads block is higher than the rate of ThreadPool thread injection.
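For reference, the classic pattern that produces this is sync-over-async on pool threads: a pool thread blocks on a Task whose completion itself needs a pool thread. A minimal self-contained sketch (not runtime code; this completes because only one thread blocks, but scale up the number of blocking work items past the thread count and injection at roughly one thread per second cannot keep up):

```csharp
using System;
using System.Threading.Tasks;

class SyncOverAsyncDemo
{
    static async Task<int> DoWorkAsync()
    {
        // The continuation after this await needs a free ThreadPool thread.
        await Task.Delay(50).ConfigureAwait(false);
        return 42;
    }

    static void Main()
    {
        // Sync-over-async: a pool thread blocks in .Result on a Task whose
        // continuation also needs a pool thread. A handful of these are fine;
        // queue more of them than the pool has threads and it starves.
        int result = Task.Run(() => DoWorkAsync().Result).Result;
        Console.WriteLine(result);
    }
}
```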

Suggestion

If .Wait() (or one of its associated permutations, e.g. .Result) is performed on a ThreadPool thread, this could trigger the spawning of a “finalizer”-type thread that fails (cancels?) the Tasks and releases the .Waits when ThreadPool starvation occurs.

The Tasks would need to register themselves before waiting, when Thread.CurrentThread.IsThreadPoolThread is true, so they are available to be cancelled and unblocked.

This would likely want to operate with some ThreadPool heuristics so it doesn’t trigger too easily when natural thread injection would resolve the starvation.
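One possible shape for such a heuristic, as a hypothetical sketch (the name and thresholds are illustrative, not a real runtime API; it uses the public `ThreadPool.ThreadCount` and `ThreadPool.PendingWorkItemCount` counters available since .NET Core 3.0):

```csharp
using System;
using System.Threading;

static class StarvationHeuristic
{
    // Hypothetical: treat the pool as starved when work items are queuing
    // up while the thread count has already grown well past the core count,
    // i.e. thread injection is not draining the backlog.
    public static bool LooksStarved() =>
        ThreadPool.PendingWorkItemCount > 0 &&
        ThreadPool.ThreadCount > Environment.ProcessorCount * 2;
}

class Program
{
    static void Main()
    {
        // An idle, freshly started process should not look starved.
        Console.WriteLine(StarvationHeuristic.LooksStarved());
    }
}
```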

Drawbacks

Tasks would start getting cancelled seemingly at random once the ThreadPool becomes starved, which might be unexpected behaviour; but that may be better than entering an unrecoverable state?

Example registration code

Task.cs

+ private static Dictionary<SetOnInvokeMres, Task>? s_threadPoolWaitRegistrations;
+ private static object? s_registrationLock;
  
  private bool SpinThenBlockingWait(int millisecondsTimeout, CancellationToken cancellationToken)
  {
      bool infiniteWait = millisecondsTimeout == Timeout.Infinite;
      uint startTimeTicks = infiniteWait ? 0 : (uint)Environment.TickCount;
      bool returnValue = SpinWait(millisecondsTimeout);
      if (!returnValue)
      {
          var mres = new SetOnInvokeMres();
+         var isWaitRegistered = false;
          try
          {
              AddCompletionAction(mres, addBeforeOthers: true);
+             // If running on the ThreadPool, register this Task and its wait so it's
+             //  available to be failed/cancelled if ThreadPool starvation occurs.
+             if (Thread.CurrentThread.IsThreadPoolThread)
+             {
+                 lock (ThreadPoolWaitRegistrationLock)
+                 {
+                     ThreadPoolWaitRegistrations.Add(mres, this);
+                     isWaitRegistered = true;
+                 }
+             }
  
              if (infiniteWait)
              {
                  returnValue = mres.Wait(Timeout.Infinite, cancellationToken);
              }
              else
              {
                  uint elapsedTimeTicks = ((uint)Environment.TickCount) - startTimeTicks;
                  if (elapsedTimeTicks < millisecondsTimeout)
                  {
                      returnValue = mres.Wait((int)(millisecondsTimeout - elapsedTimeTicks), cancellationToken);
                  }
              }
          }
          finally
          {
+             if (isWaitRegistered)
+             {
+                 lock (ThreadPoolWaitRegistrationLock)
+                 {
+                     // Remove the wait registration.
+                     ThreadPoolWaitRegistrations.Remove(mres);
+                 }
+             }
  
              if (!IsCompleted) RemoveContinuation(mres);
              // Don't Dispose of the MRES, because the continuation off of this task may
              // still be running.  This is ok, however, as we never access the MRES' WaitHandle,
              // and thus no finalizable resources are actually allocated.
          }
      }
      return returnValue;
  }

+ private static Dictionary<SetOnInvokeMres, Task> ThreadPoolWaitRegistrations
+     => s_threadPoolWaitRegistrations ?? CreateThreadPoolRegistrations();
+ 
+ [MethodImpl(MethodImplOptions.NoInlining)]
+ private static Dictionary<SetOnInvokeMres, Task> CreateThreadPoolRegistrations()
+ {
+     Interlocked.CompareExchange(ref s_threadPoolWaitRegistrations, new Dictionary<SetOnInvokeMres, Task>(), null);
+     return s_threadPoolWaitRegistrations;
+ }
+ 
+ private static object ThreadPoolWaitRegistrationLock
+     => s_registrationLock ?? CreateThreadPoolWaitRegistrationLock();
+ 
+ [MethodImpl(MethodImplOptions.NoInlining)]
+ private static object CreateThreadPoolWaitRegistrationLock()
+ {
+     Interlocked.CompareExchange(ref s_registrationLock, new object(), null);
+     return s_registrationLock;
+ }
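The cancellation side of the proposal isn’t shown above. As a rough, self-contained illustration of the unblocking mechanism only (using a public `TaskCompletionSource` as a stand-in for the Task internals, and a sleep in place of a real starvation heuristic):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class WatchdogDemo
{
    static void Main()
    {
        // Stand-in for a registered blocked wait: a watchdog thread decides
        // the pool is starved and fails the task, which releases the .Wait
        // with an exception that carries a stack trace for diagnosis.
        var tcs = new TaskCompletionSource<int>();

        var watchdog = new Thread(() =>
        {
            Thread.Sleep(200); // placeholder for the starvation heuristic
            tcs.TrySetException(new TimeoutException("ThreadPool starvation"));
        });
        watchdog.IsBackground = true;
        watchdog.Start();

        try
        {
            tcs.Task.Wait(); // the blocked wait from the proposal
        }
        catch (AggregateException ex)
        {
            Console.WriteLine(ex.InnerException!.Message);
        }
    }
}
```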

/cc @stephentoub @davidfowl @jkotas @kouvel

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 15 (15 by maintainers)

Most upvoted comments

There is an ETW event for this: ThreadPoolWorkerThreadAdjustment/Adjustment with reason Starvation.

However, it can be hard to track down what is causing it; the idea was to put the Tasks into a failed/cancelled state and then unblock the wait, causing an exception to propagate and bubble up with a stack trace for further diagnosis.

https://docs.microsoft.com/en-us/archive/blogs/vancem/diagnosing-net-core-threadpool-starvation-with-perfview-why-my-service-is-not-saturating-all-cores-or-seems-to-stall
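The adjustment event can also be observed in-process with an `EventListener`, as a sketch. The provider name, event name, and `0x10000` threading keyword come from the CLR ETW manifest; the numeric `Reason` value used for starvation (6 here) is an assumption to double-check against the manifest for your runtime version:

```csharp
using System;
using System.Diagnostics.Tracing;

sealed class StarvationListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Microsoft-Windows-DotNETRuntime")
        {
            // 0x10000 is the Threading keyword, which covers the
            // ThreadPoolWorkerThreadAdjustment* events.
            EnableEvents(source, EventLevel.Informational, (EventKeywords)0x10000);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName == "ThreadPoolWorkerThreadAdjustmentAdjustment" &&
            e.PayloadNames is not null)
        {
            int i = e.PayloadNames.IndexOf("Reason");
            // Assumed mapping: 6 == Starvation in ThreadAdjustmentReasonMap.
            if (i >= 0 && Convert.ToInt32(e.Payload![i]) == 6)
                Console.WriteLine("ThreadPool starvation detected");
        }
    }
}

class Program
{
    static void Main()
    {
        using var listener = new StarvationListener();
        Console.WriteLine("listener installed");
    }
}
```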

The trouble with ThreadPool starvation is that it kills everything; killing blocking Tasks would at least allow timers and non-blocking portions of the app to continue running rather than everything becoming unresponsive.

There probably are aspects of this that could be pursued; for instance, the thread pool could track different types of work items, notice which ones tend to block, and avoid scheduling all threads to such work items. The Windows thread pool does something like that, but differentiating work items is left to the user or formalized through specific usages in APIs, which probably wouldn’t be sufficient here. There are, I think, cases where even an ideal thread pool could be coerced into the same worst-case behaviour you mention. I don’t think the problem there is with the thread pool. Timers I understand: their callbacks should really run at higher priority than other regular queue-order work items. For other general work items, though, there is no guarantee, and that is the current contract with using the thread pool. I’m sure there is more information that could be obtained to do better in such cases (with new APIs), or perhaps alternate solutions, or even opportunities to expose workarounds that would not fail so miserably so quickly.

Tasks would start getting cancelled seemingly at random once the ThreadPool becomes starved, which might be unexpected behaviour; but that may be better than entering an unrecoverable state?

If you end up in this state, it signals a problem in the construction of your app, and there’s a really good chance that you’re just going to end up back here after freeing up some threads. Code often ends up here in the first place because the pool starves, so it injects another thread, and the work item that thread picks up itself spawns another sync-over-async operation and blocks waiting for it. Killing some waits by canceling them is likely to lead to the exact same result: the threads freed up by canceling those waits will just go pick up some other work item that’s likely to end right back in the same situation.

On top of that, the code in question obviously wasn’t stress/scale tested to be robust under this workload… there’s a really good chance it also wasn’t tested for cancellation exceptions emerging spuriously from operations that were never previously cancelable. I’ve seen multiple cases where the presence of an exception where there wasn’t one before (or where code simply wasn’t expecting it) turned into an infinite loop or something else that’s just going to cause problems in a different way.

And even if it did help alleviate some symptoms, it could make things worse by masking the actual problem.

I’m personally skeptical this is something that should be pursued.