SqlClient: Possible lock contention/thread pool starvation when acquiring access token causing timeout exceptions

Describe the bug

We are experiencing connection pool problems during high load and when the Azure Access Token is expired. The following exception will be thrown:

Exception message: System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
Stack trace:

System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
   at Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides)
   at ... 

This exception will be thrown for about 20 minutes and after that duration our operations of our component will proceed normal as expected. It seems that exception occur during startup of our application or when the Azure Access Token is expired.

To reproduce

I’ve included a Visual Studio solution with a reproduction of the issue with similar exceptions.

HighLoadAndAzureActiveDirectory.zip

Expected behavior

We expect that when getting a SQL Connection, the SQL Connection Pool shouldn’t throw InvalidOperationException during high load or when the access token is expired.

Further technical details

Our system is a .NET 6 & CoreWCF application hosted in Service Fabric. We are using Managed Identity to connect to an Azure SQL database. We should handle more than 100 concurrenct SOAP requests during normal operations (and all the concurrent SOAP requests must connect to our Azure SQL Database with the Managed Identity). We don’t have special configurations for Pool Size; just default values.

We are using the following libraries:

  • CoreWCF.Http - 1.3.2
  • Microsoft.Data.SqlClient - 5.1.1
  • Microsoft.Identity.Client - 4.56.0

Additional context We are seeing lock contentions/thread pool starvation when acquiring an access token for a SQL Connection. I’ve provided a reproduction project with similar effect when multiple tasks are opening a SQL connection to Azure SQL Database with a Azure Active Directory.

Following are printscreens of our debugging session:

parallel stacks - TryGetConnection

parallel stacks - GetFedAuthToken

It seems there are two threads are blocked in GetFedAuthToken, while 49 threads trying to get a connection from DbConnectionPool (TryGetConnection),

Browsing in the Microsoft.Data.SqlClient github repository, we noticed the following code:

                                _fedAuthToken = Task.Run(async () => await authProvider.AcquireTokenAsync(authParamsBuilder)).GetAwaiter().GetResult().ToSqlFedAuthToken();

https://github.com/dotnet/SqlClient/blob/7b2e2fc9f002e8b270894595f5f92670670053d7/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlInternalConnectionTds.cs#L2416C50-L2416C50

Could this code create a lock contention/thread pool starvation? We’ve noticed before this construction can cause lock contention/thread pool starvation. In some places this construction is used to go from async to sync: https://github.com/aspnet/AspNetIdentity/blob/main/src/Microsoft.AspNet.Identity.Core/AsyncHelper.cs

PS: I’m not sure if I’m using the correct terminlogy lock contention vs thread pool starvation.

About this issue

  • Original URL
  • State: open
  • Created 10 months ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

Adding some investigation results:

I tried the Single Threaded Context path and that got me through the async tasks within MDS itself. But as soon as execution goes into MSAL or Azure.Identity, their async tasks are all created with ConfigureAwait(false) (standard/best practice for libraries), which moves those tasks off my Single Threaded Context and subsequently hangs with the same problem - no available threads in the system thread pool to service their tasks. So seems we might be stuck. We need sync-only APIs in MSAL and Azure Identity in order to avoid this issue. 😞

Hello, I am a colleague of Ka-Wai. We looked at the issue together last week.

I believe there’s a conceptual design flaw in SqlnternalConnectionTds calling authProvider.AcquireTokenAsync. And actually trying to go from async to sync. There are many ways to do so. But none of them are ideal.

The problem we’re facing is a burst scenario where we have many incoming connections as soon as the applications starts. Debugging this scenario lead us to SqlConnection.Open which keeps worker pool threads locked until one them is able to get the authentication token. Problem here is that a Task is scheduled to fetch the authentication token. And as all of the threads are busy waiting, the task scheduler can not delegate the task to one of the worker pool threads. The task can successfully run when the thread pool ramps up more threads. However, there’s a delay involved. In our scenario (burst scenario and max. pool size = 100), it can take up to 20 minutes until the authentication is finally succesfully retrieved. Ideally, SqlConnection.Open would not have have asynchronous or task based code down the callstack.

Restricting the max. pool size to a lower value may be a workaround for this issue. However, this may cause other issues in the application.

Setting the ThreadPool.MinThreads might be a workaround. But this can also impact application performance.

SqlConnection.OpenAsync doesn’t seem to suffer from this specific thread locking issue. Threads aren’t locked. Tasks are waiting until one the first SqlConnection is able to get the authentication token. So burst scenario’s work like a charm.

#2105 and #2149 may be related to this.

I believe there’s a conceptual design flaw in SqlnternalConnectionTds calling authProvider.AcquireTokenAsync. And actually trying to go from async to sync. There are many ways to do so. But none of them are ideal.

You are correct here. However, MSAL doesn’t offer any sync APIs and itself only uses async HTTP client APIs. So, unfortunately, SqlClient is eventually limited to calling an async API from within the sync Connection.Open() flow.

I was looking for better ways of handling the unfortunate pattern (I’ve read the common “solutions”, which all have at least one pitfall) and thinking about using a single thread synchronization context that would dedicate a thread to handle all the async token acquisition tasks to keep them off the system thread pool. I got the idea looking at how NpgSql was handling a similar situation. https://github.com/npgsql/npgsql/blob/v8.0.0-preview.4/src/Npgsql/SingleThreadSynchronizationContext.cs