SqlClient: Very bad ADO.NET performance on Linux

First-time reporter, so please let me know if I am missing any important information.

We are investigating very bad performance characteristics on Linux. It seems to have something to do with SqlConnection handling.

  • .NET 5
  • Azure App Service
  • Same code runs on Windows very well (same configurations, even with more traffic)
  • Affects both databases our app connects to
  • Database usage levels are both very low (<10%). We don’t see any particularly slow queries. In one memory dump, we see about 950 threads stuck in the same stack trace:
00007F5EC7F2A838 00007f65d61873f9 [HelperMethodFrame: 00007f5ec7f2a838] System.Threading.WaitHandle.WaitMultipleIgnoringSyncContext(IntPtr*, Int32, Boolean, Int32)
00007F5EC7F2A980 00007f65667dfa86 System.Threading.WaitHandle.WaitMultiple(System.ReadOnlySpan`1<System.Threading.WaitHandle>, Boolean, Int32)
00007F5EC7F2AA00 00007f6562ee7b96 Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(System.Data.Common.DbConnection, UInt32, Boolean, Boolean, Microsoft.Data.Common.DbConnectionOptions, Microsoft.Data.ProviderBase.DbConnectionInternal ByRef) [/_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/ProviderBase/DbConnectionPool.cs @ 1172]
00007F5EC7F2AA80 00007f65667df0d7 Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(System.Data.Common.DbConnection, System.Threading.Tasks.TaskCompletionSource`1<Microsoft.Data.ProviderBase.DbConnectionInternal>, Microsoft.Data.Common.DbConnectionOptions, Microsoft.Data.ProviderBase.DbConnectionInternal ByRef) [/_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/ProviderBase/DbConnectionPool.cs @ 1142]
00007F5EC7F2AAE0 00007f6562ee553b Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(System.Data.Common.DbConnection, System.Threading.Tasks.TaskCompletionSource`1<Microsoft.Data.ProviderBase.DbConnectionInternal>, Microsoft.Data.Common.DbConnectionOptions, Microsoft.Data.ProviderBase.DbConnectionInternal, Microsoft.Data.ProviderBase.DbConnectionInternal ByRef) [/_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/ProviderBase/DbConnectionFactory.cs @ 121]
00007F5EC7F2AB70 00007f65667dea2b Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(System.Data.Common.DbConnection, Microsoft.Data.ProviderBase.DbConnectionFactory, System.Threading.Tasks.TaskCompletionSource`1<Microsoft.Data.ProviderBase.DbConnectionInternal>, Microsoft.Data.Common.DbConnectionOptions) [/_/src/Microsoft.Data.SqlClient/netcore/src/Common/src/Microsoft/Data/ProviderBase/DbConnectionInternal.cs @ 347]
00007F5EC7F2ABC0 00007f65667de80c Microsoft.Data.SqlClient.SqlConnection.TryOpen(System.Threading.Tasks.TaskCompletionSource`1<Microsoft.Data.ProviderBase.DbConnectionInternal>, Microsoft.Data.SqlClient.SqlConnectionOverrides) [/_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs @ 1721]
00007F5EC7F2AC00 00007f65667ddbe7 Microsoft.Data.SqlClient.SqlConnection.Open(Microsoft.Data.SqlClient.SqlConnectionOverrides) [/_/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs @ 1211]
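
For reference, this is the kind of call pattern that ends up parked on that stack: a synchronous SqlConnection.Open() blocks the calling thread pool thread inside DbConnectionPool.TryGetConnection until a pooled connection becomes available. A minimal sketch (the connection string, query, and method name are placeholders, not taken from our app):

  using Microsoft.Data.SqlClient;

  // Hypothetical handler: each call blocks a thread pool thread in
  // DbConnectionPool.TryGetConnection while it waits for a free pooled
  // connection, which is what the dump above shows ~950 times over.
  static int GetOrderCount(string connectionString)
  {
      using var connection = new SqlConnection(connectionString);
      connection.Open(); // synchronous open: waits for the pool

      using var command = new SqlCommand("SELECT COUNT(*) FROM dbo.Orders", connection);
      return (int)command.ExecuteScalar();
  }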

Any idea if this might already be known? What should our next steps be to investigate?

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 26 (14 by maintainers)

Most upvoted comments

Thanks @Wraith2 - would it be a fair approximate summary to say that SqlClient is synchronously blocking TP threads (via locking/sleeping/IO/whatever)? Is this a MARS-only phenomenon?

Any async high-throughput scenario with managed SNI will likely exhibit the same problem; there’s at least one other open issue, not related to MARS, where opening large numbers of connections causes the same thing. I don’t know exactly where the problem is, but I now recognize it as worker thread starvation well enough to suggest trying the thread pool configuration, and in this case it worked.
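
One way to confirm worker thread starvation (as opposed to a slow database) is to log the thread pool’s live counters alongside request latency; on .NET 5 the relevant properties are available. A rough sketch:

  using System;
  using System.Threading;

  // Periodically log thread pool pressure. If ThreadCount keeps climbing
  // and PendingWorkItemCount stays high while CPU and DB usage are low,
  // work items (including SqlClient's managed SNI callbacks) are queuing
  // behind blocked worker threads.
  var timer = new Timer(_ =>
  {
      Console.WriteLine(
          $"workers={ThreadPool.ThreadCount} " +
          $"pending={ThreadPool.PendingWorkItemCount} " +
          $"completed={ThreadPool.CompletedWorkItemCount} " +
          $"cpus={Environment.ProcessorCount}");
  }, null, TimeSpan.Zero, TimeSpan.FromSeconds(5));

The same numbers are exposed by dotnet-counters under the System.Runtime provider if changing code isn’t an option.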

I know the MS team are aware of this problem.

If, for some reason, it hits the thread pool limit, the instance will fail and never recover. For the traffic we are testing, I can handle it with ~10 Windows instances; there were some spikes due to GC, but they recovered quickly. On Linux, if an instance goes bad, it stays bad “forever” (until it’s taken out of the pool and restarted).

Also, as we serve a wide range of traffic on different sites, fixing the pool size is something we’d like to avoid (no one size fits all).

Setting the min thread count definitely helped us (and we are grateful 😃), but we also see it as a workaround rather than an actual fix 😃

No, I mean ThreadPool. The native and managed implementations of the SNI layer have different ways of dealing with waits, and the managed one relies on the thread pool for callbacks; if you saturate the thread pool workers, things stop working. If that’s the case here, then upping the minimum available threads alleviates it by starting the thread pool much higher. The default is 10; I’d suggest bumping it up to 100.
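
Concretely, the mitigation described above can be applied once at startup; a minimal sketch (100 is only the number suggested in the comment, not a universally correct value):

  using System;
  using System.Threading;

  // Raise the minimum number of thread pool worker threads so the pool
  // doesn't have to ramp up slowly while SqlClient's managed SNI
  // callbacks and blocked Open() calls compete for workers.
  ThreadPool.GetMinThreads(out int minWorker, out int minIo);
  ThreadPool.SetMinThreads(Math.Max(minWorker, 100), minIo);

If changing code isn’t convenient, the same value can also be set via the System.Threading.ThreadPool.MinThreads key in runtimeconfig.json.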