SqlClient: Queries with MultipleActiveResultSets=True (MARS) are very slow / time out on Linux

Describe the bug

TL;DR:
Queries over connections with MARS enabled are much slower on Linux, or even time out, even when they don't actually use MARS. The same queries are fast and reliable on Windows regardless of whether MARS is enabled, and on Linux when MARS is disabled.

Context: Octopus Cloud hosts Octopus Deploy instances in Linux containers on Azure AKS, with data stored in Azure Files and Azure SQL. A couple of months ago we noticed that some of the SQL queries were much slower or even started timing out, which is not something we've experienced before on Windows with the full .NET Framework. Some of the slowdown might be caused by AKS (K8s), but we think SqlClient might also be playing a role here. 119112824000676 is our Azure Support Request number, if that helps in any way.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso)
   at reprocli.Program.Scenario4(String connString, Int32 number)
   at reprocli.Program.<>c__DisplayClass0_0.<Main>b__0(Int32 n)
   at System.Linq.Parallel.ForAllOperator`1.ForAllEnumerator`1.MoveNext(TInput& currentElement, Int32& currentKey)
   at System.Linq.Parallel.ForAllSpoolingTask`2.SpoolingWork()
   at System.Linq.Parallel.SpoolingTaskBase.Work()
   at System.Linq.Parallel.QueryTask.BaseWork(Object unused)
   at System.Linq.Parallel.QueryTask.<>c.<.cctor>b__10_0(Object o)
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__274_0(Object obj)
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
ClientConnectionId:005d2aae-9409-4711-aaa0-b03b70f2832e
Error Number:-2,State:0,Class:11
ClientConnectionId before routing:e3300799-fdd0-40a4-84ea-b9f383596b12
Routing Destination:fed2c41af7dc.tr5.westus2-a.worker.database.windows.net,11063<---

We also captured TCP dumps while running the tests on Linux, and it looks like enabling MARS causes TCP RSTs.

[screenshots: TCP dump captures showing the RST packets]

Full TCP Dumps: https://github.com/benPearce1/k8s-sql-timeout-repro/tree/tiny/source/reprocli/tcpdumps

To reproduce

Code

Repo with the sample app: https://github.com/benPearce1/k8s-sql-timeout-repro/blob/tiny/source/reprocli/Program.cs. The Compiled folder contains pre-built versions of the app so the .NET Core SDK doesn't have to be present on the target VMs.

The first parameter is the level of parallelism. The second parameter is the connection string.

using System;
using System.Data;
using System.Diagnostics;
using System.Linq;
using Microsoft.Data.SqlClient;

namespace reprocli
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                var count = int.Parse(args[0]);
                var connectionString = args[1];

                var total = Stopwatch.StartNew();

                PrepareData(connectionString);
                total.Restart();
                Enumerable.Range(0,count)
                    .AsParallel()
                    .WithDegreeOfParallelism(count)
                    .ForAll(n => Scenario4(connectionString, n));

                Console.WriteLine($"Total: {total.Elapsed}");

            }
            catch (Exception e)
            {
                Console.WriteLine(e);
                throw;
            }
        }

        private static void Scenario4(string connString, int number)
        {
            var userStopWatch = Stopwatch.StartNew();

            var buffer = new object[100];
            for (var i = 0; i < 210; i++)
            {
                var queryStopWatch = Stopwatch.StartNew();


                using (var connection = new SqlConnection(connString))
                {
                    connection.Open();
                    using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                    {
                        using (var command = new SqlCommand("SELECT * From TestTable", connection, transaction))
                        {
                            using (var reader = command.ExecuteReader())
                            {
                                while (reader.Read())
                                {
                                    reader.GetValues(buffer);
                                }
                            }
                        }

                        transaction.Commit();
                    }
                }

                queryStopWatch.Stop();
                Console.WriteLine($"Number: {number}. Query: {i} Time: {queryStopWatch.Elapsed}");
            }

            userStopWatch.Stop();
            Console.WriteLine($"Number: {number}. All Queries. Time: {userStopWatch.Elapsed}");
        }

        static void PrepareData(string connectionString)
        {
            var createTable = @"
                DROP TABLE IF EXISTS TestTable;
                CREATE TABLE TestTable
                (
                    [Id] [nvarchar](50) NOT NULL PRIMARY KEY,
                    [Name] [nvarchar](20) NOT NULL
                );";

            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                {
                    using (var command = new SqlCommand(createTable, connection, transaction))
                    {
                        command.ExecuteNonQuery();
                    }

                    transaction.Commit();
                }
            }

        }
    }
}

This is how we reproduced the problem; you don't necessarily need this exact config.

The database was hosted in an Azure SQL Elastic Pool (Standard: 300 eDTUs) on a SQL Server in West US 2 region.

LINUX

Run the sample app with the following arguments on a Linux (Ubuntu 18.04) VM (Standard D8s v3, 8 vCPUs, 32 GiB memory) in the Azure West US 2 region.

MARS ON

dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=True;'

The expected result is that the app finishes without throwing any errors, but that's not the case: instead, Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. is thrown.

Reducing the level of parallelism to 20 stops the app from crashing.

Also, when MARS is ON the console doesn’t show any progress for 10+ seconds. This is not the case when MARS is OFF.

MARS OFF

dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=False;'

The expected result is that the app finishes without throwing any errors, which is the case. The app finished in just under 25 seconds: Total: 00:00:24.9737616. The app also worked with much higher levels of parallelism (e.g. 500).

AKS

Same spec as above: Linux (Ubuntu 18.04) VM (Standard D8s v3, 8 vCPUs, 32 GiB memory) in Azure West US 2. We also ran this test in a container in AKS and the results were pretty much the same. The only difference was that we had to lower the level of parallelism even further. K8s networking adds a bit of overhead, which might make the problem more pronounced.

WINDOWS

Run the sample app with the following arguments on a Windows (Windows Server 2016 Datacenter) VM (Standard D8s v3, 8 vCPUs, 32 GiB memory) in the Azure West US 2 region.

dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=True;'

The expected result is that the app finishes without throwing an exception, which is the case. The app finished in just under 24 seconds: Total: 00:00:23.4068641. It also worked with the level of parallelism set to 500. We achieved similar results with MARS disabled.

Note: We used .NET Core to run the tests on Windows.

Expected behavior

The sample app should not crash, and connections with MARS enabled should behave the same way on both Linux and Windows.

Further technical details

Microsoft.Data.SqlClient version: 1.1.0 and 2.0.0-preview1.20021.1
.NET target: Core 2.2 and Core 3.1
SQL Server version: Azure SQL
Operating system: Ubuntu 18.04 and AKS with Ubuntu 18.04

Additional context: We've been battling this issue for a long time now, so we are happy to help in any way we can to get it resolved.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 40
  • Comments: 105 (44 by maintainers)

Most upvoted comments

Good that this GitHub issue exists (thanks!); we seem to have run into the same issue. The problem only appears when running the (ASP.NET Core + EF Core 3.1.2) app on Docker with Kubernetes with MARS on. Our background service handling lots of data would simply “die”, sometimes with and sometimes without any exception thrown. As it is a BackgroundService/IHostedService, the web app continues to run; just the BackgroundService is gone.

I turned MARS off and now it works.

I got two kinds of exceptions; this is the one with the default settings of DbContext:

An exception occurred while iterating over the results of a query for context type '***'.
Microsoft.Data.SqlClient.SqlException (0x80131904): The connection is broken and recovery is not possible.  The connection is marked by the server as unrecoverable.  No attempt was made to restore the connection.
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__164_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(DbContext _, Boolean result, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

When setting the command timeout to five minutes, I got this exception - same as the opener of this issue.

An exception occurred while iterating over the results of a query for context type '***'.
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TrySetMetaData(_SqlMetaDataSet metaData, Boolean moreInfo)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at Microsoft.Data.SqlClient.SqlDataReader.get_MetaData()
   at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method)
   at Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReader(RelationalCommandParameterObject parameterObject)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.Enumerator.InitializeReader(DbContext _, Boolean result)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.Enumerator.MoveNext()
ClientConnectionId:3d813f87-29be-4a5a-9e6d-faff5d0e0a5f
Error Number:-2,State:0,Class:11

This issue cost many working days of diagnosis, as there is no clear indication of what is wrong, which hinders troubleshooting.

This bit us big time. Setting MultipleActiveResultSets=true caused lots of timeouts when running a .NET Core app on a Linux pod on K8s. Removing it from the connection string made the app very fast and responsive, and the “Connection Timeout Expired” errors are all gone.

Interesting. Perhaps all those extra eyes can help identify a solution. What I think we need is a cross-platform way to do non-blocking async I/O which doesn’t rely on the threadpool. On Windows that’s IOCP, but dropping to platform specifics in this implementation is highly undesirable. The goal of the managed SNI is to have implementation parity for all supported OSes so behaviour and perf are identical at this layer.

@Wraith2

could you give my solution a try on your real Linux build and see what happens?

sure, I will test it today.

Like you, I really don’t understand why we would see a platform difference when forcing the managed code, but we need to identify it and stop it happening given the cost in support time.

I am a bit confused by the code behavior and totally agree with you; we need to fix this issue and put reliability back into the managed SNI.

The replication works on Windows with forced managed mode. By “works” I mean that I get a connection failure pretty quickly (should be 10 seconds) after I start the program. If you confirm that’s what’s expected then the repro is good.

Initial impression is that it’s simple threadpool starvation. If you add in:

ThreadPool.GetMaxThreads(out var workerThreads, out var completionThreads);
ThreadPool.SetMinThreads(count, completionThreads);

before you start the stopwatch you should find that it runs through successfully.
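
For clarity, here is roughly where those two lines would slot into the sample program's Main (a sketch only, not a production recommendation, and it needs using System.Threading;):

static void Main(string[] args)
{
    var count = int.Parse(args[0]);
    var connectionString = args[1];

    // Pre-grow the worker pool to the level of parallelism so the MARS receive
    // callbacks are not starved while all the workers block synchronously.
    ThreadPool.GetMaxThreads(out var workerThreads, out var completionThreads);
    ThreadPool.SetMinThreads(count, completionThreads);

    var total = Stopwatch.StartNew();
    PrepareData(connectionString);
    total.Restart();
    Enumerable.Range(0, count)
        .AsParallel()
        .WithDegreeOfParallelism(count)
        .ForAll(n => Scenario4(connectionString, n));
    Console.WriteLine($"Total: {total.Elapsed}");
}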

So why does that work? You’re using Parallel, which is attempting to start count tasks and then have as many as the specified maxdop running at once. It does this by creating tasks and then queuing them to the threadpool. When you start those tasks they all try to connect to the database synchronously, which blocks, but it blocks in a particular way in the managed implementation: it ends up at https://github.com/dotnet/SqlClient/blob/bac3ab4494c2a5874e56f6d8e0b27265e04f9ca7/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNIMarsHandle.cs#L391 which sits in a loop waiting for a ManualResetEventSlim to be signalled telling it that a packet has arrived. The problem is that the thing that has to set that event needs to run somewhere, and if you’ve consumed all the threadpool threads with synchronous waits there’s no thread for it to run on, so the wait times out, that causes the receive to fail, the connection isn’t established, and all this inability to make progress eventually manifests as a connection or query timeout.

It’s be nice if we weren’t going sync over async in the library so this problem couldn’t occur inside the library. Changing that is challenging and short to mid term impractical, it’s pretty much a core rewrite of the non-windows implementation with all the reliability and performance concerns that it would come with. So don’t expect that to happen soon.

So why is this your problem? Well, you’re the one using all the threadpool threads. The threadpool starts with a minimum of 10 ready threads in it, and at various times tasks are queued to those threads and then completed. Unless you hit a period of sustained load on the pool it’s unlikely to need to grow past the minimum value, so most of the time you’ll have a small number of threads. The threadpool is balanced for general workloads and a lot of research and tuning has gone into it. One of the behaviours that result from this tuning is that it doesn’t immediately add large numbers of threads when there are more tasks waiting to be processed than there are threads available to do them. The threadpool usually only adds one thread a second to the pool. If you load it with 100 tasks quickly you’ll get 10 threads, then 11, then 12, etc., climbing slowly.
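
If you want to watch that slow ramp while the repro runs, a small watcher thread can log the pool size once a second (a hypothetical diagnostic, not part of the original repro; ThreadPool.ThreadCount needs .NET Core 3.0 or later):

// Hypothetical diagnostic: log threadpool size once a second while the repro runs.
var monitor = new Thread(() =>
{
    while (true)
    {
        ThreadPool.GetAvailableThreads(out var freeWorkers, out var freeIo);
        Console.WriteLine($"{DateTime.Now:HH:mm:ss} pool threads: {ThreadPool.ThreadCount}, free workers: {freeWorkers}");
        Thread.Sleep(1000);
    }
})
{ IsBackground = true };
monitor.Start();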

So how do you fix it? Don’t do this. Don’t try to use large numbers of simultaneous synchronous connections like this. Unless you’re talking to a different server or database on each connection, it’s going to be more efficient to use a pipeline approach, feeding work to an open connection and having it process multiple commands on a single connection. Or switch to using async if possible, because the awaited suspension will free up threadpool threads, allowing them to be used for callback signalling.
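
To illustrate the async suggestion, here is a rough async rewrite of the repro's query loop (a sketch only, under the same assumptions as the sample above: same TestTable and connection string; BeginTransaction is left synchronous for simplicity):

// Sketch: an async variant of Scenario4 so that waits release threadpool threads
// instead of blocking them (requires using System.Threading.Tasks;).
private static async Task Scenario4Async(string connString, int number)
{
    var buffer = new object[100];
    for (var i = 0; i < 210; i++)
    {
        using (var connection = new SqlConnection(connString))
        {
            await connection.OpenAsync();
            using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
            {
                using (var command = new SqlCommand("SELECT * From TestTable", connection, transaction))
                using (var reader = await command.ExecuteReaderAsync())
                {
                    while (await reader.ReadAsync())
                    {
                        reader.GetValues(buffer);
                    }
                }

                transaction.Commit();
            }
        }
    }
    Console.WriteLine($"Number: {number}. All queries completed.");
}

The caller would then use something like await Task.WhenAll(Enumerable.Range(0, count).Select(n => Scenario4Async(connectionString, n))); from an async Main, instead of the PLINQ ForAll.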

Looks like this issue has hit The Register 😃 https://www.theregister.com/2020/12/11/buggy_behavior_sqlclient/

I tested on a Unix VM and got the same result as yours @pawelpabich. I am not sure why managed SNI works on Windows but fails on Unix. I am going to investigate 2 options:

  1. Classes dedicated to Unix in the driver (to see if any is involved)
  2. Investigate threading in Unix and Windows (to see if a dedicated class for Unix is needed in this case)

Thanks again for your patience.

I am trying to change the logic to use SpinWait class and SpinCount parameter inside the driver.

As I understand it, what is happening here is that SNIMarsHandle.Receive is being called and either pulls a packet out of the _receivedPacketQueue if one is already available, or, if there isn’t a packet ready, hits the wait that you highlight above. At that point we really are waiting for data, but we’re doing sync-over-async so we can’t release the current thread until we either have a packet or have failed by timing out. Whether you block or spin at this point, it’s still got to be a sync behaviour.

Can you identify why the packet isn’t being received and setting _packetEvent in the MARS connection’s HandleReceiveComplete? Tracing back through the jungle that is the managed SNI, I end up at SNIPacket.ReadFromStreamAsync, which contains modern async code:

        public void ReadFromStreamAsync(Stream stream, SNIAsyncCallback callback)
        {
            stream.ReadAsync(_data, 0, _dataCapacity, CancellationToken.None)
                .ContinueWith(
                    continuationAction: _readCallback,
                    state: callback,
                    CancellationToken.None,
                    TaskContinuationOptions.DenyChildAttach,
                    TaskScheduler.Default
                );
        }

and that is going to use a threadpool thread to run the callback: no threadpool thread available, no packet receipt. It might be worth looking at a custom task scheduler which handles this and gives it a dedicated thread so threadpool exhaustion can’t affect it. Let me know what you think.
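
For reference, a minimal sketch of the kind of dedicated-thread scheduler being suggested (illustrative only, not the driver's code; the idea would be to queue the continuation in ReadFromStreamAsync to it instead of TaskScheduler.Default so packet completions can't be starved by threadpool exhaustion):

// Illustrative sketch only: a TaskScheduler backed by one dedicated thread, so work
// queued to it never competes with the shared threadpool.
// (Needs System.Collections.Concurrent, System.Collections.Generic, System.Threading, System.Threading.Tasks.)
internal sealed class DedicatedThreadTaskScheduler : TaskScheduler, IDisposable
{
    private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();
    private readonly Thread _thread;

    public DedicatedThreadTaskScheduler()
    {
        _thread = new Thread(() =>
        {
            foreach (var task in _queue.GetConsumingEnumerable())
            {
                TryExecuteTask(task);
            }
        })
        { IsBackground = true, Name = "SNI packet continuations" };
        _thread.Start();
    }

    protected override void QueueTask(Task task) => _queue.Add(task);

    // Never inline: completions must always run on the dedicated thread.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;

    protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

    public void Dispose() => _queue.CompleteAdding();
}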

Also, looking at that, we’re not observing the read task, which we should be.

Here is an update on the issue. As was mentioned before, the issue is about using all available threads while using the managed SNI on Windows or Linux. It took me a while to understand the pattern and the connections between the different classes.

When a packet is received, there is a part of the code that makes _packetEvent (an instance of ManualResetEventSlim) wait for the connection timeout period. By default the connection timeout is 15 seconds. If you run the application you will see a delay of 15-17 seconds and then errors start to appear. If you increase the ConnectTimeout then the delay will be increased. This all goes back to the implementation of SNIMarsHandle inside the driver. The part that causes the delay is inside public override uint Receive(out SNIPacket packet, int timeoutInMilliseconds), which gets called, at some point in the hierarchy, by the syncOverAsync method. It forces the threads to wait via:

if (!_packetEvent.Wait(timeoutInMilliseconds))
{
    SNILoadHandle.SingletonInstance.LastError = new SNIError(SNIProviders.SMUX_PROV, 0, SNICommon.ConnTimeoutError, string.Empty);
    return TdsEnums.SNI_WAIT_TIMEOUT;
}

I am trying to change the logic to use SpinWait class and SpinCount parameter inside the driver.

I will update here hopefully by end of this week.

@junchen2000 Hey, man, are you sure you’d like to share a connection string containing a password (even if it is local and for testing)?

Hi Team,

We have multiple users reporting the same issue. The issue only happens on Linux. The issue does not happen on Windows.

To narrow down the issue, we have simplified the reproducing code to the following:

using System;
using System.ComponentModel.DataAnnotations;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using Microsoft.Data.SqlClient;
namespace reprocli
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                var count = 40;
                var connectionString = "Server=tcp:<myAzureSQLName>.database.windows.net,1433;Initial Catalog=xxx;Persist Security Info=False;User ID=xxxxxxxxx;Password=xxxxxxx;MultipleActiveResultSets=TRUE;Connect Timeout=0;Max Pool Size=20;Packet Size=12000";
                var total = Stopwatch.StartNew();
                total.Restart();
                Enumerable.Range(0, count)
                    .AsParallel()
                    .WithDegreeOfParallelism(count)
                    .ForAll(n => Scenario4(connectionString, n));
                Console.WriteLine($"Total: {total.Elapsed}");
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
                throw;
            }
        }
        private static void Scenario4(string connString, int number)
        {
            var userStopWatch = Stopwatch.StartNew();
            var buffer = new object[100];
            {
                using (var connection = new SqlConnection(connString))
                {
                    var StartTimestamp = DateTime.Now;
                    connection.Open();
                    var FinishTimestamp = DateTime.Now;
                    TimeSpan span = FinishTimestamp.Subtract(StartTimestamp);
                    Console.WriteLine("Task " + number.ToString("00") + ": Connection.Open started at " + StartTimestamp.ToString() + ";ended at " + FinishTimestamp.ToString() + " - Elapsed " + span.TotalMilliseconds);
                    StartTimestamp = DateTime.Now;
                    using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                    {
                        FinishTimestamp = DateTime.Now;
                        span = FinishTimestamp.Subtract(StartTimestamp);
                        Console.WriteLine("Task " + number.ToString("00") + ": BeginTransaction started at " + StartTimestamp.ToString() + ";ended at " + FinishTimestamp.ToString() + " - Elapsed " + span.TotalMilliseconds);
                        transaction.Commit();
                    }
                    connection.Close();
                }
            }
            userStopWatch.Stop();
            Console.WriteLine("Task " + number.ToString("00") + $": Total - Elapsed: {userStopWatch.Elapsed}");
        }
    }
}

Note in the above code:

  1. We have spawned 40 threads in parallel. With only 40 threads, thread pool starvation should not be a problem. (We can only see 51 threads when ManualResetEventSlim is being signaled)

  2. Each thread only connects to Azure SQL Database once. In that connection, it only starts a transaction and commits it. There are no queries.

  3. We have modified the connection string to include several properties:

a. Connect Timeout=0: the execution timeout error can be avoided by setting the connect timeout to 0. Without hitting the timeout, we can observe the slow behavior more clearly.

b. Max Pool Size=20: we set the pool size to half the number of parallel threads so that we can see how the connection pool takes effect here.

c. Packet Size=12000: we increased the packet size, but it does not make any difference.

  4. We have logged the timestamps and durations of Connection.Open and BeginTransaction for each thread/task.

Test result for Windows

Total execution took only 5.8 seconds. Each thread running in parallel took no more than 3 seconds. Most of the time was spent on Connection.Open:

Task 03: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:57 PM - Elapsed 1252.0322
Task 03: BeginTransaction started at 4/27/2020 5:26:57 PM;ended at 4/27/2020 5:26:57 PM - Elapsed 99.9115
Task 07: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:57 PM - Elapsed 1454.2317
Task 03: Total - Elapsed: 00:00:02.0012941
Task 07: BeginTransaction started at 4/27/2020 5:26:57 PM;ended at 4/27/2020 5:26:57 PM - Elapsed 94.4907
Task 01: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 1641.8124
Task 07: Total - Elapsed: 00:00:02.1885435
Task 01: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.4182
Task 01: Total - Elapsed: 00:00:02.3745601
Task 00: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 1827.7951
Task 00: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.6793
Task 00: Total - Elapsed: 00:00:02.5601853
Task 11: Connection.Open started at 4/27/2020 5:26:57 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 558.5962
Task 08: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2096.162
Task 11: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.5295
Task 08: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.8543
Task 11: Total - Elapsed: 00:00:00.7444362
Task 05: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2199.2868
Task 08: Total - Elapsed: 00:00:02.8291427
Task 06: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2282.408
Task 05: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.3598
Task 06: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.9552
Task 05: Total - Elapsed: 00:00:02.9313097
Task 04: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2384.6057
Task 06: Total - Elapsed: 00:00:03.0157004
Task 09: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2360.5249
Task 04: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.6181
Task 09: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 93.0764
Task 04: Total - Elapsed: 00:00:03.1168025
Task 10: Connection.Open started at 4/27/2020 5:26:57 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 1463.5762
Task 09: Total - Elapsed: 00:00:02.5470265
Task 17: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 372.9687
Task 10: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.4596
Task 17: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.8557
Task 10: Total - Elapsed: 00:00:01.6491569
Task 13: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 1113.3493
Task 17: Total - Elapsed: 00:00:00.5603184
Task 14: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 1014.9437
Task 13: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.3817
Task 14: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 93.0985
Task 02: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 2937.5539
Task 13: Total - Elapsed: 00:00:01.2987254
Task 16: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 741.6992
Task 02: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.2785
Task 14: Total - Elapsed: 00:00:01.2020219
Task 18: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 645.2825
Task 16: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.3808
Task 02: Total - Elapsed: 00:00:03.6693590
Task 19: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 653.5425
Task 18: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.8252
Task 16: Total - Elapsed: 00:00:00.9270470
Task 20: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 556.2838
Task 19: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.6183
Task 18: Total - Elapsed: 00:00:00.8327458
Task 12: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 1598.5975
Task 20: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.3875
Task 19: Total - Elapsed: 00:00:00.8397575
Task 21: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 701.6779
Task 12: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 93.0304
Task 20: Total - Elapsed: 00:00:00.7413675
Task 22: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 655.9256
Task 21: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.6944
Task 12: Total - Elapsed: 00:00:01.7852348
Task 23: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 648.3088
Task 22: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.4581
Task 21: Total - Elapsed: 00:00:00.8880851
Task 24: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 652.2877
Task 22: Total - Elapsed: 00:00:00.8415095
Task 25: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 555.9168
Task 23: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.791
Task 24: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.6832
Task 25: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.3799
Task 23: Total - Elapsed: 00:00:00.8348116
Task 26: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 559.4023
Task 24: Total - Elapsed: 00:00:00.8388235
Task 28: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 555.3253
Task 25: Total - Elapsed: 00:00:00.7411146
Task 27: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 559.6749
Task 26: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.9759
Task 28: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 93.3051
Task 27: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.7269
Task 26: Total - Elapsed: 00:00:00.7461889
Task 29: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 559.9274
Task 15: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 1764.509
Task 28: Total - Elapsed: 00:00:00.7427732
Task 31: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 557.6932
Task 15: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 91.3374
Task 27: Total - Elapsed: 00:00:00.7468952
Task 34: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 372.3959
Task 29: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.7142
Task 15: Total - Elapsed: 00:00:01.9484886
Task 35: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 371.0839
Task 31: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.5221
Task 29: Total - Elapsed: 00:00:00.7463724
Task 34: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.283
Task 32: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 559.6628
Task 35: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 91.6887
Task 31: Total - Elapsed: 00:00:00.7448745
Task 33: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 561.1435
Task 34: Total - Elapsed: 00:00:00.5597399
Task 36: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 449.6328
Task 32: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.8304
Task 35: Total - Elapsed: 00:00:00.5563786
Task 37: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 465.2997
Task 33: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.5839
Task 36: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.3554
Task 32: Total - Elapsed: 00:00:00.7463895
Task 38: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 467.4738
Task 37: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 91.6418
Task 33: Total - Elapsed: 00:00:00.7467624
Task 39: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 464.5236
Task 36: Total - Elapsed: 00:00:00.6352385
Task 38: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.8737
Task 37: Total - Elapsed: 00:00:00.6496484
Task 39: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.489
Task 38: Total - Elapsed: 00:00:00.6536980
Task 39: Total - Elapsed: 00:00:00.6576849
Task 30: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:01 PM - Elapsed 1316.457
Task 30: BeginTransaction started at 4/27/2020 5:27:01 PM;ended at 4/27/2020 5:27:01 PM - Elapsed 92.3981
Task 30: Total - Elapsed: 00:00:01.5020711
Total: 00:00:05.7179428

Test Result for Linux

Run the same on Linux. Total execution took 34 seconds. Connection.Open does not really execute in parallel. Once a BeginTransaction starts in one connection, it delays the other threads from starting Connection.Open. The more Connection.Open calls, the more delay. Once the connection pool was full, it was blocked for tens of seconds until one of the BeginTransaction calls finished. Then the later connections queued into the connection pool could run properly without any delays:

dragon@marlonvm:~/k8s-sql-timeout-repro/source/reprocli$ dotnet run
Task 04: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 481.6339
Task 00: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 510.3399
Task 02: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 556.4548
Task 01: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 591.7924
Task 03: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 617.9443
Task 05: Connection.Open started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 27.8998
Task 06: Connection.Open started at 4/27/2020 9:34:33 AM;ended at 4/27/2020 9:34:33 AM - Elapsed 30.827
Task 07: Connection.Open started at 4/27/2020 9:34:33 AM;ended at 4/27/2020 9:34:34 AM - Elapsed 27.4854
Task 08: Connection.Open started at 4/27/2020 9:34:34 AM;ended at 4/27/2020 9:34:35 AM - Elapsed 31.2146
Task 09: Connection.Open started at 4/27/2020 9:34:35 AM;ended at 4/27/2020 9:34:35 AM - Elapsed 30.8095
Task 10: Connection.Open started at 4/27/2020 9:34:36 AM;ended at 4/27/2020 9:34:36 AM - Elapsed 30.4743
Task 11: Connection.Open started at 4/27/2020 9:34:37 AM;ended at 4/27/2020 9:34:37 AM - Elapsed 30.617
Task 12: Connection.Open started at 4/27/2020 9:34:38 AM;ended at 4/27/2020 9:34:38 AM - Elapsed 29.1201
Task 13: Connection.Open started at 4/27/2020 9:34:39 AM;ended at 4/27/2020 9:34:39 AM - Elapsed 36.9007
Task 14: Connection.Open started at 4/27/2020 9:34:40 AM;ended at 4/27/2020 9:34:40 AM - Elapsed 30.8955
Task 15: Connection.Open started at 4/27/2020 9:34:41 AM;ended at 4/27/2020 9:34:41 AM - Elapsed 31.2848
Task 16: Connection.Open started at 4/27/2020 9:34:42 AM;ended at 4/27/2020 9:34:42 AM - Elapsed 29.2114
Task 17: Connection.Open started at 4/27/2020 9:34:43 AM;ended at 4/27/2020 9:34:43 AM - Elapsed 24.7978
Task 18: Connection.Open started at 4/27/2020 9:34:44 AM;ended at 4/27/2020 9:34:44 AM - Elapsed 32.4253
Task 19: Connection.Open started at 4/27/2020 9:34:45 AM;ended at 4/27/2020 9:34:45 AM - Elapsed 37.3471
Task 08: BeginTransaction started at 4/27/2020 9:34:35 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 30011.9233
Task 00: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32982.9174
Task 09: BeginTransaction started at 4/27/2020 9:34:35 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 29512.0317
Task 10: BeginTransaction started at 4/27/2020 9:34:36 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 28508.912
Task 02: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32936.5808
Task 11: BeginTransaction started at 4/27/2020 9:34:37 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 27462.8604
Task 04: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 33008.9028
Task 06: BeginTransaction started at 4/27/2020 9:34:33 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 31512.4423
Task 07: BeginTransaction started at 4/27/2020 9:34:34 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 31016.4188
Task 12: BeginTransaction started at 4/27/2020 9:34:38 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 26512.2184
Task 13: BeginTransaction started at 4/27/2020 9:34:39 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 25504.2651
Task 03: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32875.1887
Task 05: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32517.823
Task 01: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32901.4141
Task 14: BeginTransaction started at 4/27/2020 9:34:40 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 24504.4908
Task 15: BeginTransaction started at 4/27/2020 9:34:41 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 23473.666
Task 16: BeginTransaction started at 4/27/2020 9:34:42 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 22510.7361
Task 17: BeginTransaction started at 4/27/2020 9:34:43 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 21515.0665
Task 18: BeginTransaction started at 4/27/2020 9:34:44 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 20984.4067
Task 19: BeginTransaction started at 4/27/2020 9:34:45 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 19970.1126
Task 22: Connection.Open started at 4/27/2020 9:34:48 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 17040.9565
Task 24: Connection.Open started at 4/27/2020 9:34:49 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 15542.3034
Task 20: Connection.Open started at 4/27/2020 9:34:46 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 19045.6091
Task 13: Total - Elapsed: 00:00:25.5511950
Task 06: Total - Elapsed: 00:00:31.5553711
Task 08: Total - Elapsed: 00:00:30.0551484
Task 15: Total - Elapsed: 00:00:23.5144020
Task 10: Total - Elapsed: 00:00:28.5512858
Task 28: Connection.Open started at 4/27/2020 9:34:53 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 12040.2609
Task 25: Connection.Open started at 4/27/2020 9:34:50 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 14541.4863
Task 19: Total - Elapsed: 00:00:20.0156375
Task 26: Connection.Open started at 4/27/2020 9:34:51 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 14040.7895
Task 21: Connection.Open started at 4/27/2020 9:34:47 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 18045.5826
Task 01: Total - Elapsed: 00:00:33.5539375
Task 07: Total - Elapsed: 00:00:31.0564162
Task 00: Total - Elapsed: 00:00:33.5541849
Task 30: Connection.Open started at 4/27/2020 9:34:55 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 10039.1023
Task 18: Total - Elapsed: 00:00:21.0258884
Task 23: Connection.Open started at 4/27/2020 9:34:49 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 16044.2951
Task 09: Total - Elapsed: 00:00:29.5554896
Task 31: Connection.Open started at 4/27/2020 9:34:56 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 9036.8845
Task 03: Total - Elapsed: 00:00:33.5544169
Task 17: Total - Elapsed: 00:00:21.5493243
Task 33: Connection.Open started at 4/27/2020 9:34:58 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 7036.9824
Task 12: Total - Elapsed: 00:00:26.5529391
Task 34: Connection.Open started at 4/27/2020 9:34:59 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 6035.7954
Task 02: Total - Elapsed: 00:00:33.5545999
Task 35: Connection.Open started at 4/27/2020 9:35:00 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 5035.7386
Task 16: Total - Elapsed: 00:00:22.5499312
Task 36: Connection.Open started at 4/27/2020 9:35:01 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 4035.4121
Task 04: Total - Elapsed: 00:00:33.5546821
Task 14: Total - Elapsed: 00:00:24.5453515
Task 27: Connection.Open started at 4/27/2020 9:34:52 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 13040.8999
Task 11: Total - Elapsed: 00:00:27.5058150
Task 29: Connection.Open started at 4/27/2020 9:34:54 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 11036.1083
Task 37: Connection.Open started at 4/27/2020 9:35:02 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 3035.0194
Task 32: Connection.Open started at 4/27/2020 9:34:57 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 8038.2898
Task 38: Connection.Open started at 4/27/2020 9:35:03 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2023.0241
Task 05: Total - Elapsed: 00:00:32.5594002
Task 39: Connection.Open started at 4/27/2020 9:35:04 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1033.8102
Task 25: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.9938
Task 21: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8662
Task 22: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.5818
Task 28: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.0527
Task 33: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.5791
Task 31: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.7331
Task 30: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.0152
Task 24: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.8402
Task 23: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8164
Task 26: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.3343
Task 20: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.898
Task 34: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8759
Task 35: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8686
Task 36: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8287
Task 37: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.7166
Task 29: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.889
Task 32: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8629
Task 38: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8027
Task 27: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.0957
Task 39: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8029
Task 25: Total - Elapsed: 00:00:14.5450434
Task 21: Total - Elapsed: 00:00:18.0492143
Task 22: Total - Elapsed: 00:00:17.0452281
Task 28: Total - Elapsed: 00:00:12.0440717
Task 33: Total - Elapsed: 00:00:07.0400924
Task 30: Total - Elapsed: 00:00:10.0426584
Task 31: Total - Elapsed: 00:00:09.0402773
Task 23: Total - Elapsed: 00:00:16.0478210
Task 26: Total - Elapsed: 00:00:14.0450272
Task 24: Total - Elapsed: 00:00:15.5469980
Task 34: Total - Elapsed: 00:00:06.0394797
Task 35: Total - Elapsed: 00:00:05.0394576
Task 36: Total - Elapsed: 00:00:04.0390648
Task 37: Total - Elapsed: 00:00:03.0384628
Task 38: Total - Elapsed: 00:00:02.0265013
Task 32: Total - Elapsed: 00:00:08.0421579
Task 27: Total - Elapsed: 00:00:13.0456037
Task 29: Total - Elapsed: 00:00:11.0406611
Task 20: Total - Elapsed: 00:00:19.0509156
Task 39: Total - Elapsed: 00:00:01.0373777
Total: 00:00:33.5828616

The interesting thing is that from the managed call stack, we see the threads were waiting on SNIMarsHandle.Receive:

(lldb) clrstack
OS Thread Id: 0x9654 (20)
        Child SP               IP Call Site
00007EFF421D6C00 00007effbe084360 [GCFrame: 00007eff421d6c00]
00007EFF421D6CE0 00007effbe084360 [HelperMethodFrame_1OBJ: 00007eff421d6ce0] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
00007EFF421D6E10 00007EFF42F207B4 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
00007EFF421D6EA0 00007EFF455A4595 Microsoft.Data.SqlClient.SNI.SNIMarsHandle.Receive(Microsoft.Data.SqlClient.SNI.SNIPacket ByRef, Int32)
00007EFF421D6F10 00007EFF44110B12 Microsoft.Data.SqlClient.SNI.SNIProxy.ReadSyncOverAsync(Microsoft.Data.SqlClient.SNI.SNIHandle, Microsoft.Data.SqlClient.SNI.SNIPacket ByRef, Int32)
00007EFF421D6F40 00007EFF44110A9D Microsoft.Data.SqlClient.SNI.TdsParserStateObjectManaged.ReadSyncOverAsync(Int32, UInt32 ByRef)
00007EFF421D6F90 00007EFF441103C1 Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
00007EFF421D6FF0 00007EFF441102B9 Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
00007EFF421D7010 00007EFF4412D7FB Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
00007EFF421D7030 00007EFF4412D6EE Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte ByRef)
00007EFF421D7060 00007EFF4412C778 Microsoft.Data.SqlClient.TdsParser.TryRun(Microsoft.Data.SqlClient.RunBehavior, Microsoft.Data.SqlClient.SqlCommand, Microsoft.Data.SqlClient.SqlDataReader, Microsoft.Data.SqlClient.BulkCopySimpleResultSet, Microsoft.Data.SqlClient.TdsParserStateObject, Boolean ByRef)
00007EFF421D7200 00007EFF4412C132 Microsoft.Data.SqlClient.TdsParser.Run(Microsoft.Data.SqlClient.RunBehavior, Microsoft.Data.SqlClient.SqlCommand, Microsoft.Data.SqlClient.SqlDataReader, Microsoft.Data.SqlClient.BulkCopySimpleResultSet, Microsoft.Data.SqlClient.TdsParserStateObject)
00007EFF421D7270 00007EFF455A3A53 Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[], TransactionManagerRequestType, System.String, TransactionManagerIsolationLevel, Int32, Microsoft.Data.SqlClient.SqlInternalTransaction, Microsoft.Data.SqlClient.TdsParserStateObject, Boolean)
00007EFF421D7320 00007EFF455A2723 Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest, System.String, System.Data.IsolationLevel, Microsoft.Data.SqlClient.SqlInternalTransaction, Boolean)
00007EFF421D73F0 00007EFF455A235D Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest, System.String, System.Data.IsolationLevel, Microsoft.Data.SqlClient.SqlInternalTransaction, Boolean)
00007EFF421D7440 00007EFF455A18D0 Microsoft.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(System.Data.IsolationLevel, System.String, Boolean)
00007EFF421D74D0 00007EFF455A14EF Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(System.Data.IsolationLevel, System.String)
00007EFF421D7520 00007EFF455A13EF Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(System.Data.IsolationLevel)
00007EFF421D7540 00007EFF4340E81B reprocli.Program.Scenario4(System.String, Int32) [/home/dragon/k8s-sql-timeout-repro/source/reprocli/Program.cs @ 54]

And we can trace where the ManualResetEventSlim was signaled:

(lldb) clrstack
OS Thread Id: 0x9d24 (51)
        Child SP               IP Call Site
00007EFEA8FF85B0 00007EFF42F20C46 System.Threading.ManualResetEventSlim.UpdateStateAtomically(Int32, Int32) [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 720]
00007EFEA8FF85F0 00007EFF42F2026A System.Threading.ManualResetEventSlim.Set(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 310]
00007EFEA8FF8640 00007EFF42F2022D System.Threading.ManualResetEventSlim.Set() [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 297]
00007EFEA8FF8650 00007EFF455C644B Microsoft.Data.SqlClient.SNI.SNIMarsHandle.HandleReceiveComplete(Microsoft.Data.SqlClient.SNI.SNIPacket, Microsoft.Data.SqlClient.SNI.SNISMUXHeader)
00007EFEA8FF86E0 00007EFF455C54A0 Microsoft.Data.SqlClient.SNI.SNIMarsConnection.HandleReceiveComplete(Microsoft.Data.SqlClient.SNI.SNIPacket, UInt32)
00007EFEA8FF8750 00007EFF455C4F1B Microsoft.Data.SqlClient.SNI.SNIPacket+<>c__DisplayClass33_0.<ReadFromStreamAsync>b__0(System.Threading.Tasks.Task`1<Int32>)
00007EFEA8FF8800 00007EFF4310DE1A System.Threading.Tasks.ContinuationTaskFromResultTask`1[[System.Int32, System.Private.CoreLib]].InnerInvoke() [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/TaskContinuation.cs @ 135]
00007EFEA8FF8820 00007EFF42F3A142 System.Threading.Tasks.Task+<>c.<.cctor>b__274_0(System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2427]
00007EFEA8FF8830 00007EFF42F1ECE2 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 289]
00007EFEA8FF8870 00007EFF42F345C1 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2389]
00007EFEA8FF88F0 00007EFF42F343D8 System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2321]
00007EFEA8FF8910 00007EFF42F3436B System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2312]
00007EFEA8FF8920 00007EFF42F28B14 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/System.Private.CoreLib/shared/System/Threading/ThreadPool.cs @ 663]
00007EFEA8FF89A0 00007EFF42F16B3B System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() [/_/src/System.Private.CoreLib/src/System/Threading/ThreadPool.CoreCLR.cs @ 29]
00007EFEA8FF8D50 00007effbc817f7f [DebuggerU2MCatchHandlerFrame: 00007efea8ff8d50]

The above managed call stack gives the impression that the thread was waiting to receive data from the network.

However, from the TCPDump trace, we see the server side had already sent back the packet (Frame 487) and the client had acknowledged it (Frame 562). The client side then waited for 35+ seconds before sending another packet to the server (Frame 2177). After that, network traffic became normal without delays.

(This TCPDump trace was captured in a different run, so the delay was slightly different from the output above.)

[screenshot: TCPDump trace excerpt showing the gap before Frame 2177]

It looks like the client socket layer received the data, but it did not notify the application in real time and was blocked somewhere.

How can we tell what blocked it?

@pawelpabich

I haven't been responding lately, but just to let you know I am investigating this issue. I'm able to reproduce the problem in Linux Docker containers when connecting to any SQL Server, irrespective of whether it is an Azure DB, and it does seem to be related to Linux memory management: when running the same app on Windows using the same Managed SNI code path, the problem does not reproduce. @Wraith2 seems to have provided some details.

However, I'm investigating whether it is specifically related to containerized environment limitations. I'm not sure at this point whether this has anything to do with transactions (I don't think so), but I'll share updates soon!

From the little I know of it, threads are even more expensive on Linux than they are on Windows; IIRC Windows is set up for fast thread creation and Linux for fast process creation. So, as you say, not cheap, and it's a poor way to scale. Setting MinThreads on the pool is a way to get around it. It's not a great idea for production, but it's a public API, so it is provided and supported for you to use if you need to. Read around the method and see if you think it's an acceptable thing to do in your case.
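If you want to experiment with that mitigation, here is a minimal sketch of raising the thread pool minimums at startup; the ThreadPoolWarmup class name and the 200/200 values are illustrative placeholders, not recommendations:

using System;
using System.Threading;

static class ThreadPoolWarmup
{
    // Call once at application startup, before the first burst of SQL work hits the pool.
    // The values 200/200 are purely illustrative; measure and tune for your own workload.
    public static void Apply()
    {
        ThreadPool.GetMinThreads(out int workerThreads, out int completionPortThreads);
        Console.WriteLine($"Current minimums: worker={workerThreads}, io={completionPortThreads}");

        // Raising the minimum lets the pool inject threads immediately under load instead of
        // ramping up slowly, which is what starves the blocking MARS receive path.
        if (!ThreadPool.SetMinThreads(200, 200))
        {
            Console.WriteLine("SetMinThreads rejected the values (e.g. above the configured maximums).");
        }
    }
}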

SQL server is not the bottleneck, the server is barely doing anything throughout the benchmark. The problem is the client process thread creation as far as I can see.

A1: Turning off MARS on Linux removes an intermediate handle object and changes where and how receives are done. In the code I linked you can see the Receive using a packetEvent wait handle. Without MARS you'd see its caller calling into https://github.com/dotnet/SqlClient/blob/bac3ab4494c2a5874e56f6d8e0b27265e04f9ca7/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNITcpHandle.cs#L493 which doesn't do that kind of wait at all and instead makes a direct socket call. So you don't hit the problem, because the wait is done in the socket implementation, which may not require a thread pool. Really, if you can avoid using MARS, I would advise doing so; it doesn't seem to have many upsides and significantly increases the likelihood of issues on the managed network implementation. So yes, one packet really can make all the difference, because that one packet on the MARS path immediately causes a synchronous wait that can't be unlocked without the resource it's blocking.
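For anyone wanting to try that, dropping MARS is only a connection-string change. A minimal sketch, with placeholder server, database and credentials:

using Microsoft.Data.SqlClient;

class MarsOffExample
{
    static void Main()
    {
        // Server, database and credentials below are placeholders.
        var builder = new SqlConnectionStringBuilder
        {
            DataSource = "tcp:myserver.database.windows.net,1433",
            InitialCatalog = "MyDatabase",
            UserID = "myuser",
            Password = "mypassword",
            // With MARS off, the managed SNI reads from the socket directly (SNITcpHandle)
            // instead of going through the SMUX/MARS handle and its wait handle.
            MultipleActiveResultSets = false
        };

        using var connection = new SqlConnection(builder.ConnectionString);
        connection.Open();
    }
}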

A2: At a guess, if it's coming from the SqlClient side, I'd say it's because the socket is being shut down due to the failure. Once successfully connected, sockets are kept open until they're removed from the pool; even if you close the SqlConnection object, the SqlConnectionInternal object it contains stays alive for a while, unless there is a failure, which dooms it to immediate closure. If it's from the server side, I couldn't say.
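For illustration, a small sketch of the pooling behaviour described above; the connection string and class name are placeholders:

using Microsoft.Data.SqlClient;

class PoolingSketch
{
    // Placeholder connection string.
    const string ConnString = "Server=tcp:myserver,1433;Database=MyDb;User Id=myuser;Password=mypassword;";

    static void Main()
    {
        using (var first = new SqlConnection(ConnString))
        {
            first.Open();
        } // Close/Dispose returns the internal connection (and its socket) to the pool; the socket stays open.

        using (var second = new SqlConnection(ConnString))
        {
            second.Open(); // Normally reuses the pooled physical connection opened above.

            // Only an explicit pool clear, or a failure that dooms the internal connection,
            // forces the underlying sockets to actually be closed.
            SqlConnection.ClearPool(second);
        }
    }
}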

Do you have a reproduction?

I am not sure there is much point, as the design issue seems to have already been acknowledged by greater minds than mine in the discussion you linked here.

We have a mostly under-utilised Azure SQL database (auto-scaling) and an Ubuntu VM running a Linux Docker container. We aren't using MARS. I am curious to know whether you or anyone else has tried the repro you linked above for the non-MARS case, as my understanding is that the same issues on Linux will surface at some point even with MARS disabled, since it still does sync-over-async and still relies on a BCL network API that requires the thread pool.

I think you've basically stated the only real solution (workaround) for now, which is to prep the thread pool manually by boosting the min threads in advance of placing load. However, I'm wondering how best one can anticipate the load in order to adjust the thread pool min thread count in advance. In our scenario we have cron jobs which cause things to ramp up, but those jobs query external systems, and depending on the data read from those systems there may be more or less work for the thread pool to do. Also, if this is the official workaround, is it possible to get some better guidance on how to set min/max threads to a reasonable number, and on whether we should decrease them again afterwards? It is perhaps also worth talking with the EF Core team to ensure that SQL Server examples or docs show, or at least mention, this issue for awareness. Otherwise it's just going to catch more and more teams out.

And if I have understood correctly and the Microsoft SQL client team are aware of the issue with sync-over-async and scalability on Linux, may I ask whether they have done any tests to probe for points of failure and increase understanding of the acknowledged issue, i.e. what sorts of loads will trigger the problem that is acknowledged in the design? This would help us create a repro, and teams designing new apps with load patterns that SqlClient struggles with may want to use something else, like Postgres.

@cheenamalhotra thanks, will do. Just FYI, with MARS enabled we still get the same issue.

@ErikEJ @Wraith2 Many thanks for your input!

Regarding opening 100 connections: we do not open any connections manually; that is all handled by EF Core 5. So I think we should be fine there, and we will verify that with some load-testing.

MARS is disabled in our connection string. And all usages of EF are implemented with async.

There are some issues with Linux, as you can see here - switching platform requires due diligence!

That’s why we are testing the app. We didn’t just deploy it to production.

I also agree that "the fundamental building blocks of the platform need to be predictable no matter what supported OS they run on", but we must acknowledge that this is a fantasy unless we mean predictable = same output. If predictable means the same output in the same time frame, we have a big problem, and it will probably never be predictable.

When dealing with such specific matters (OS thread-pooling strategy, such as IOCP on Windows vs. epoll on Linux) I think it's completely reasonable to have OS-specific implementations, because the underlying strategy is different! It would be lovely to have a cross-platform API to unify this, but the hard truth is that there is not one, and we need to deliver the best performance we can. So, with that in mind, I think a nasty, cringe-worthy "#if Linux" approach is the way to go until new opportunities come along.

What I think we need is a cross-platform way to do non-blocking async IO that doesn't rely on the thread pool. On Windows that's IOCP, but dropping to platform specifics in this implementation is highly undesirable. The goal of the managed SNI is to have implementation parity across all supported OSes so that behaviour and performance are identical at this layer.

I agree. The fundamental building blocks of the platform need to be predictable no matter what supported OS they run on.

Are there any updates on a fix for this?

We've been having some major problems and have started rewriting everything that touches the database to use async, as @cheenamalhotra recommended, but it's a massive refactor for us. Should we expect any improvement from moving our most-hit calls to async, or will we only see a reduction in problems once we are no longer making any synchronous calls?

We’re also considering moving back to Windows because it sounds like that would resolve the issue here?
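For reference, the per-call-site shape of the sync-to-async refactor discussed above is small, even when there are many call sites. A minimal sketch, with an illustrative table and method name:

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class OrdersRepository
{
    // "dbo.Orders" and the method name are illustrative only.
    public static async Task<int> CountOrdersAsync(string connectionString)
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();                        // was: connection.Open()

        using var command = new SqlCommand("SELECT COUNT(*) FROM dbo.Orders", connection);
        object result = await command.ExecuteScalarAsync();  // was: command.ExecuteScalar()
        return Convert.ToInt32(result);
    }
}

The point of the change is that the calling thread is returned to the pool while the query is in flight, instead of blocking on the receive path described earlier.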

This changes multi-threaded code into single-threaded code. Unfortunately, that is not a realistic workaround.

It took longer than we expected and it is going to take longer to solve the issue completely, but a quick workaround when using PLINQ with M.D.S is changing the PLINQ part to the following:

Enumerable.Range(0, count)
    .AsParallel()
    .WithDegreeOfParallelism(count)
    .ToList()
    .ForEach(n => Scenario4(connectionString, n));

With this change the code runs with no delay or hesitation.

Hi, I have exactly the same issue as @tunger. I’m running under .net core 3.1.3 (Linux docker container as Azure Web App) with Azure MSSQL DB. I’m just using Dapper instead of EF. With MARS disabled everything works fine.

I do experience this issue on Azure Web App For Containers (on Linux), but not when running the same container locally on Docker Desktop (on Windows). Hope this helps.

Well, regarding the validity of the tests: macOS behaves very similarly to Linux, including in its thread management; after all, macOS is Unix-based. So I would take my repro with a grain of salt, but I'm quite confident in the analysis.

Regarding the transactions blocking each other, there are several workarounds, but they rely on actual code changes, business-rule validations, and so on. This is a database issue, not an implementation issue. If you're out of your mind you could remove the transaction altogether (and, when questioned, you can say the fever from the coronavirus made you hallucinate), but in a more realistic scenario you can try setting the isolation level to "ReadUncommitted", which will decrease the blocking/deadlocking immensely.
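A minimal sketch of that isolation-level change (connection string, query and class name are placeholders); whether dirty reads are acceptable is of course a business decision:

using System.Data;
using Microsoft.Data.SqlClient;

class ReadUncommittedSketch
{
    static void Main()
    {
        // Placeholder connection string and query.
        const string connString = "Server=tcp:myserver,1433;Database=MyDb;User Id=myuser;Password=mypassword;";

        using var connection = new SqlConnection(connString);
        connection.Open();

        // ReadUncommitted permits dirty reads, which removes most shared-lock blocking
        // at the cost of possibly reading data that is later rolled back.
        using var transaction = connection.BeginTransaction(IsolationLevel.ReadUncommitted);

        using var command = new SqlCommand("SELECT COUNT(*) FROM dbo.Orders", connection, transaction);
        command.ExecuteScalar();

        transaction.Commit();
    }
}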