SqlClient: TransactionAbortedException when performing queries in parallel inside a transaction scope
Describe the bug
When performing multiple queries in parallel, and each pair of 2 queries are inside a transaction scope, an unexpected transaction error is thrown (see below).
Exception message: System.Transactions.TransactionAbortedException: The transaction has aborted.
Stack trace:
Unhandled exception. System.Transactions.TransactionAbortedException: The transaction has aborted.
---> System.Transactions.TransactionPromotionException: Failure while attempting to promote transaction.
---> System.InvalidOperationException: The requested operation cannot be completed because the connection has been broken.
at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote()
--- End of inner exception stack trace ---
at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote()
at System.Transactions.TransactionStatePSPEOperation.PSPEPromote(InternalTransaction tx)
at System.Transactions.TransactionStateDelegatedBase.EnterState(InternalTransaction tx)
--- End of inner exception stack trace ---
at System.Transactions.TransactionStateAborted.CheckForFinishedTransaction(InternalTransaction tx)
at System.Transactions.EnlistableStates.Promote(InternalTransaction tx)
at System.Transactions.Transaction.Promote()
at System.Transactions.TransactionInterop.ConvertToDistributedTransaction(Transaction transaction)
at System.Transactions.TransactionInterop.GetExportCookie(Transaction transaction, Byte[] whereabouts)
at Microsoft.Data.SqlClient.SqlInternalConnection.GetTransactionCookie(Transaction transaction, Byte[] whereAbouts)
at Microsoft.Data.SqlClient.SqlInternalConnection.EnlistNonNull(Transaction tx)
at Microsoft.Data.ProviderBase.DbConnectionInternal.ActivateConnection(Transaction transaction)
at Microsoft.Data.ProviderBase.DbConnectionPool.PrepareConnection(DbConnection owningObject, DbConnectionInternal obj, Transaction transaction)
at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInterna
l& connection)
at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
at Microsoft.Data.SqlClient.SqlConnection.InternalOpenAsync(CancellationToken cancellationToken)
--- End of stack trace from previous location ---
To reproduce
Below code reproduces the error. It runs 100K times 2 queries in parallel where each 2 queries are inside a single transaction scope. Note that after each query the database connection is disposed (and thus closed). This is important to trigger the error. When we change the code, and use 1 connection for both queries and only open 1 connection the error does not occur (see also code comments below).
using System.Transactions;
using Microsoft.Data.SqlClient;
static async Task PerformTransactionWithQuery(int num)
{
try
{
using (new TransactionScope(TransactionScopeAsyncFlowOption.Enabled))
{
// Adding "max pool size=1000;" to the connection string seems to trigger the problem less often
string connStr = @"Server=.\SQLEXPRESS;Database=master;Trusted_Connection=True;Encrypt=False";
var query = "SELECT COUNT(*) FROM sys.dm_tran_active_transactions";
await using (var dbConn = new SqlConnection(connStr))
{
await dbConn.OpenAsync();
await using (var command1 = new SqlCommand(query, dbConn))
{
await command1.ExecuteScalarAsync();
}
} // Connection is disposed (and thus closed)
await using (var dbConn = new SqlConnection(connStr))
{
// Reopening the connection triggers the following error:
// System.Transactions.TransactionAbortedException: The transaction has aborted.
//
// NB. Using a single connection and opening it once does NOT trigger the error
await dbConn.OpenAsync();
await using (var command2 = new SqlCommand(query, dbConn))
{
await command2.ExecuteScalarAsync();
}
} // Connection is disposed (and thus closed)
//Do not complete transaction
}
}
catch (Exception e)
{
Console.WriteLine($"Failed {num}");
throw;
}
}
var tasks = Enumerable.Range(0, 100000).ToList().Select(PerformTransactionWithQuery);
await Task.WhenAll(tasks);
Expected behavior
Being able to use multiple connections (with the same connection string) in sequence within the same transaction without errors.
Further technical details
Microsoft.Data.SqlClient version: 4.1.0 .NET target: .NET 6 SQL Server version: SQL Server 2019 Operating system: Windows 11
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 5
- Comments: 48 (30 by maintainers)
I’ve spent of time and I think I’ve understood what’s going on.
Consider the following code:
As a quick test, I’ve commented out the event handler registration (this line); this means that SqlClient no longer listens to Transaction.TransactionCompleted. Before this change, the repro in this issue consistently failed, whereas after the change it consistently completes successfully. Since the bug apparently isn’t present in the .NET Framework version, I’d look at how it’s working and what changes were made to the .NET Core one which led to this.
I’m not yet quite sure why this bug would lead to promotion to a distributed transaction as we’re seeing above, but I’m guessing that with the dooming of arbitrary connections, the state could easily get screwed up and lead to undefined results.
@DavoudEshtehari - Yes that’s what I noted earlier. However what I didn’t notice at the time was that this causes the transaction to be distributed. Running under .NET 6 on linux this will then fail as distributed transactions are not allowed.
Any viable workaround for this issue? We’re getting occasional errors (no exact metrics, about one in 10k connections), and have no idea how to bypass the issue for now. We tried increasing pool size, but it did not help. Is the only way to avoid this - to disable connection pooling? Performance hit will be considerable, and I’d like to avoid it if possible.
Exception we’re getting is the same as in original post. Environment is asp.net, with each request opening up to 10-20 short lived connections.
I can’t think of any good reason for that to happen. It simply doesn’t make sense.
@DavoudEshtehari ah OK, I missed that.
In general, it seems that the connection cannot be returned to the (general) pool until the transaction has fully completed; since SqlClient registers for the TransactionCompleted event, that’s when this should happen - not earlier. Alternatively, it’s interesting to ask why SqlClient registers for the event in the first place - the transaction enlistment (SqlDelegatedTransaction) should already receive all the necessary events. I’m guessing there’s a reason why the event registration is done, but removing that would certainly simplify a lot of things (and prevent bugs such as this and #1800).
Anyway, I’d be happy to help with any further thinking on this.
@roji thank you for the update on this issue. Unfortunately, this bug exists against the .NET Framework as well: https://github.com/dotnet/SqlClient/issues/1675#issuecomment-1289817442.
@cheenamalhotra I was also confronted with unexpected escalation to a distributed transaction recently (https://github.com/dotnet/efcore/issues/29054), and was assuming it was related to the .NET 7.0 distributed transaction changes… It’s interesting that this occurs on .NET 6.0.
I’m not sure how this could be happening. I do know that SqlInternalConnection.EnlistNonNull would only call SqlInternalConnection.GetTransactionCookie (which causes the escalation) if the call to Transaction.EnlistPromotableSinglePhase return false (barring bugs in Sys.Tx of course). This would mean that there are somehow two distinct connections at play; some sort of bug which messes up the TransactedConnectionPool could conceivably cause this, where e.g. the second open incorrectly returns a different connection. However, that doesn’t explain why that connection is broken, as indicated by exception…
I suspect the exception we’re seeing is the visible, end result of some earlier problem that we’re not seeing… It’s probably a good idea to concentrate on why a connection was broken in the first place, and ideally trying to step through the SqlClient code to see exactly what’s happening (not trivial as this is non-deterministic). Since this happens in .NET 6.0, I wouldn’t assume a Sys.Tx issue unless there’s a stronger indication.
I unfortunately am too taken up by 7.0 stabilization work at the moment to dive into this myself, but would be happy to follow along and try to help out…
This is interesting and I’d agree I didn’t look into that to be the possibility, so it certainly points to something missing out with
OpenAsync()
, which is very strange since both APIs ‘should’ follow the same path. 🤔I possibly was imagining .NET 7, but never mind. It’s still a case of delegated transaction as we’re looking at
System.Transactions.TransactionPromotionException
Let me dive into it and will get back.
@cheenamalhotra There are no distributed transactions in my LinqPad example. MSDTC escalation does not happen because only one connection-string is used. If you replace
await dbConn.OpenAsync();
withdbConn.Open();
(ie. merely switch from async to sync opening of a connection), this code works just fine (ie. even withts.Complete()
commented out).When the 1st connection-open request occurs, it should enlist that connection into the Transaction. Then 1st-connection-close occurs (but the inner-connection is not really closed - just returns to the pool). Then the 2nd connection-open request should receive the exact same already-open inner-connection back from the pool (ie. the same one durable resource is enlisted). This is exactly what happens when
dbConn.Open()
is used. There should be no observable behavior differences betweendbConn.Open();
andawait dbConn.OpenAsync();
- yet the async version crashes.@joostmeijles @sdrapkin
What you’re trying to do is unsafe and I don’t agree this is bug. Old behavior before v1.1.4 was a bug where accounting was not done for externally aborted transactions. I would request a read on https://github.com/dotnet/SqlClient/issues/237#issuecomment-569856610 and https://github.com/dotnet/SqlClient/issues/237#issuecomment-569965788 to understand the root cause.
How distributed transactions work is that they are decoupled from your application and are running within the MS DTC service. Which means if your app tries to be super-fast and expect MS DTC to keep coordinating with SQL Server for the MISSED commit/rollbacks, it needs time to do so, as it needs to abort the transaction on the server.
From driver end, if you closed a POOLED connection, server-side connection is kept open. So now if you happen to re-open the connection BEFORE MSDTC could terminate your previous transaction, you find that AFTER opening the connection, transaction gets aborted. It should never happen that the same OLD transaction continues to be used, and I hope you’d agree with me on that! So, someone must keep things in order. Either it’s your app, or the driver.
If you want driver to consistently terminate your connection alongwith Distributed transaction, DON’T use connection pool! If you want to use connection pool for performance needs, you MUST manage your transaction lifecycle and cannot leave them to be aborted by server.
Likewise, in the above repro, if you do call
ts.Complete()
things do just fine!I would certainly recommend documentation on this somewhere so it’s not a surprise to end users as the default behavior of driver is with pooling enabled, which is clearly not intuitive to users. cc @David-Engel
@DavoudEshtehari thanks for the update! Sounds as a good change, but unfortunately I am still getting an error, although its a different one:
@joostmeijles I confirm it’s reproducible with MDS and we’ll return to it after the next release. I’m afraid we can’t provide any estimation for a fix while it’s in the investigation stage. We’ll keep updating here with any findings, and we welcome any contribution as this is an open-source driver.