SqlClient: Async reader may return wrong results on MacOS, Linux, WSL, Docker
Describe the bug
Under high load, a query may sometimes return wrong results on non-Windows clients.
To reproduce
- .NET Core
- MSSQL instance
- Client is not running on Windows (reproduced on MacOS or Linux)
- Query uses async reader methods
- High load that saturates the connection pool and causes timeouts.
The bug is seen more often when MultipleActiveResultSets=True (MARS) is in the connection string. It might be prevented by disabling MARS (unclear at this point).
The issue is hard to reproduce and rare, therefore this project was created with the code that attempts to simulate the situation.
This program starts 2000 concurrent connections, each running a select @Id as Id statement. Each Id is different, and the query should always return the id that was queried, yet under certain circumstances that's not the case. It looks like the reader sometimes returns stale data from a random previous connection that experienced a timeout.
There are more details in the linked project.
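For illustration, here is a minimal sketch of the kind of check the repro performs. This is not the linked project's code; the connection string and degree of parallelism are placeholders.

```csharp
// Hypothetical sketch of the repro idea (not the linked project's code):
// many concurrent queries, each selecting its own id and verifying that the
// id that comes back matches the one that was sent.
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

class ReproSketch
{
    // Placeholder connection string; the real repro points at an MSSQL
    // instance and uses MARS, which makes the problem more likely.
    const string ConnectionString =
        "Server=tcp:example-server;Database=master;User Id=sa;Password=<password>;" +
        "MultipleActiveResultSets=True;Connect Timeout=5";

    static async Task Main()
    {
        // 2000 concurrent connections, as described above.
        await Task.WhenAll(Enumerable.Range(0, 2000).Select(QueryAndVerifyAsync));
    }

    static async Task QueryAndVerifyAsync(int id)
    {
        try
        {
            using var connection = new SqlConnection(ConnectionString);
            await connection.OpenAsync();

            using var command = new SqlCommand("select @Id as Id", connection);
            command.Parameters.AddWithValue("@Id", id);

            using var reader = await command.ExecuteReaderAsync();
            while (await reader.ReadAsync())
            {
                int returnedId = reader.GetInt32(0);
                if (returnedId != id)
                {
                    // The bug under discussion: the reader hands back a result
                    // that belongs to a different (timed-out) connection.
                    Console.WriteLine($"WRONG DATA: asked for {id}, got {returnedId}");
                }
            }
        }
        catch (SqlException ex)
        {
            // Timeouts are expected once the pool is saturated; they are what
            // triggers the wrong-data behaviour.
            Console.WriteLine($"Query {id} failed: {ex.Message}");
        }
    }
}
```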
Network latency can be simulated more easily using the method here: https://github.com/trainline/SqlClient659
Expected behavior
Under no circumstances should a select statement return a different result.
Further technical details
- Microsoft.Data.SqlClient version: 4.8.1 or earlier
- .NET target: netcoreapp2.0 - netcoreapp3.1
- SQL Server version: SQL Server Web Edition 13.00.5426.0.v1, SQL Server Express Edition 14.00.3281.6.v1
- Operating system: macOS 10.15.5, Docker container
Workaround
- Do not use this SQL client on Linux/Docker.
- Always check the id of the returned result and retry or crash if it is wrong (see the sketch below).
- To decrease the likelihood of the problem, or maybe prevent it entirely (unclear at this point), do not use the MARS feature: if you are using .NET Core with an MSSQL database, ensure that the connection string does not contain MultipleActiveResultSets=True (also shown in the sketch below).
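A minimal sketch of the last two workarounds is below; the helper names are invented for this example and are not part of SqlClient.

```csharp
// Sketch of the last two workarounds (helper names are invented for this
// example): force MARS off and fail fast when the returned id is wrong.
using System;
using Microsoft.Data.SqlClient;

static class SqlClientWorkarounds
{
    // Rewrite any connection string so MARS is definitely disabled.
    public static string WithoutMars(string connectionString) =>
        new SqlConnectionStringBuilder(connectionString)
        {
            MultipleActiveResultSets = false
        }.ConnectionString;

    // "Retry or crash if it is wrong" -- this helper takes the crash option,
    // so wrong data can never be consumed silently; callers may retry instead.
    public static void VerifyId(int requestedId, int returnedId)
    {
        if (requestedId != returnedId)
            throw new InvalidOperationException(
                $"SqlClient returned id {returnedId} for a query that asked for id {requestedId}.");
    }
}
```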
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 12
- Comments: 44 (23 by maintainers)
Commits related to this issue
- Backport SqlClient 659 to release/3.1: Fix async timeout design leading to intermittent wrong data Backport of https://github.com/dotnet/SqlClient/issues/659 to 3.1 Cherry-picks: - https://github.co... — committed to vseanreesermsft/corefx by cheenamalhotra 2 years ago
- Backport SqlClient 659 to release/3.1: Fix async timeout design leading to intermittent wrong data Backport of https://github.com/dotnet/SqlClient/issues/659 to 3.1 Cherry-picks: - https://github.co... — committed to CryptoPro/corefx by cheenamalhotra 2 years ago
The latest Microsoft.Data.SqlClient v2.1.2 has been released containing a fix for this issue. Closing as resolved.
The underlying bug that was found was caused by threadpool exhaustion. It is a bug in the TdsParser layer, which sits above the network layer, so the problem could occur in any version of the driver, but it was much more likely in managed (Linux, macOS) mode and with MARS because of the way they use threads.
The issue itself came down to the timing of internal cancellation messages and how they were implemented, which made it possible for a timeout to arrive after a result had been returned, or even worse, after another query had begun. What the PR does is harden the timeout handling by adding synchronous timeout checks and carefully tracking the origin of each async timeout, so that if it doesn't match the executing query it is discarded. This results in more accurate timeouts in high-contention scenarios and prevents cross-query timeouts, which were causing the strange mismatch in the driver.
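The actual change lives in the driver's internal TdsParser/SqlCommand plumbing; purely as an illustration of the discard-on-mismatch idea (names and structure invented for this sketch, not SqlClient internals), the technique looks roughly like this:

```csharp
// Illustration only (names and structure invented, not SqlClient internals):
// every operation gets a fresh id, the timeout callback remembers the id it
// was armed for, and a timeout whose id no longer matches the currently
// executing operation is discarded instead of being acted on.
using System;
using System.Threading;

class OperationTimeoutTracker
{
    private long _currentOperationId;

    // Called when a new query starts; returns the id that a timeout must match.
    public long BeginOperation() => Interlocked.Increment(ref _currentOperationId);

    // Arm an async timeout bound to one specific operation.
    public Timer ArmTimeout(long operationId, TimeSpan timeout, Action onTimeout)
    {
        return new Timer(_ =>
        {
            // A late-firing timer from a previous query is ignored here rather
            // than cancelling (and corrupting) whatever query is running now.
            if (Interlocked.Read(ref _currentOperationId) == operationId)
                onTimeout();
        }, null, timeout, Timeout.InfiniteTimeSpan);
    }
}
```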
I now have a consistent repro of this issue to rule out inconsistencies, as it has always been difficult to reproduce reliably. The issue is exaggerated by threadpool exhaustion; instead of exhausting resources every time, we are able to put a small delay on one of the delegates we know is causing the problem and can reproduce the wrong-data case.
Link to Repro
To run this repro, you'd need to reference the driver source project and update the OnTimeout function with a small delay of 10 seconds, so it looks like this:
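The original snippet was shared as an image, so this is a hedged reconstruction of the described change (the exact OnTimeout signature in the driver source may differ):

```csharp
// Hypothetical reconstruction -- the original snippet was shared as an image,
// and the exact OnTimeout signature in the driver source may differ. The only
// change is an artificial delay so the timeout delegate fires long after the
// query's results have already come back, making the race reproducible.
private void OnTimeout(object state)
{
    // Added for the repro only: hold the timeout delegate for 10 seconds.
    System.Threading.Thread.Sleep(TimeSpan.FromSeconds(10));

    // ... original OnTimeout body continues unchanged ...
}
```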
The captured output was as follows:
While we're trying our best to fix this by various methods, things are very flaky and the design is highly complex. Being able to reproduce consistently does give some hope, but there may be more cases where wrong data is possible. We will provide more updates as we identify more scenarios and solve this one.
@Wraith2, I've spent some time testing the pull request with my own repro case.
Earlier on, I was able to get the incorrect results with the generally available version 2.1.1: consistent wrong results across different attempts. For the sake of completeness: the same wrong results occurred with version 2.1.0.
Next, I tried Microsoft.Data.SqlClient.2.1.0-pull-3bdbb71.21048.9. This version never gave any incorrect results, so it looks like the fix solves the problem. 😃
Thank you all for the brilliant work to track down this elusive bug!
Hello!
I've created a new repro putting together items from this discussion, using docker-compose, so it can easily be executed locally while also simulating network failures: https://github.com/trainline/SqlClient659
I can confirm the problem can be reproduced even without MARS, although it is less frequent.
@cheenamalhotra Any suggestion on how to try to pinpoint the origin of the problem? Is there a diagnostic source available in the SqlClient?
Some more background information about my earlier repro-case, for reference.
The repro case ran as a Docker container in a Microsoft Azure environment, on a Linux App Service Plan (P1V2 and P2V2, both used for testing) and an Azure SQL database (S0). The base Docker image is mcr.microsoft.com/dotnet/core/aspnet:3.1.8-bionic, so a non-Windows environment on .NET Core 3.1. The connection string didn't include any MARS references, so everything ran in a non-MARS environment.
The base situation is a fully async operation: OpenAsync, ExecuteReaderAsync, and ReadAsync. I never used transactions in these scenarios, but used a low CommandTimeout (1 second) to provoke problems as much as possible.
The SQL statement I used is "select @Id as Id".
The issue of the wrong results only occurred when a "Timeout expired" exception appeared. Scaling the database up or down also causes exceptions, but didn't trigger any wrong results (or at least, I couldn't manage to make it do so).
The results. Base scenario: use of OpenAsync, ExecuteReaderAsync, and ReadAsync in a high-load situation with multiple threads and multiple requests. When things went wrong, they kept going wrong in a consistent manner. For instance, suppose there are three users:
- User 1: Select * FROM User where UserId = 1
- User 2: Select * FROM User where UserId = 2
- User 3: Select * FROM User where UserId = 3
User 1 caused an exception. User 2 received the results of User 1, User 3 received the results of User 2, and so on. Different requests got each other's results, even after some time, and more and more SQL statements returned wrong results. The connection pool became corrupted and stayed corrupted; the only solution was to restart the web app.
Mitigations: I tried a number of them.
Attempt 1: Use ConfigureAwait(false) for OpenAsync, ExecuteReaderAsync, and ReadAsync. No effect; wrong results keep appearing as in the base scenario.
Attempt 2: Use .Open() instead of OpenAsync(). Exceptions occurred more often (due to "Timeout expired"), but there were only a few wrong results (10 out of 5000 statements went wrong, instead of 4000+ out of 5000 in the base scenario). Furthermore, the wrong results weren't consistently wrong: new requests could be fine, so no restart of the application was required. It looked like some connections became corrupt, but somehow this didn't influence the connection pool for later connections.
Attempt 3: Change the Max Connection Pool Size. No noticeable effect on correctness, but performance was reduced significantly.
Results with the patch: in all situations, no wrong results occurred.
In summary: The only mitigation I could find, was to:
The patch solves the problem of wrong results appearing, so I'm looking forward to this hotfix being released… 😉
The problem could affect all versions of the library running on any version of .NET. It was more likely with the managed driver, but the root cause was shared: it was a timing issue.
Yes, every version of SqlClient prior to the fix could be affected. It doesn't look like the fix was backported to corefx, so the System.Data.SqlClient version retains the ability to encounter the problem: https://github.com/dotnet/corefx/pulls?q=is%3Apr+is%3Aclosed+label%3Aarea-System.Data.SqlClient. I don't know whether servicing was considered, but given the rarity of the problem and the difficulty of the fix I can see why it wouldn't be an obvious choice to push back to an LTS release.
Hi @cheenamalhotra, @Wraith2,
As you know, reproducing the issue on real-world applications, especially without MARS, is extremely challenging, so it is hard to prove the fix. I can confirm, however, that it doesn't introduce any appreciable regression.
Thanks for the great effort on this!
This fix applies to the SqlCommand layer and to the Begin and End methods of ExecuteReaderAsync, so MARS was not a factor, and neither is Managed SNI.
We also know ExecuteXmlReaderAsync has the same issue; I will apply the same fix there on Monday.
Results of a few mitigation attempts:
- Set Pooling = false (MultipleActiveResultSets=True;Pooling=false;): painfully slow and many timeouts, but no wrong results, which is good.
- Removing the connection that encountered an exception (a rough sketch follows this list): got wrong results on a third attempt.
- Attempting to execute a SQL command without attaching a transaction to the command: this was attempted in production when it was still unclear whether this is a MARS issue. It makes wrong results rarer, but they still happen. I was not able to reproduce it in this test project with a simplistic select statement.
- Using ExecuteScalarAsync instead of ExecuteReaderAsync: tried it 5 times and wrong results did not happen.
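A rough sketch of the "remove the connection that encountered an exception" attempt, assuming the public SqlConnection.ClearPool API (the method and class names here are invented and this is not the original test project's code):

```csharp
// Sketch of the "remove the failing connection" attempt (assumed names, not
// the original test project's code): on a SQL exception, clear the pool that
// the suspect connection belongs to so it cannot be handed out again.
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class PoolScrubbing
{
    public static async Task<object> ReadFirstValueAsync(string connectionString, string sql)
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        // Low timeout, as in the scenarios above, to provoke failures.
        using var command = new SqlCommand(sql, connection) { CommandTimeout = 1 };

        try
        {
            using var reader = await command.ExecuteReaderAsync();
            return await reader.ReadAsync() ? reader.GetValue(0) : null;
        }
        catch (SqlException)
        {
            // Drop every pooled connection that shares this connection string;
            // in the tests above this still produced wrong results eventually.
            SqlConnection.ClearPool(connection);
            throw;
        }
    }
}
```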
@DavoudEshtehari I hope this helps.
It could occur in all versions of SqlClient prior to the fix in 2.1.2
Correction: the fix has also been released in v1.1.4.
We're very keen, @Wraith2; we just need to find a moment, having spent engineering time moving Linux services back to Windows to mitigate the issue.
Hi Everyone!
We'd like to request community validation of the fix from PR #906. We've captured the root cause and fixed the issue in the core parser layer.
Please give the below NuGet package a try and send us your feedback: Microsoft.Data.SqlClient.2.1.0-pull-89d6662.21043.7 from Pipeline Artifacts
That’s great news, I was really struggling to find a good repro for the non-MARS scenario! I will start testing this build next week.
Hi everyone!
Thank you for your patience; we seem to have a fix available to try. We would ask that you please try out the PR https://github.com/dotnet/SqlClient/pull/894 and let us know!
Download Build CI NuGet [Link Updated for final changes | 04-02-2021]
Hi Everyone!
Providing this week’s progress so far:
We have managed to capture and study detailed traces of this behavior, including Event Source, SQL Server Profiler, and Wireshark, to piece together the full story, which I'll detail below:
Fact we know: when an exception occurs on the client side, the client shall send an Attention packet to the server to make it immediately stop processing and not send any further data. On the client side, sending attention can mean something too, for example ensuring the transaction is rolled back and data at rest is cleared properly, or deciding whether the connection should be killed and not reused. It depends on when and how we receive an exception and send the attention.
Problem we see here: there are two situations in play when the "Execution Timeout Expired" exception is about to be received, and it looks like there's a race condition between the transaction commit and the Attention packet when a read timeout occurs in SNIMarsHandle. More precisely, the race condition is between throwing an exception and providing results.
We know this is related to a client-side timeout that occurs during Read, as read operations are blocked due to thread starvation: even though the server sends data on time, the client does not read it on time. But even if we manage to fix the thread starvation issue, that is not the solution to this particular problem, as the BAD FLOW is the real issue and can occur in any similar situation.
Will provide more updates as we continue to study our traces to dig deeper and find a solution.
While I was preparing the requested changes, I found out it is happening again. Maybe the network conditions had simultaneously affected my test with the last change. The issue intermittently happens with either System.Data.SqlClient or Microsoft.Data.SqlClient when ExecuteReaderAsync blends with exceptions.
Yes, this matches my experience. As I mentioned, the issue seems to be related to timeouts, and they cannot happen if the DB server and the code are on the same machine and the query is doing basically nothing.
Try to introduce some latency between the DB server and the application.
I reproduced it using a somewhat crude approach: running the DB server in the cloud and the application on my laptop, with a router in another room (so there is considerable latency between the two). Maybe there are tools to do this more cleanly.
In the production environment, AWS starts throttling read operations when they go above the allowed IOPS limit; that's how it originally happened when running on AWS.