SqlClient: Async reader may return wrong results on MacOS, Linux, WSL, Docker
Describe the bug
Under high load, a query may sometimes return wrong results on non-Windows clients.
To reproduce
- .NET Core
- MSSQL instance
- Client is not running on Windows (reproduced on MacOS or Linux)
- Query uses async reader methods
- High load that saturates the connection pool and causes timeouts.
The bug is seen more often when MultipleActiveResultSets=True (MARS) is in the connection string. It might be prevented by disabling MARS (unclear at this point).
The issue is hard to reproduce and rare, therefore this project was created with the code that attempts to simulate the situation.
This program starts 2000 concurrent connections, each running a select @Id as Id statement. Each Id is different, and the query should always return the id that was queried, yet under certain circumstances that's not the case. It looks like the reader sometimes returns stale data from a random previous connection that experienced a timeout.
There are more details in the linked project.
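For illustration, here is a minimal sketch of the kind of check the repro performs. This is not the linked project's code; the connection string and degree of parallelism are placeholders.

```csharp
// Hypothetical sketch of the repro idea (not the linked project's code):
// many concurrent queries, each selecting its own id and verifying that the
// id that comes back matches the one that was sent.
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

class ReproSketch
{
    // Placeholder connection string; the real repro points at an MSSQL
    // instance and uses MARS, which makes the problem more likely.
    const string ConnectionString =
        "Server=tcp:example-server;Database=master;User Id=sa;Password=<password>;" +
        "MultipleActiveResultSets=True;Connect Timeout=5";

    static async Task Main()
    {
        // 2000 concurrent connections, as described above.
        await Task.WhenAll(Enumerable.Range(0, 2000).Select(QueryAndVerifyAsync));
    }

    static async Task QueryAndVerifyAsync(int id)
    {
        try
        {
            using var connection = new SqlConnection(ConnectionString);
            await connection.OpenAsync();

            using var command = new SqlCommand("select @Id as Id", connection);
            command.Parameters.AddWithValue("@Id", id);

            using var reader = await command.ExecuteReaderAsync();
            while (await reader.ReadAsync())
            {
                int returnedId = reader.GetInt32(0);
                if (returnedId != id)
                {
                    // The bug under discussion: the reader hands back a result
                    // that belongs to a different (timed-out) connection.
                    Console.WriteLine($"WRONG DATA: asked for {id}, got {returnedId}");
                }
            }
        }
        catch (SqlException ex)
        {
            // Timeouts are expected once the pool is saturated; they are what
            // triggers the wrong-data behaviour.
            Console.WriteLine($"Query {id} failed: {ex.Message}");
        }
    }
}
```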
Network latency can be simulated more easily using the method here: https://github.com/trainline/SqlClient659
Expected behavior
Under no circumstances should a select statement return a different result.
Further technical details
- Microsoft.Data.SqlClient version: 4.8.1 or earlier
- .NET target: netcoreapp2.0 - netcoreapp3.1
- SQL Server version: SQL Server Web Edition 13.00.5426.0.v1, SQL Server Express Edition 14.00.3281.6.v1
- Operating system: macOS 10.15.5, Docker container
Workaround
- Do not use this SQL client on Linux/Docker.
- Always check the id of the returned result and retry or crash if it is wrong (see the sketch below).
- To decrease the likelihood of the problem, or maybe prevent it entirely (unclear at this point), do not use the MARS feature: if you are using .NET Core with an MSSQL database, ensure that the connection string does not contain MultipleActiveResultSets=True (also shown in the sketch below).
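A minimal sketch of the last two workarounds is below; the helper names are invented for this example and are not part of SqlClient.

```csharp
// Sketch of the last two workarounds (helper names are invented for this
// example): force MARS off and fail fast when the returned id is wrong.
using System;
using Microsoft.Data.SqlClient;

static class SqlClientWorkarounds
{
    // Rewrite any connection string so MARS is definitely disabled.
    public static string WithoutMars(string connectionString) =>
        new SqlConnectionStringBuilder(connectionString)
        {
            MultipleActiveResultSets = false
        }.ConnectionString;

    // "Retry or crash if it is wrong" -- this helper takes the crash option,
    // so wrong data can never be consumed silently; callers may retry instead.
    public static void VerifyId(int requestedId, int returnedId)
    {
        if (requestedId != returnedId)
            throw new InvalidOperationException(
                $"SqlClient returned id {returnedId} for a query that asked for id {requestedId}.");
    }
}
```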
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 12
- Comments: 44 (23 by maintainers)
Commits related to this issue
- Backport SqlClient 659 to release/3.1: Fix async timeout design leading to intermittent wrong data Backport of https://github.com/dotnet/SqlClient/issues/659 to 3.1 Cherry-picks: - https://github.co... — committed to vseanreesermsft/corefx by cheenamalhotra 2 years ago
- Backport SqlClient 659 to release/3.1: Fix async timeout design leading to intermittent wrong data Backport of https://github.com/dotnet/SqlClient/issues/659 to 3.1 Cherry-picks: - https://github.co... — committed to CryptoPro/corefx by cheenamalhotra 2 years ago
The latest Microsoft.Data.SqlClient v2.1.2 has been released containing a fix for this issue. Closing as resolved.
The underlying bug that was found was caused by threadpool exhaustion. It is a bug in the TdsParser layer, which sits above the network layer, so the problem could occur in any version of the driver, but it was much more likely in managed (Linux, macOS) mode and with MARS because of the way they use threads.
The issue itself came down to the timing of internal cancellation messages and how they were implemented, which made it possible for a timeout to arrive after a result had been returned, or even worse, after another query had begun. What the PR does is harden the timeout handling by adding synchronous timeout checks and carefully tracking the origin of each async timeout, so that if it doesn't match the executing query it is discarded. This results in more accurate timeouts in high-contention scenarios and prevents cross-query timeouts, which were causing the strange mismatch in the driver.
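The actual change lives in the driver's internal TdsParser/SqlCommand plumbing; purely as an illustration of the discard-on-mismatch idea (names and structure invented for this sketch, not SqlClient internals), the technique looks roughly like this:

```csharp
// Illustration only (names and structure invented, not SqlClient internals):
// every operation gets a fresh id, the timeout callback remembers the id it
// was armed for, and a timeout whose id no longer matches the currently
// executing operation is discarded instead of being acted on.
using System;
using System.Threading;

class OperationTimeoutTracker
{
    private long _currentOperationId;

    // Called when a new query starts; returns the id that a timeout must match.
    public long BeginOperation() => Interlocked.Increment(ref _currentOperationId);

    // Arm an async timeout bound to one specific operation.
    public Timer ArmTimeout(long operationId, TimeSpan timeout, Action onTimeout)
    {
        return new Timer(_ =>
        {
            // A late-firing timer from a previous query is ignored here rather
            // than cancelling (and corrupting) whatever query is running now.
            if (Interlocked.Read(ref _currentOperationId) == operationId)
                onTimeout();
        }, null, timeout, Timeout.InfiniteTimeSpan);
    }
}
```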
I now have a consistent repro of this issue to rule out inconsistencies, as it has always been difficult to reproduce reliably. The issue is exaggerated by threadpool exhaustion; instead of exhausting resources every time, we are able to put a small delay on one of the delegates we know is causing the problem and can reproduce the wrong-data case.
Link to Repro
To run this repro, you'd need to reference the driver source project and update the OnTimeout function with a small delay of 10 seconds, so it looks like this:
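The original snippet was shared as an image, so this is a hedged reconstruction of the described change (the exact OnTimeout signature in the driver source may differ):

```csharp
// Hypothetical reconstruction -- the original snippet was shared as an image,
// and the exact OnTimeout signature in the driver source may differ. The only
// change is an artificial delay so the timeout delegate fires long after the
// query's results have already come back, making the race reproducible.
private void OnTimeout(object state)
{
    // Added for the repro only: hold the timeout delegate for 10 seconds.
    System.Threading.Thread.Sleep(TimeSpan.FromSeconds(10));

    // ... original OnTimeout body continues unchanged ...
}
```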
The captured output was as follows:
While we're trying our best to fix this by various methods, things are very flaky and the design is highly complex. Being able to reproduce consistently does give some hope, but there may be more cases where wrong data is possible. We will provide more updates as we identify more scenarios and solve this one.
@Wraith2, I've spent some time testing the pull request with my own repro case.
Earlier on, I was able to get the incorrect results with the generally available version 2.1.1: consistent wrong results across different attempts. For the sake of completeness: the same wrong results occurred with version 2.1.0.
Next, I tried Microsoft.Data.SqlClient.2.1.0-pull-3bdbb71.21048.9. This version never gave any incorrect results, so it looks like the fix solves the problem. 😃
Thank you all for the brilliant work to track down this elusive bug!
Hello!
I've created a new repro putting together items from this discussion, using docker-compose, so it can easily be executed locally while also simulating network failures: https://github.com/trainline/SqlClient659
I can confirm the problem can be reproduced even without MARS, although it is less frequent.
@cheenamalhotra Any suggestion on how to try to pinpoint the origin of the problem? Is there a diagnostic source available in the SqlClient?
Some more background information about my earlier repro-case, for reference.
The repro case ran as a Docker container in a Microsoft Azure environment, on a Linux App Service Plan (P1V2 and P2V2, both used for testing) and an Azure SQL database (S0). The base Docker image is mcr.microsoft.com/dotnet/core/aspnet:3.1.8-bionic, so a non-Windows environment on .NET Core 3.1. The connection string didn't include any MARS references, so everything ran in a non-MARS environment.
The base situation is a fully async operation: OpenAsync, ExecuteReaderAsync, and ReadAsync. I never used transactions in these scenarios, but used a low CommandTimeout (1 second) to provoke problems as much as possible.
The SQL statement I used is "select @Id as Id".
The issue of the wrong results only occurred when a "Timeout expired" exception appeared. Scaling the database up or down also causes exceptions, but didn't trigger any wrong results (or at least, I couldn't manage to make it do so).
The results. Base scenario: use of OpenAsync, ExecuteReaderAsync, and ReadAsync in a high-load situation with multiple threads and multiple requests. When things went wrong, they kept going wrong in a consistent manner. For instance, suppose there are three users:
- User 1: Select * FROM User where UserId = 1
- User 2: Select * FROM User where UserId = 2
- User 3: Select * FROM User where UserId = 3
User 1 caused an exception. User 2 received the results of User 1, User 3 received the results of User 2, and so on. Different requests got each other's results, even after some time, and more and more SQL statements returned wrong results. The connection pool became corrupted and stayed corrupted; the only solution was to restart the web app.
Mitigations: I tried a number of them.
Attempt 1: Use ConfigureAwait(false) for OpenAsync, ExecuteReaderAsync, and ReadAsync. No effect; wrong results keep appearing as in the base scenario.
Attempt 2: Use .Open() instead of OpenAsync(). Exceptions occurred more often (due to "Timeout expired"), but there were only a few wrong results (10 out of 5000 statements went wrong, instead of 4000+ out of 5000 in the base scenario). Furthermore, the wrong results weren't consistently wrong: new requests could be fine, so no restart of the application was required. It looked like some connections became corrupt, but somehow this didn't influence the connection pool for later connections.
Attempt 3: Change the Max Connection Pool Size. No noticeable effect on correctness, but performance was reduced significantly.
Results with the patch: in all situations, no wrong results occurred.
In summary: The only mitigation I could find, was to:
The patch solves the problem of wrong results appearing, so I'm looking forward to this hotfix being released… 😉
The problem could affect all versions of the library running on any version of .NET. It was more likely with the managed driver, but the root cause was shared: it was a timing issue.
Yes, every version of SqlClient prior to the fix could be affected. It doesn't look like the fix was backported to corefx, so the System.Data.SqlClient version retains the ability to encounter the problem: https://github.com/dotnet/corefx/pulls?q=is%3Apr+is%3Aclosed+label%3Aarea-System.Data.SqlClient. I don't know whether servicing was considered, but given the rarity of the problem and the difficulty of the fix I can see why it wouldn't be an obvious choice to push back to an LTS release.
Hi @cheenamalhotra, @Wraith2,
As you know, reproducing the issue on real-world applications, especially without MARS, is extremely challenging, so it is hard to prove the fix. I can confirm, however, that it doesn't introduce any appreciable regression.
Thanks for the great effort on this!
This fix applies to the SqlCommand layer and to the Begin and End methods of ExecuteReaderAsync, so MARS was not a factor, and neither is Managed SNI.
We also know ExecuteXmlReaderAsync has the same issue; I will apply the same fix there on Monday.
Results of a few mitigation attempts:
- Set Pooling = false (MultipleActiveResultSets=True;Pooling=false;): painfully slow and many timeouts, but no wrong results, which is good.
- Removing the connection that encountered an exception (a rough sketch follows this list): got wrong results on a third attempt.
- Attempting to execute a SQL command without attaching a transaction to the command: this was attempted in production when it was still unclear whether this is a MARS issue. It makes wrong results rarer, but they still happen. I was not able to reproduce it in this test project with a simplistic select statement.
- Using ExecuteScalarAsync instead of ExecuteReaderAsync: tried it 5 times and wrong results did not happen.
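A rough sketch of the "remove the connection that encountered an exception" attempt, assuming the public SqlConnection.ClearPool API (the method and class names here are invented and this is not the original test project's code):

```csharp
// Sketch of the "remove the failing connection" attempt (assumed names, not
// the original test project's code): on a SQL exception, clear the pool that
// the suspect connection belongs to so it cannot be handed out again.
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class PoolScrubbing
{
    public static async Task<object> ReadFirstValueAsync(string connectionString, string sql)
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        // Low timeout, as in the scenarios above, to provoke failures.
        using var command = new SqlCommand(sql, connection) { CommandTimeout = 1 };

        try
        {
            using var reader = await command.ExecuteReaderAsync();
            return await reader.ReadAsync() ? reader.GetValue(0) : null;
        }
        catch (SqlException)
        {
            // Drop every pooled connection that shares this connection string;
            // in the tests above this still produced wrong results eventually.
            SqlConnection.ClearPool(connection);
            throw;
        }
    }
}
```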
@DavoudEshtehari I hope this helps.
It could occur in all versions of SqlClient prior to the fix in 2.1.2
Correction: the fix has also been released in v1.1.4.
We're very keen, @Wraith2; we just need to find a moment, having spent engineering time moving Linux services back to Windows to mitigate the issue.
Hi Everyone!
We'd like to request community validation of the fix from PR #906. We've captured the root cause and fixed the issue in the core parser layer.
Please give the below NuGet package a try and send us your feedback: Microsoft.Data.SqlClient.2.1.0-pull-89d6662.21043.7 from Pipeline Artifacts
That’s great news, I was really struggling to find a good repro for the non-MARS scenario! I will start testing this build next week.
Hi everyone!
Thank you for your patience; we seem to have a fix available to try. We would ask that you please try out the PR https://github.com/dotnet/SqlClient/pull/894 and let us know!
Download Build CI NuGet [Link Updated for final changes | 04-02-2021]
Hi Everyone!
Providing this week’s progress so far:
We have managed to capture and study detailed traces of this behavior, including Event Source, SQL Server Profiler, and Wireshark, to piece together the full story, which I'll detail below:
Fact we know: when an exception occurs on the client side, the client shall send an Attention packet to the server to make it immediately stop processing and not send any further data. On the client side, sending attention can mean something too, for example ensuring the transaction is rolled back and data at rest is cleared properly, or deciding whether the connection should be killed and not reused. It depends on when and how we receive an exception and send the attention.
Problem we see here: there are two situations in play when the "Execution Timeout Expired" exception is about to be received, and it looks like there's a race condition between the transaction commit and the Attention packet when a read timeout occurs in SNIMarsHandle. More precisely, the race condition is between throwing an exception and providing results.
We know this is related to a client-side timeout that occurs during Read, as read operations are blocked due to thread starvation: even though the server sends data on time, the client does not read it on time. But even if we manage to fix the thread starvation issue, that is not the solution to this particular problem, as the BAD FLOW is the real issue and can occur in any similar situation.
Will provide more updates as we continue to study our traces to dig deeper and find a solution.
While I was preparing the requested changes, I found out it is happening again. Maybe the network conditions had simultaneously affected my test with the last change. The issue intermittently happens with either System.Data.SqlClient or Microsoft.Data.SqlClient when ExecuteReaderAsync blends with exceptions.
Yes, this matches my experience. As I mentioned, the issue seems to be related to timeouts, and they cannot happen if the DB server and the code are on the same machine and the query is doing basically nothing.
Try to introduce some latency between the DB server and the application.
I reproduced it using a somewhat crude approach: running the DB server in the cloud and the application on my laptop, with a router in another room (so there is considerable latency between the two). Maybe there are tools to do this more cleanly.
In the production environment, AWS starts throttling read operations when they go above the allowed IOPS limit; that's how it originally happened when running on AWS.