SqlClient: CancellationToken registration leak, regression
Describe the bug
After updating to SqlClient 5.0.1, we observe a memory leak in our application. Memory dumps show a huge number of CancellationTokenSource+CallbackNode entries (we had a dump with 2 GB of those), referencing SqlCommand objects.
We believe this might be due to the changes in:
- https://github.com/dotnet/SqlClient/pull/1785 for 5.0.1 backport
- https://github.com/dotnet/SqlClient/pull/1781 for 5.1.0
With the changes in these PRs, it seems the CancellationToken registration is not always disposed. There are several if blocks that may return before the registration object is saved in the context, and exceptions could also be thrown before that happens.
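To make the suspected failure mode concrete, here is a minimal hypothetical sketch (the names and shape are ours, not SqlClient's actual implementation) of how an early return can strand a registration whose callback keeps the command alive:

    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical sketch only: a registration is created up front, but one code path returns
    // before the registration is either disposed or handed to the context that owns cleanup.
    internal sealed class FakeCommand
    {
        private CancellationTokenRegistration _contextRegistration; // "saved in the context"

        public Task<bool> ReadAsync(bool canCompleteFromBuffer, CancellationToken cancellationToken)
        {
            // The callback captures 'this', so the token's CallbackNode now references the command.
            CancellationTokenRegistration registration =
                cancellationToken.Register(static state => ((FakeCommand)state!).Cancel(), this);

            if (canCompleteFromBuffer)
            {
                // Early return: the registration is never disposed and never stored in the context,
                // so the CallbackNode (and everything it references) stays alive as long as the token does.
                return Task.FromResult(true);
            }

            _contextRegistration = registration; // only this path takes ownership
            return ReadSlowAsync();
        }

        private Task<bool> ReadSlowAsync() => Task.FromResult(false);

        public void Cleanup() => _contextRegistration.Dispose(); // the context's cleanup path

        public void Cancel()
        {
            // cancel the in-flight request
        }
    }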
To reproduce
    using Microsoft.Data.SqlClient; // 5.0.1
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public static class Program
    {
        public static async Task Main()
        {
            const string query = "SELECT 1";

            // this could be the application lifetime cancellation
            // or another form of global shared cancellation
            using var longRunningCts = new CancellationTokenSource();

            var connectionString = new SqlConnectionStringBuilder
            {
                DataSource = "(local)",
                IntegratedSecurity = true,
                TrustServerCertificate = true,
            }.ToString();

            try
            {
                using (var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(longRunningCts.Token))
                using (var connection = new SqlConnection(connectionString))
                {
                    await connection.OpenAsync(linkedCts.Token);
                    using (var command = new SqlCommand(query, connection))
                    using (var reader = await command.ExecuteReaderAsync(linkedCts.Token))
                    {
                        linkedCts.Cancel();
                        while (await reader.ReadAsync(linkedCts.Token))
                        {
                        }
                    }
                }
            }
            catch (Exception)
            {
                // the cancelled read is expected to throw; swallow it for the repro
            }

            // => attach a debugger here
            // linkedCts as well as all the SQL objects have been disposed above
            // however a memory snapshot shows that the CancellationTokenSource+Registrations and
            // CancellationTokenSource+CallbackNode entries are still there, including the objects
            // they reference, so they cannot be garbage collected
        }
    }
Expected behavior
No memory leak, CancellationTokenSource+CallbackNode entries are not present in a dump.
Further technical details
Microsoft.Data.SqlClient version: 5.0.1
.NET target: .NET 6.0.3
SQL Server version: SQL Server 2019
Operating system: Windows 2019
I’m having this issue also. After moving to 5.0.1, my Linux server leaked memory rapidly until crashing. Analyzing a memory dump revealed millions of instances of SqlCommand rooted only by the cancellation token registration (CallbackNode). Even after just a couple of hours, I can confirm that reverting to 5.0.0 has solved the problem. For an in-depth analysis, including dump screenshots, system configuration, and other information, please see this issue:
(Azure App Service - Linux - Memory Working Set - last 30 days)
(Azure App Service - Linux - Memory Working Set - last 3 days)
I recommend marking 5.0.1 as unstable on NuGet for three reasons:
I think many out there (like me) who took way too long to discover this as the root cause are going to really appreciate upgrading to 5.1.0 when it is released.
I believe I have tracked down the problem. Any call to ReadAsync when there is enough data already read in the buffer to return a result without awaiting will cause an “unwanted retention”. This is due to the following code, which short-circuits the async code to immediately return true if the next row can be read from the buffer: https://github.com/dotnet/SqlClient/blob/main/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs#L4800
This explains how a small number of SQL calls can have such drastic consequences. For example, code reading a result set of small integer rows might add a hundred or more “unwanted retentions”, given how many integer values can be contained within a single data packet from the server.
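To illustrate the kind of call pattern that amplifies this (a sketch of ours with a placeholder query, not the commenter's sample), consider a loop over a result set where almost every row is already buffered:

    using Microsoft.Data.SqlClient;
    using System.Threading;
    using System.Threading.Tasks;

    public static class BufferedReadExample
    {
        // Illustrative only: reads thousands of small integer rows. Most ReadAsync calls complete
        // synchronously from data already buffered from the current packet, and under 5.0.1 each
        // such call can leave one CancellationTokenSource CallbackNode behind.
        public static async Task<int> CountRowsAsync(string connectionString, CancellationToken token)
        {
            await using var connection = new SqlConnection(connectionString);
            await connection.OpenAsync(token);

            // Placeholder query producing many small rows; any similar result set will do.
            const string manyIntegers =
                "SELECT TOP (10000) CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) AS n " +
                "FROM sys.all_objects a CROSS JOIN sys.all_objects b";

            await using var command = new SqlCommand(manyIntegers, connection);
            await using var reader = await command.ExecuteReaderAsync(token);

            var count = 0;
            while (await reader.ReadAsync(token))
            {
                count++; // one potential "unwanted retention" per buffered row under 5.0.1
            }
            return count;
        }
    }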
As before, I would strongly urge you to consider deprecating 5.0.1 immediately for the reasons stated previously. Below is a repro sample of the code necessary to demonstrate the “unwanted retentions”:
Sample:
Which produces the following:
With version 5.0.0:
With version 5.0.0-pull.55891:
As promised, here is the memory usage graph for 5.0.0-pull.55891:
PR #1818 seems to solve the problem, although I question the logic of repeatedly registering a cancellation callback only to immediately remove it (the logic introduced in #1781). It would seem that the cancellation callback should only be registered once the short-circuit logic has been evaluated. However, I have not done a full review of #1781 and #1065.
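A minimal sketch of that alternative ordering, using hypothetical names rather than actual SqlClient code: evaluate the fast path first and only register the callback when the call will actually await, so the short-circuit path never has a registration to clean up.

    using System.Threading;
    using System.Threading.Tasks;

    internal sealed class DeferredRegistrationReader
    {
        public Task<bool> ReadAsync(bool rowAvailableInBuffer, CancellationToken cancellationToken)
        {
            if (rowAvailableInBuffer)
            {
                // Fast path: nothing was registered, so nothing can leak.
                return Task.FromResult(true);
            }

            // Slow path only: register now and tie the registration's lifetime to the async work.
            return ReadSlowAsync(cancellationToken);
        }

        private async Task<bool> ReadSlowAsync(CancellationToken cancellationToken)
        {
            using CancellationTokenRegistration registration =
                cancellationToken.Register(static state => ((DeferredRegistrationReader)state!).Cancel(), this);

            await Task.Yield(); // placeholder for the real network read
            return false;
        }

        public void Cancel()
        {
            // cancel the in-flight request
        }
    }

That said, a later comment notes the registration was deliberately moved up to catch timing-related cancellation windows, so deferring it may not be practical on every path.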
So the plan is to leave 5.0.1 as-is with a memory leak (unwanted retention) that occurs for almost every call to ReadAsync, affecting probably every single user of this version of the library, and some critically so?
We are seeing crashing AKS K8S pods with OutOfMemory exceptions after upgrading to 5.0.1. Memory usage seems to increase by 10-20 MB for each processing run that involves EF Core. You really can’t leave this bug unfixed until 5.1.0 is out. You know the reason and how to fix it, so please coordinate with the EF Core team and release both a 5.0.2 and an updated EF 7 package that requires 5.0.2 at the same time. Thank you.
May I suggest releasing a 5.0.2 anyway? The new EF 7 SQL Server package depends on 5.0.1 now; see Microsoft.EntityFrameworkCore.SqlServer.
You might want to reconsider 😉
Afaik NuGet automatically resolves the lowest version of a package that fits the constraints, so until a new EF Core SqlServer package is released that references a new version of the client package (be it 5.0.2 or 5.1), everyone needs to manually reference the fixed package.
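Until then, one workaround is a direct package reference in the affected projects, since a direct reference overrides the transitive 5.0.1 pulled in by the EF Core package; a sketch, with the 5.0.2 version number standing in for whichever patched release ships the fix:

    <ItemGroup>
      <!-- Direct reference wins over the transitive Microsoft.Data.SqlClient 5.0.1 pulled in by
           Microsoft.EntityFrameworkCore.SqlServer; 5.0.2 is illustrative of the patched release. -->
      <PackageReference Include="Microsoft.Data.SqlClient" Version="5.0.2" />
    </ItemGroup>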
Thanks @lcheunglci. FYI I edited the issue to add a code sample to reproduce the problem.
Yes, it looks like what is in now will be in 5.1: https://github.com/dotnet/SqlClient/projects/11
Yep, we’re very aware. I think it makes sense for an EF Core patch release to take a dependency on SqlClient 5.0.2 when that’s released - we’ll be discussing that.
Sounds like a good idea. We’ll have to make sure the EF Core team are aware that they might get some problems with the current version, and I’m sure they’ll pick up the later version as soon as it’s available.
/cc @ajcvickers @roji
So the fix is confirmed good. I’ll investigate the position of the registration to see if it can be moved without regressing the fix that moved it up, but the order of operations is clearly important for catching some timing-related behaviours, so while I agree it would be nice to put it as low as possible, that may not be practical.
I installed the NuGet package and will report tomorrow with a memory usage graph. I can verify that it is loaded in memory on my production server, as this code now returns “5.0.0-pull.55891” in the live environment.
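For anyone wanting to run a similar check, one way to read the informational version of the SqlClient assembly actually loaded at runtime is sketched below (an illustration, not necessarily the exact code used above):

    using System;
    using System.Reflection;
    using Microsoft.Data.SqlClient;

    // Prints the informational version baked into the loaded Microsoft.Data.SqlClient assembly;
    // for a CI build such as the one above, this should contain the "pull" suffix.
    var informationalVersion = typeof(SqlConnection).Assembly
        .GetCustomAttribute<AssemblyInformationalVersionAttribute>()?
        .InformationalVersion;

    Console.WriteLine(informationalVersion ?? "unknown");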
It is included in the preview2 milestone and will hopefully be merged for the next preview release, unless reviewers find some odd behaviors/regressions in the changes.
There’s a fairly simple way to solve it: use a mutable ref struct to hold the registration, and let the context Take() it if needed. If the struct still has a value when it is disposed, you dispose the value; if not, it has been taken and is now owned by the context, which will clean it up.
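A minimal sketch of that idea, using hypothetical names rather than SqlClient's actual types: a disposable ref struct owns the registration until the context takes it.

    using System.Threading;

    // Hypothetical helper: holds a CancellationTokenRegistration until ownership is transferred.
    internal ref struct RegistrationHolder
    {
        private CancellationTokenRegistration _registration;
        private bool _hasValue;

        public RegistrationHolder(CancellationTokenRegistration registration)
        {
            _registration = registration;
            _hasValue = true;
        }

        // Transfers ownership to the caller (e.g. the async context); after this, Dispose is a no-op.
        public CancellationTokenRegistration Take()
        {
            _hasValue = false;
            return _registration;
        }

        // If ownership was never transferred (early return or exception), dispose the registration here.
        public void Dispose()
        {
            if (_hasValue)
            {
                _registration.Dispose();
                _hasValue = false;
            }
        }
    }

On the happy path the method calls Take() and stores the registration in the context before returning; on every other path (early return or exception) the holder's Dispose runs and cleans up the registration.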