SqlClient: Reading large fields asynchronously is extremely slow
Reading a large (e.g. 5MB) string asynchronously with SqlDataReader is 10 times slower than reading it synchronously (see BDN benchmark below); at the same time, it appears to allocate much less memory. The same behavior reproduces both with CommandBehavior.Default and CommandBehavior.SequentialAccess, although I suspect there may be some interaction with the command behavior under the hood. Memory-wise, the expectation would be for SequentialAccess to use much less memory compared to the default command behavior - regardless of sync/async.
The tests were run with .NET Core 3.0 on Ubuntu 19.04 against SQL Server running on localhost.
Related links:
- https://stackoverflow.com/a/28619983/640325, with interesting profiling and benchmarking on .NET Framework (from 2015) showing a very related but slightly different issue (only async + CommandBehavior.Default trigger the x10 slowdown)
- https://github.com/aspnet/EntityFrameworkCore/issues/18221, discussing this at the EF Core level.
Benchmark code:
using System.Data;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient

[MemoryDiagnoser]
public class Benchmarks
{
    const string ConnectionString = "...";

    [Params(CommandBehavior.Default, CommandBehavior.SequentialAccess)]
    public CommandBehavior Behavior { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        using var conn = new SqlConnection(ConnectionString);
        conn.Open();

        using (var cmd = new SqlCommand("IF NOT EXISTS (SELECT * FROM sysobjects WHERE name='TextTable' AND xtype='U') CREATE TABLE [TextTable] ([Text] VARCHAR(MAX))", conn))
            cmd.ExecuteNonQuery();

        using (var cmd = new SqlCommand("INSERT INTO [TextTable] ([Text]) VALUES (@p)", conn))
        {
            cmd.Parameters.AddWithValue("p", new string('x', 1024 * 1024 * 5));
            cmd.ExecuteNonQuery();
        }
    }

    [Benchmark]
    public async ValueTask<int> Async()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        await conn.OpenAsync();
        await using var reader = await cmd.ExecuteReaderAsync(Behavior);
        await reader.ReadAsync();
        return (await reader.GetFieldValueAsync<string>(0)).Length;
    }

    [Benchmark]
    public async ValueTask<int> Sync()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        conn.Open();
        using var reader = cmd.ExecuteReader(Behavior);
        reader.Read();
        return reader.GetFieldValue<string>(0).Length;
    }
}
Results:
BenchmarkDotNet=v0.11.5, OS=ubuntu 19.04
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
[Host] : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
Method | Behavior | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
Async | Default | 839.58 ms | 43.2838 ms | 127.6233 ms | 930.22 ms | 98000.0000 | 97000.0000 | 96000.0000 | 11.31 KB |
Sync | Default | 61.07 ms | 1.0988 ms | 1.0278 ms | 60.80 ms | 333.3333 | 222.2222 | 222.2222 | 10600.7 KB |
Async | SequentialAccess | 813.80 ms | 44.3059 ms | 130.6370 ms | 767.23 ms | 127000.0000 | 126000.0000 | 125000.0000 | 11.31 KB |
Sync | SequentialAccess | 69.54 ms | 0.6815 ms | 0.6375 ms | 69.45 ms | 375.0000 | 250.0000 | 250.0000 | 10672.5 KB |
Well there’s definitely a problem with how async reading of PLP data is done. It’s not a simple one though, it’s deeply rooted in how async works. I’ll explain what I’ve found so far because it’s as good a place as any to put the information for later reference.
A standard column value is of fixed length and easy to deal with. Non-fixed-width data has to be prefixed with its length and then possibly reconstructed from multiple chunks. There’s also an unknown-length variant where the user just has to keep requesting data until they reach an end-of-data marker. We’re dealing with the middle option: this data spans multiple packets and is formatted as partially length prefixed (PLP) data with a known length.
When reading this in sync mode it’s pretty straightforward: you just keep getting packets and pulling their contents out into the buffer at the current offset until you reach the end. Then you use the text encoding that corresponds to the column to convert your bytes into a string. Almost all of the memory overhead on the sync path is this string conversion. The only interruptions that are possible are internally generated.
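In simplified form, the sync path amounts to something like the sketch below. This is a hedged illustration only; `SyncPlpSketch`, `readNextChunk`, and the signatures are made up for the example and are not SqlClient internals.

```csharp
using System;
using System.Text;

// Simplified sketch of the sync PLP read path (illustrative only, not the actual
// SqlClient parser code). 'readNextChunk' stands in for the parser pulling the
// next chunk out of the current packet, blocking for the next packet as needed,
// and returning null at the end of the value.
static class SyncPlpSketch
{
    public static string ReadPlpString(Func<byte[]> readNextChunk, int totalLength, Encoding encoding)
    {
        var buffer = new byte[totalLength];
        int offset = 0;

        // Keep copying chunk contents into the buffer at the current offset
        // until the end of the value is reached.
        for (byte[] chunk = readNextChunk(); chunk != null; chunk = readNextChunk())
        {
            Buffer.BlockCopy(chunk, 0, buffer, offset, chunk.Length);
            offset += chunk.Length;
        }

        // The one big allocation on the sync path: the final string conversion.
        return encoding.GetString(buffer, 0, offset);
    }
}
```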
In async mode things get complicated. Because any async method has to be interruptible by cancellation tokens and external events, you have to store state until a resolution to the read has been reached. This mechanism is snapshots. Every async operation creates a snapshot of the current reader state as its first action. Data is then requested, and when it is received it is appended to the snapshot’s packet list and the snapshot is replayed to see if it can reach a resolution.
When you start reading the data from the first packet in the snapshot, the data is identified as being part of a larger array of known length. An array is allocated for the full length and an attempt is made to read all the data; this fails because we don’t have all the data yet (most of the time). Once it fails, the current packet is consumable but not consumed, so we can’t discard it; it’s left in the snapshot and we wait for more data.
Hopefully you’ve just seen the problem. For every packet that arrives, all packets in the unresolved chain are re-processed in the snapshot replay, and that reprocessing allocates an array large enough for the entire output every time. Divide the 5MiB data size by the roughly 8K chunks of the default packet size and you’ll have that number minus one dropped 5MiB arrays. The profile traces are quite impressive.
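As a back-of-envelope model of that allocation pattern (not SqlClient’s actual code; the ~8 KB packet size and the loop below are assumptions for illustration):

```csharp
using System;

const int ValueLength = 5 * 1024 * 1024; // the 5 MB value from the benchmark
const int PacketSize = 8 * 1024;         // roughly the default TDS packet size
int packets = (ValueLength + PacketSize - 1) / PacketSize; // ~640 packets

long wastedBytes = 0;
for (int received = 1; received < packets; received++)
{
    // Each replay allocates the full-size target array again, fills what it can
    // from the packets received so far (the copy itself is omitted here), fails
    // to reach the end, and drops the array.
    byte[] target = new byte[ValueLength];
    wastedBytes += target.Length;
}

// ~639 discarded 5 MB arrays, i.e. more than 3 GB of garbage for a single 5 MB
// read, which lines up with the huge Gen 0/1/2 numbers in the benchmark table.
Console.WriteLine($"{packets} packets, ~{wastedBytes / (1024.0 * 1024 * 1024):F1} GiB discarded");
```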
I’ve already got one PR in progress that optimizes the mechanics of snapshots, but it won’t help here; this is a logical problem. What I’d like to understand is why it was chosen to implement a full replay each time a packet arrives rather than keeping some additional state in the snapshot. As I see it there is no branch capability in the replay: the only possible results are {Success, Failure, Continue}, and if that’s the case and backtracking can’t happen, it would be possible to augment snapshots with continuation state. To know this you’ll need to find someone who was around at the time or implemented this functionality, because they’ll know what the limiting factor was here; there could well be a hidden requirement I can’t divine from the code.
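One possible shape for that continuation state, purely as a sketch of the idea being asked about (none of these types exist in SqlClient):

```csharp
using System;

// Hypothetical state a snapshot could carry so that each newly arrived packet
// is parsed only once, instead of replaying the whole unresolved chain.
enum ReplayResult { Success, Failure, Continue }

sealed class PlpContinuation
{
    public byte[] Target = Array.Empty<byte>(); // allocated once, at the declared PLP length
    public int BytesCopied;                     // how much of Target has been filled so far
    public int PacketsConsumed;                 // resume point: already-parsed packets can be skipped
}
```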
This is still a major issue.
I have been troubleshooting spikes in CPU and slow SQL statements only to come to this thread.
GetClaimsAsync in Microsoft.AspNetCore.Identity issues SQL against a table that uses nvarchar(max) for claims and values.
Changing the columns to nvarchar(255) seems to resolve the issue.
This issue is having a knock-on effect on other libraries 😦
Here we go, from the release notes for BDN 0.12.0:
I’ll reach out to the BDN folks, but I’d tend to trust the new results by default unless we find out otherwise…
@Wraith2 can you share your project that you used to benchmark this issue?
I had a chat with @roji and we are seeing worse benchmarks than were originally reported.
We are wondering how the original results were obtained. @roji is going to backtrack to the version of SqlClient that he may have run the original benchmarks on.
@Wraith2, in the meanwhile, can you provide your repro project? In case this is a regression in the version of SqlClient that we have been testing with, we need to remove the regression first.
The repos we have used are https://github.com/roji/SqlClientAsyncBenchmark and https://github.com/saurabh500/sqlclientrepros/tree/master/asyncperf
I noted in my PR https://github.com/dotnet/SqlClient/pull/285 that BDN wasn’t showing the correct memory figures and used dotMemory results to show the difference made by my changes. The new numbers you give look much more reasonable.
I suspect the first iteration of the async code will have used sync-over-async, in which case you consume an additional thread but get identical (fairly good) performance to the sync version. I’d have to find an original version from that time and dive into the IL to verify that, and I’m not sure it helps to do so; I’d just accept that perf used to be better and get on with improving it for the future.
All the variable-length fields (varchar, varbinary, image, ntext, etc.) seem to go down the same route to get the bytes, and then, if needed, they’re transformed once the entire contents are known. That places an upper limit on the size of those fields, which is worked around with streaming capabilities like GetTextReader, as you mentioned, which bypass buffering the whole thing; that’s the non-prefixed variable length I mentioned.
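For reference, the streaming route looks roughly like this (a sketch assuming an already-open SqlConnection and the benchmark’s TextTable; error handling omitted):

```csharp
using System.Data;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient

static class StreamingSketch
{
    // Streams the column instead of buffering it: SequentialAccess plus
    // GetTextReader exposes the value as a TextReader, so the full 5 MB string
    // is never materialized by the data reader itself.
    public static async Task<long> StreamTextColumnAsync(SqlConnection conn)
    {
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        using var reader = await cmd.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
        await reader.ReadAsync();

        long totalChars = 0;
        using TextReader text = reader.GetTextReader(0);
        var chunk = new char[8192];
        int read;
        while ((read = await text.ReadAsync(chunk, 0, chunk.Length)) > 0)
            totalChars += read; // consume the chars here instead of accumulating them
        return totalChars;
    }
}
```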
Yes. But the snapshot is present for good reasons, because some things like attn and envchange need to be processed on the real current state even when received in a snapshot context. There’s plenty of async closure stuff going on in the library, and that might even be part of the problem here.
If I squint a bit I can sort of see how this could be corrected to be a linear operation. It’s going to get a bit messy because the snapshot functionality is integrated with a couple of different layers, but there’s no doubt it’s worth doing. I would like some confirmation that no backtracking is possible when processing a PLP chunk stream, though.
Thanks for the detailed info @Wraith2, it’s indeed quite interesting and raises some questions.
Two immediate comments that I have:
I can’t speak about intent. I can tell you how the code behaves when that option is set, though. The most relevant behaviour is that it specifically maintains only the SqlBuffer for the column currently being read, clearing out previous ones as reads are done, so you can’t step back. That does align with the purpose of reading large values. I don’t remember the de-packetization chunk handling code being particularly specialized. I’m setting up for some benching and I’ll see what I can find and if there’s any room for improvement.
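To illustrate that constraint (a hedged sketch; assume `cmd` is a SqlCommand whose query returns at least two string columns): with SequentialAccess only the column currently being read is kept, so stepping back to an earlier ordinal fails.

```csharp
using var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
reader.Read();

var second = reader.GetString(1); // fine: moving forward through the row
var first = reader.GetString(0);  // throws InvalidOperationException: ordinal 0 is already behind the reader
```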