SqlClient: Reading large fields asynchronously is extremely slow

Reading a large (e.g. 5MB) string asynchronously with SqlDataReader is 10 times slower than doing it synchronously (see BDN benchmark below); at the same time, it allocates much less memory. The same behavior reproduces with both CommandBehavior.Default and CommandBehavior.SequentialAccess, although I suspect there may be some interaction with the command behavior under the hood. Memory-wise, the expectation would be for SequentialAccess to use much less memory than the default command behavior - regardless of sync/async.

The tests were run with .NET Core 3.0 on Ubuntu 19.04 against SQL Server running on localhost.

Benchmark code:

[MemoryDiagnoser]
public class Benchmarks
{
    const string ConnectionString = "...";

    [Params(CommandBehavior.Default, CommandBehavior.SequentialAccess)]
    public CommandBehavior Behavior { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        using var conn = new SqlConnection(ConnectionString);
        conn.Open();

        using (var cmd = new SqlCommand("IF NOT EXISTS (SELECT * FROM sysobjects WHERE name='TextTable' AND xtype='U') CREATE TABLE [TextTable] ([Text] VARCHAR(MAX))", conn))
            cmd.ExecuteNonQuery();

        using (var cmd = new SqlCommand("INSERT INTO [TextTable] ([Text]) VALUES (@p)", conn))
        {
            cmd.Parameters.AddWithValue("p", new string('x', 1024 * 1024 * 5));
            cmd.ExecuteNonQuery();
        }
    }

    [Benchmark]
    public async ValueTask<int> Async()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        await conn.OpenAsync();

        await using var reader = await cmd.ExecuteReaderAsync(Behavior);
        await reader.ReadAsync();
        return (await reader.GetFieldValueAsync<string>(0)).Length;
    }

    [Benchmark]
    public async ValueTask<int> Sync()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        conn.Open();

        using var reader = cmd.ExecuteReader(Behavior);
        reader.Read();
        return reader.GetFieldValue<string>(0).Length;
    }
}

Results:


BenchmarkDotNet=v0.11.5, OS=ubuntu 19.04
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT


| Method |         Behavior |      Mean |      Error |      StdDev |    Median |        Gen 0 |        Gen 1 |        Gen 2 |  Allocated |
|------- |----------------- |----------:|-----------:|------------:|----------:|-------------:|-------------:|-------------:|-----------:|
|  Async |          Default | 839.58 ms | 43.2838 ms | 127.6233 ms | 930.22 ms |   98000.0000 |   97000.0000 |   96000.0000 |   11.31 KB |
|   Sync |          Default |  61.07 ms |  1.0988 ms |   1.0278 ms |  60.80 ms |     333.3333 |     222.2222 |     222.2222 | 10600.7 KB |
|  Async | SequentialAccess | 813.80 ms | 44.3059 ms | 130.6370 ms | 767.23 ms |  127000.0000 |  126000.0000 |  125000.0000 |   11.31 KB |
|   Sync | SequentialAccess |  69.54 ms |  0.6815 ms |   0.6375 ms |  69.45 ms |     375.0000 |     250.0000 |     250.0000 | 10672.5 KB |

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 9
  • Comments: 20 (14 by maintainers)

Most upvoted comments

Well there’s definitely a problem with how async reading of PLP data is done. It’s not a simple one though, it’s deeply rooted in how async works. I’ll explain what I’ve found so far because it’s as good a place as any to put the information for later reference.

A standard column value is of fixed length and easy to deal with. Non-fixed-width data has to be prefixed with its length and then possibly reconstructed from multiple chunks. There's also an unknown-length variant where the user just has to keep requesting data until they reach an end-of-data packet. We're dealing with the middle option: data that spans multiple packets and is formatted as partially length prefixed (PLP) data with a known length.

When reading this in sync mode it's pretty straightforward: you keep getting packets and copying their contents out at the current offset in the byte buffer until you reach the end. Then you call the text encoding that corresponds to the column to convert the bytes into a string. Almost all of the memory overhead on sync is this string conversion. The only interruptions that are possible are internally generated.
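That sync path has roughly this shape (a minimal sketch of my own, assuming the bytes arrive over some stream; not the actual SqlClient internals, and the method and parameter names are made up):

static string ReadKnownLengthString(Stream network, int totalLength, Encoding encoding)
{
    var buffer = new byte[totalLength];              // one buffer for the whole value
    var offset = 0;
    while (offset < totalLength)
    {
        // blocking read; each call roughly corresponds to "get the next packet"
        var read = network.Read(buffer, offset, totalLength - offset);
        if (read == 0) throw new EndOfStreamException();
        offset += read;                              // append at the current offset
    }
    // single conversion at the end - this is where most of the sync allocations come from
    return encoding.GetString(buffer);
}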

In async mode things get complicated. Because any async method has to be interruptible by cancellation tokens and external events, you have to handle state storage until a resolution to the read has been reached. The mechanism for this is snapshots. All async operations create a snapshot of the current reader state as their first action. Data is then requested, and when received it is appended to the snapshot packet list and the snapshot is replayed to see if it can reach a resolution.

When you start reading the data from the first packet in the snapshot, the data is identified as being part of a larger array of known length. An array is allocated for the full length and an attempt is made to read all the data; this fails because we don't have all the data (most of the time). Once it fails, the current packet is consumable but not consumed, so we can't discard it; it's left in the snapshot and we wait for more data.

Hopefully you've just seen the problem. For every packet that arrives, all packets in the unresolved chain are re-processed in the snapshot replay, and that reprocessing allocates an array large enough for all the output every time. So divide the 5 MiB data size by the roughly 8 KB default packet size and you'll have that number, minus one, of dropped 5 MiB arrays. The profile traces are quite impressive.
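Schematically, the pattern being described looks something like this (my illustration using hypothetical delegates, not the actual snapshot code):

static async Task<byte[]> ReadPlpViaReplayAsync<TPacket>(
    Func<Task<TPacket>> receivePacketAsync,           // hypothetical: wait for the next packet
    Func<IReadOnlyList<TPacket>, byte[], int> replay, // hypothetical: re-process ALL buffered packets into output
    int totalPlpLength)
{
    var snapshotPackets = new List<TPacket>();
    while (true)
    {
        snapshotPackets.Add(await receivePacketAsync());

        // Each replay starts over from the first buffered packet...
        var output = new byte[totalPlpLength];        // ...and allocates the full-size output array again
        if (replay(snapshotPackets, output) == totalPlpLength)
            return output;                            // resolution reached

        // Not enough data yet: 'output' becomes garbage and the next packet triggers another full replay.
        // With a 5 MiB value and ~8 KB packets, that's hundreds of dropped 5 MiB arrays.
    }
}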

I've already got one PR in progress that optimizes the mechanics of snapshots, but it won't help here; this is a logical problem. What I'd like to understand is why a full replay on each arriving packet was chosen rather than keeping some additional state in the snapshot. As I see it there is no branch capability in the replay, the only possible results are {Success, Failure, Continue}, and if that's the case and backtracking can't happen, it would be possible to augment snapshots with continuation state. To know this you'll need to find someone who was around at the time or implemented this functionality, because they'll know what the limiting factor was; there could well be a hidden requirement I can't divine from the code.
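For reference, the kind of continuation state being proposed might look something like this (purely illustrative; the names are mine and don't exist in the codebase):

enum SnapshotReplayResult { Success, Failure, Continue }

// State a snapshot could carry between packets so that a replay can resume
// where it left off instead of restarting from the first buffered packet.
sealed class PlpContinuationState
{
    public byte[] Output;         // full-length output array, allocated once
    public int BytesCopied;       // how far into Output we have written
    public int CurrentPacket;     // index of the packet we stopped in
    public int OffsetInPacket;    // offset within that packet
}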

This is still a major issue.

I have been troubleshooting spikes in CPU and slow SQL statements only to come to this thread.

GetClaimsAsync in Microsoft.AspNetCore.Identity issues SQL against a table that uses nvarchar(max) columns for claims and values.

Changing the table to nvarchar(255) seems to resolve the issue.

This issue is having a knock-on effect on other libraries 😦

Here we go, from the release notes for BDN 0.12.0:

Improved MemoryDiagnoser: MemoryDiagnoser now includes memory allocated by all threads that were live during benchmark execution. A new GC API was exposed in .NET Core 3.0 preview6+ that allows getting the number of allocated bytes for all threads. #1155, #1153, #723

I’ll reach out to the BDN folks, but I’d tend to trust the new results by default unless we find out otherwise…

@Wraith2 can you share your project that you used to benchmark this issue?

I had a chat with @roji and we are seeing worse benchmarks than were originally reported.

| Method |         Behavior |       Mean |    Error |    StdDev |     Median |       Gen 0 |       Gen 1 |       Gen 2 |  Allocated |
|------- |----------------- |-----------:|---------:|----------:|-----------:|------------:|------------:|------------:|-----------:|
|  Async |          Default |   961.5 ms | 35.20 ms | 103.80 ms | 1,007.5 ms | 105000.0000 | 104000.0000 | 104000.0000 | 3661.01 MB |
|   Sync |          Default |   893.9 ms |  3.09 ms |   2.89 ms |   893.8 ms |           - |           - |           - |      15 MB |
|  Async | SequentialAccess | 1,888.8 ms | 42.94 ms | 126.60 ms | 1,929.5 ms | 173000.0000 | 172000.0000 | 172000.0000 | 3525.79 MB |
|   Sync | SequentialAccess | 1,878.2 ms | 37.43 ms |  71.21 ms | 1,865.5 ms |           - |           - |           - |      15 MB |

We are wondering how the original results were obtained. @roji is going to backtrack to the version of SqlClient that he may have run the original benchmarks on.

@Wraith2, in the meanwhile, can you provide your repro project? In case this is a regression in the version of SqlClient that we have been testing with, we need to remove the regression first.

The repos we have used are https://github.com/roji/SqlClientAsyncBenchmark and https://github.com/saurabh500/sqlclientrepros/tree/master/asyncperf

I noted in my PR https://github.com/dotnet/SqlClient/pull/285 that BDN wasn’t showing the correct memory figures and used dotMemory results to show the difference made by my changes. The new numbers you give look much more reasonable.

I suspect the first iteration of the async implementation will have used sync-over-async, in which case you consume an additional thread but get identical (fairly good) performance to the sync version. I'd have to find an original version from that time and dive into the IL to verify that, and I'm not sure it helps to do so; I'd just accept that perf used to be better and get on with improving it for the future.

All the variable-length fields (varchar, varbinary, image, ntext, etc.) seem to go down the same route to get the bytes, and then, if needed, they're transformed once the entire contents are known. That places an upper limit on the size of those fields, which is worked around with streaming capability like GetTextReader, as you mentioned, which bypasses buffering the whole thing; that's the non-prefixed variable length I mentioned.
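For reference, the streaming route mentioned here looks roughly like this from the caller's side, reusing the benchmark table above (a usage sketch; error handling omitted):

using var conn = new SqlConnection(ConnectionString);
using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
await conn.OpenAsync();

await using var reader = await cmd.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
await reader.ReadAsync();

// Stream the column instead of buffering the whole value as one string.
using var text = reader.GetTextReader(0);
var chunk = new char[8192];
int read, total = 0;
while ((read = await text.ReadAsync(chunk, 0, chunk.Length)) > 0)
    total += read;   // process each chunk here instead of just counting characters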

Am I understanding you correctly that reader state is manually snapshotted/managed for async operations?

Yes. But the snapshot is present for good reasons, because some things like attn and envchange need to be processed on the real current state even when received in a snapshot context. There's plenty of async closure stuff going on in the library and that might even be part of the problem here.

If I squint a bit I can sort of see how this can be corrected to be a linear operation. It's going to get a bit messy because the snapshot functionality is integrated with a couple of different layers, but there's no doubt it's worth doing. I would like some confirmation that no backtracking is possible when processing a PLP chunk stream, though.

Thanks for the detailed info @Wraith2, it’s indeed quite interesting and raises some questions.

A few immediate comments:

  • As I wrote above, the Core version seems to be a regression compared to the Framework version - at least at some point, according to this SO question, async sequential access was apparently fast. It may be a good idea to run the benchmark above on .NET Framework to confirm, and then to see what's being done differently in the Framework version.
  • At least for byte arrays (blobs), things really should be quite simple. Assuming values are length-prefixed (which seems to be the case according to your description), a byte array of a suitable size should simply be allocated and filled gradually, with no extra memory overhead or needless copying. I understand the current architecture is very different; I'm just stating the obvious about how things should work.
  • Strings, on the other hand, are considerably more complicated as decoding has to take place. In Npgsql, if the string is really large, an extra byte array is allocated for the (byte) length of the string being read. As with byte arrays, it's simply filled gradually and decoded once at the end (see the sketch after this list). This is probably not ideal (at minimum the temporary byte array should be pooled), but I doubt that this API could function much better for long strings (where GetTextReader is better suited). Here's the source code.
  • Am I understanding you correctly that reader state is manually snapshotted/managed for async operations? That would seem to indicate that the compiler async/await feature isn't used in SqlClient? Because the whole point of that feature is to "snapshot" the local variables in the async function and restore them later when the callback completes… I'm just curious.
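To make the byte-array and string bullets above concrete, here is a minimal sketch of the "allocate (or rent) once, fill gradually, decode once at the end" approach, assuming a length-prefixed value arriving over a stream (my illustration, not Npgsql's or SqlClient's actual code):

static async ValueTask<string> ReadLargeStringAsync(
    Stream network, int byteLength, Encoding encoding, CancellationToken ct = default)
{
    // Rent a temporary buffer rather than allocating one per value, as suggested above.
    var buffer = ArrayPool<byte>.Shared.Rent(byteLength);
    try
    {
        var offset = 0;
        while (offset < byteLength)
        {
            var read = await network.ReadAsync(buffer.AsMemory(offset, byteLength - offset), ct);
            if (read == 0) throw new EndOfStreamException();
            offset += read;                               // fill gradually; no re-processing of earlier chunks
        }
        return encoding.GetString(buffer, 0, byteLength); // decode exactly once at the end
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}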

That makes sense for non-sequential, but isn’t the whole point of sequential access to stream large values? In other words,

I can't speak about intent, but I can tell you how the code behaves when that option is set. The most relevant behaviour is that it specifically maintains only the SqlBuffer for the column currently being read, clearing out previous ones as reads are done, so you can't step back. That does align with the purpose of reading large values. I don't remember the de-packetization chunk handling code being particularly specialized.
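As a concrete illustration of the "can't step back" behaviour (this reflects the documented SequentialAccess contract rather than anything internal; assume a command selecting at least two columns):

using var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
reader.Read();

var a = reader.GetFieldValue<string>(0);   // fine: ordinal 0 is the current column
var b = reader.GetFieldValue<string>(1);   // fine: moving forward releases the buffer for ordinal 0
// reader.GetFieldValue<string>(0);        // would throw: earlier ordinals can no longer be read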

I’m setting up for some benching and I’ll see what I can find and if there’s any room for improvement.