SqlClient: Intermittent Unknown error 258 with no obvious cause

Describe the bug

On occasion we see the following error:

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteReader(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)

However, SQL Server shows no long-running queries and is not using much of its resources when this happens.

It looks more like an intermittent connection issue, but we're unable to find any sort of root cause.

To reproduce

We're not sure of the reproduction steps; I've been unable to reproduce this myself by simulating load. From what we can tell, it is more likely to happen when the pod is busy (not just with HTTP traffic, but also handling events from an external source), yet it can equally happen when nothing much is running on the pod, which has caused us a substantial amount of confusion.

Expected behavior

Either more information on what the cause might be, or some solution to the issue. I realise the driver might not actually know the cause and that, from its point of view, it may genuinely be a timeout. We're not entirely sure where the problem lies yet, which is the biggest issue.

Further technical details

  • Microsoft.Data.SqlClient version: 3.0.1
  • .NET target: Core 3.1
  • SQL Server version: Microsoft SQL Azure (RTM) - 12.0.2000.8
  • Operating system: Docker Container - mcr.microsoft.com/dotnet/aspnet:3.1

Additional context

  • Running in AKS, against Elastic Pools.
  • SQL Server shows no long running queries
  • We sometimes get a TimeoutEvent from the metrics that are collected from the pool. When we do get one, the error_state differs each time.
    • For example, we had one this morning with error_state 145. We don't know what this means and can find no information on what these codes relate to. I've raised a ticket with the Azure Docs team to look at this. I'll add more here as they happen; we haven't been keeping track of the error_state codes because we're not sure they're even relevant.
  • This might be related to this ticket - https://github.com/dotnet/SqlClient/issues/647
    • However, we don't see ReadSniSyncOverAsync.
  • We do have event counter metrics being exported to Prometheus but have found no obvious indicators that something is wrong

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 14
  • Comments: 60 (12 by maintainers)

Most upvoted comments

Hello, Azure Support just told us that the product group has identified the issue and will fix it in Q4 2022.

Regards,

We are also experiencing this issue with Linux App Services against Azure SQL. We only began seeing this after migrating our App Services from Windows to Linux.

@dazinator we only tested containers on different Linux host systems to see whether it's related to a certain kernel version, but it seems like it's not. As mentioned, we stopped testing after the issue disappeared once we migrated the SQL Server to a newer version. Just wanted to share our experience in case it helps others somehow.

Azure Support just told us that the product group has identified the issue and will fix it in Q4 2022.

@DLS201 any chance you have more information on this? Maybe a link to a bug or something? We are investigating this issue currently and would like to know what MSFT knows…

@SharePointX have you tried increasing the connection timeout to make sure the command doesn't time out while retrieving data from the server, and checked how that affects execution?
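In case it helps, here is a minimal sketch of where the two separate timeouts live in Microsoft.Data.SqlClient; the connection string, database, and query below are placeholders, not values from this thread:

using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class TimeoutSettingsSketch
{
    public static async Task RunAsync()
    {
        // "Connect Timeout" in the connection string only bounds opening the connection.
        var connectionString =
            "Server=tcp:example.database.windows.net,1433;Database=MyDb;Connect Timeout=30;";

        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        // CommandTimeout (default 30 seconds) bounds query execution; exceeding it is
        // what surfaces as "Execution Timeout Expired" / Win32 error 258.
        using var command = new SqlCommand("SELECT COUNT(*) FROM dbo.Orders", connection);
        command.CommandTimeout = 60;

        var rowCount = (int)await command.ExecuteScalarAsync();
    }
}

Raising either value only helps if the query genuinely needs more time; it won't address the intermittent cases described above where the server shows no long-running work.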

I'd suggest retrieving data in smaller batches if it fails with large portions, to find the right balance. If the server is temporarily unavailable (especially over the network), I'd recommend implementing a retry pattern at the application level for the remaining data.
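A rough sketch of that application-level retry; the attempt count, back-off delay, and the timeout check are assumptions on my part, and production code would more likely use a library such as Polly:

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class RetrySketch
{
    public static async Task<object> QueryWithRetryAsync(string connectionString, string sql)
    {
        const int maxAttempts = 3;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                using var connection = new SqlConnection(connectionString);
                await connection.OpenAsync();
                using var command = new SqlCommand(sql, connection);
                return await command.ExecuteScalarAsync();
            }
            // SqlException.Number == -2 is the ADO.NET client-side timeout behind error 258.
            catch (SqlException ex) when (ex.Number == -2 && attempt < maxAttempts)
            {
                // Simple linear back-off before retrying the remaining work.
                await Task.Delay(TimeSpan.FromSeconds(2 * attempt));
            }
        }
    }
}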

There are some thread starvation scenarios involving the thread pool and the connection pool, which I believe don't apply to your case with a single-threaded app. I don't have any ideas about warming up the database; you may want to ask the SQL Server dev team.

We've had this same issue multiple times. We had a ticket open with Microsoft on the SQL side for weeks; in the end they could not find any long-running queries, although Application Insights reported some queries taking over 2 minutes, which should not even be possible because our .NET SQL command timeout was set to 30 seconds in most cases and never more than 1 minute. So they ruled out any SQL issue.

We had a separate ticket open with the App Service team (we run on App Service with Linux Docker containers), and it completely stumped them as well: no issues with the thread pool, networking, etc. In the end I have had to scale up to many more instances than should be necessary, and the issue has not recurred, but each instance can only handle about 6 requests per second, which is very low given that the average API response time is under 500 ms. So we have no idea of the root cause.

When the issue occurred, our 24 vCore provisioned Azure SQL database reported only about 10% usage, so we were nowhere near the database resource limits.

@dazinator SQL Server does report on long-running queries, which is why we assumed the error was happening at the application level. I no longer have the information to hand, having left the company, but given these were waits of 30+ seconds, they would have shown up in the information tables SQL Server provides.

We spent a considerable amount of time looking into the causes, and it was happening in multiple services for different queries and in different environments, some of which had little to no data that would even cause a 30+ second slowdown.

I understand the need to be thorough but we had run through a lot of the steps you have outlined. I just don’t want this ticket to get lost in the ether because people assume it’s laziness on our part.

To continue the conversation about this issue being related to Linux, and to hopefully narrow in on an event: our Azure Application Insights logs show that these random timeouts AND application performance issues all started after the Azure East US maintenance window QK0S-TC8 did SOMETHING to the Azure Functions host on August 17th at 9:00 PM EST. Something in that host update caused this Unknown error 258 to start appearing.

At that point in time, our host went from Linux 5.10.177.1-1.cm1 to Linux 5.15.116.1-1.cm2, application performance tanked shortly after, and we now see the sudden appearance of these "Unknown error 258" exceptions. Some metric digging shows that memory usage (P3V3 with 32 GB) tanked along with performance.

[screenshot: memory usage and performance metrics]

No code or DB changes on our part; suddenly the App Insights logs show kernel 5.15, odd timeout errors, and we presumably can't access memory. @eisenwinter - you said you tried a few different kernels; did you go back to 5.10? Is anyone else seeing this issue on 5.10 or lower?

Update: We converted the function app over to Windows. The sql timeout errors are gone and application performance is restored! 🎉

Just want to mention that we experienced a similar issue and converting to Windows seems to have solved the problem.

Yes, we tried upgrading to .NET 7 and the latest SNI package release (5.1.1). It didn't fix it. The only thing that has made any difference is the suggestion to bump the min thread count. We bumped it to 20 and now very rarely see the issue (maybe once a day or so).

Nothing seems to eliminate it so far but the thread count workaround does help.
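For reference, a sketch of how that min-thread-count bump might be applied at startup; the value 20 comes from the comment above, everything else here is an assumption:

using System;
using System.Threading;

static class ThreadPoolFloorSketch
{
    public static void Apply()
    {
        ThreadPool.GetMinThreads(out var workerMin, out var iocpMin);

        // Raise only the worker-thread floor so the pool does not have to ramp up under
        // a burst; keep the existing I/O completion port minimum unchanged.
        ThreadPool.SetMinThreads(Math.Max(workerMin, 20), iocpMin);
    }
}

Depending on the host, the same minimum can also be set through runtime configuration rather than in code.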

Turns out our issue was simply a genuine long-running query (built dynamically from filters the user selects on the front end) hitting the timeout with certain filter selections. We had assumed (that magic word) the particular query involved (which runs sporadically) was fine because similar (but not identical) queries were fine. When we intercepted the T-SQL of a query that hit the timeout and re-ran it exactly as-is, we replicated the timeout and the perf issue. Slightly embarrassing, but there you go: make sure you log the exception, intercept the exact T-SQL, and re-run it to confirm it's not just a genuine query perf issue 😉
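A hypothetical helper along those lines (the wrapper, the logger usage, and the timeout check are assumptions, not code from this thread), assuming Microsoft.Extensions.Logging is available:

using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Logging;

static class TimeoutCaptureSketch
{
    public static async Task<int> ExecuteLoggedAsync(SqlCommand command, ILogger logger)
    {
        try
        {
            return await command.ExecuteNonQueryAsync();
        }
        catch (SqlException ex) when (ex.Number == -2) // client-side timeout
        {
            // Record the statement and parameter values so the exact query can be
            // replayed by hand and checked for a genuine performance problem.
            logger.LogError(ex, "Command timed out: {Sql}", command.CommandText);
            foreach (SqlParameter p in command.Parameters)
            {
                logger.LogError("Parameter {Name} = {Value}", p.ParameterName, p.Value);
            }
            throw;
        }
    }
}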

You mean like:

  • Microsoft.Data.SqlClient v4.1.1
  • .NET 6
  • Docker aspnet:6.0-bullseye-slim
  • Host image: AKSUbuntu-1804gen2containerd-2022.08.29