SqlClient: The MARS TDS header contained errors using ASP.NET Core and EF Core connecting to Azure SQL Server

Copied from here:

I have an ASP.NET Core app [Microsoft.AspNetCore.App 2.1.4] with EF Core 2.1.4 [DbContext pooling enabled] and data stored on an Azure SQL database.

Occasionally [once in 1-2 days] I get an unrecoverable error stating: System.Data.SqlClient.SqlException (0x80131904): The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors.

Once I restart the app, everything resumes working until it happens again.

Probably important to note is that I don't have MultipleActiveResultSets enabled in my connection string at all, which makes the issue even more strange.

Has anyone else noticed something similar? Are there any ways I can trace the problem?

Stack trace:

System.Data.SqlClient.SqlException (0x80131904): The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
   at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at Microsoft.EntityFrameworkCore.Storage.Internal.RelationalCommand.Execute(IRelationalConnection connection, DbCommandMethod executeMethod, IReadOnlyDictionary`2 parameterValues)
   at Microsoft.EntityFrameworkCore.Storage.Internal.RelationalCommand.ExecuteReader(IRelationalConnection connection, IReadOnlyDictionary`2 parameterValues)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.Enumerator.BufferlessMoveNext(DbContext _, Boolean buffer)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementation[TState,TResult](Func`3 operation, Func`3 verifySucceeded, TState state)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.Enumerator.MoveNext()
   at Microsoft.EntityFrameworkCore.Query.Internal.LinqOperatorProvider.ExceptionInterceptor`1.EnumeratorExceptionInterceptor.MoveNext()
   at System.Collections.Generic.List`1.AddEnumerable(IEnumerable`1 enumerable)
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Inviton.Web.Portal.Controllers.EventListController.GetEvents(BL bl, Int32 venueId, Int32 promoterId) 

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 20
  • Comments: 221 (95 by maintainers)

Most upvoted comments

These errors have completely disappeared for us after upgrading to EF Core 3.1/Microsoft.Data.SqlClient 1.1.0.

@mshenoy83 Sorry it took a bit longer; here goes my dirty hack. To use it, please check the "Custom execution strategy" section of this documentation: https://docs.microsoft.com/en-us/ef/core/miscellaneous/connection-resiliency

class CustomExecutionStrategy:

using Brulasoft.Inviton.Data.Internal.SQL;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage;
using System;
using System.Data.SqlClient;

namespace Brulasoft.Inviton.Data.Internal
{
    /// <summary>
    /// Custom SQL execution
    /// </summary>
    internal class CustomExecutionStrategy : SqlServerRetryingExecutionStrategy
    {
        public CustomExecutionStrategy(ExecutionStrategyDependencies dependencies) : base(dependencies)
        {
        }

        protected override bool ShouldRetryOn(Exception exception)
        {
            // Defer to the built-in list of transient SQL Server errors first.
            if (base.ShouldRetryOn(exception))
            {
                return true;
            }

            // For the MARS TDS header error, try to recover the underlying connection
            // (close, clear the pool, reopen) before telling EF to retry.
            if (exception is SqlException)
            {
                if (SqlRecoveryAttemptor.ShouldTryRecover(exception, ExceptionsEncountered.Count) && Dependencies != null && Dependencies.CurrentDbContext != null)
                {
                    var context = Dependencies.CurrentDbContext.Context;
                    if (context != null)
                    {
                        var sqlConn = context.Database.GetDbConnection() as SqlConnection;
                        if (sqlConn != null)
                        {
                            SqlRecoveryAttemptor.TryRecover(sqlConn, ExceptionsEncountered.Count);
                            return true;
                        }
                    }
                }
            }

            return false;
        }
    }
}
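
For completeness, here is a minimal sketch (not part of the original post) of how such a strategy is registered, following the "Custom execution strategy" section of the docs linked above; the PortalDbContext type and connection string are hypothetical placeholders:

using Microsoft.EntityFrameworkCore;

namespace Brulasoft.Inviton.Data.Internal
{
    // Sketch only: a placeholder DbContext showing where the strategy plugs in.
    internal class PortalDbContext : DbContext
    {
        private const string ConnectionString = "<your Azure SQL connection string>";

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
            => optionsBuilder.UseSqlServer(
                ConnectionString,
                sqlOptions => sqlOptions.ExecutionStrategy(
                    dependencies => new CustomExecutionStrategy(dependencies)));
    }
}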

class SqlRecoveryAttemptor

using System;
using System.Data.Common;
using System.Data.SqlClient;

namespace Brulasoft.Inviton.Data.Internal.SQL
{
    public class SqlRecoveryAttemptor
    {
        public const int RetryCount = 4;


        public static bool ShouldTryRecover(Exception ex, int retryAttemptsCount)
        {
            return retryAttemptsCount < RetryCount && IsMARSError(ex);
        }

        public static bool IsMARSError(Exception ex)
        {
            return ex.Message.Contains("MARS TDS header");
        }

        public static void TryRecover(SqlConnection sqlConn, int retryAttemptsCount)
        {
            // Back off a little once the first couple of recovery attempts have failed.
            if (retryAttemptsCount > 2)
            {
                System.Threading.Thread.Sleep(1500);
            }

            // Close the suspect connection, clear its pool and reopen so the retried
            // operation gets a fresh physical connection rather than a poisoned pooled one.
            if (sqlConn.State != System.Data.ConnectionState.Closed)
            {
                TryClose(sqlConn);
                TryClearPool(sqlConn);
                TryOpen(sqlConn);
            }
            else
            {
                TryOpen(sqlConn);
                TryClose(sqlConn);
                TryClearPool(sqlConn);
                TryOpen(sqlConn);
                TryClose(sqlConn);
            }
        }

        private static void TryClearPool(SqlConnection conn)
        {
            try
            {
                SqlConnection.ClearPool(conn);
            }
            catch (Exception)
            {
                // best effort: ignore and let the retry proceed
            }
        }

        private static void TryClose(DbConnection conn)
        {
            try
            {
                conn.Close();
            }
            catch (Exception)
            {
                // best effort: ignore and let the retry proceed
            }
        }

        private static void TryOpen(DbConnection conn)
        {
            try
            {
                conn.Open();
            }
            catch (Exception)
            {
                // best effort: ignore and let the retry proceed
            }
        }
    }
}

We started observing this problem in production sporadically. After digging into our telemetry, we noticed that the exceptions listed ☝️ were always preceded by exceptions in other requests (that likely shared the same SQL connection pool) indicating that they had called .Cancel() on their SqlCommand in response to their request being canceled. After some experimentation we were able to create a sample console application that appears to reproduce the problem very consistently. The example application can be found here: https://github.com/boxofyellow/MarsHeader

The cause of the problem is in the test ToListAsync_can_be_canceled: it breaks something which then shows up the next time you run a query. The Average_on_nav_subquery_in_projection test is just a victim because it happens to be next to the other test in definition order, and presumably the test runner walks backwards in vtable order, not that it matters.

A single breaking test is this:

[ConditionalFact]
public async Task BreakTDSStream()
{
    for (var i = 0; i < 10; i++)
    {
        // without the fix, this usually throws within two or three iterations

        using (var context = CreateContext())
        {
            var tokenSource = new CancellationTokenSource();
            var query = context.Employees.AsNoTracking().ToListAsync(tokenSource.Token);
            tokenSource.Cancel();
            try
            {
                await query;
            }
            catch (Exception)
            {
            }
        }
    }
    using (var context = CreateContext())
    {
        var results2
            = await context.Customers.Select(
                    c => new { Ave = c.Orders.Average(o => o.OrderID) })
                .ToListAsync();
    }
}

It seems pretty likely that the cancellation is causing a problem somehow, and the query outside the loop then pulls an invalid connection from the pool and tries to use it.

Observations:

  • If you try to run ExecuteNonQueryAsync with a cancelled token, it doesn't even go as far as checking whether there's a valid open connection (I forgot a connection string in a test project and was surprised it didn't throw), so I think it's unlikely to be the read; I think it's OpenAsync with a cancelled token, but I can't find where the connection is actually created/opened in Entity Framework. (A bare ADO.NET sketch of the suspected cancel-then-reuse pattern follows this list.)
  • It is speed/timing related, because if you attempt to step through it, it sometimes won't occur. This suggests to me that an unfinished connection is somehow returned to the pool.
  • We're absolutely going to need the EF team's help in removing all these layers of abstraction to pare it down to a sequence of operations directly on SqlClient. Async lifecycle management is bad enough without all the extra baggage on top of it.
  • I believe the query being run is irrelevant to the problem.
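
As a rough illustration of that suspicion, here is a minimal ADO.NET-only sketch (not the linked repro itself; the connection string and Northwind table names are assumptions) of the cancel-then-reuse pattern:

using System;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

class CancelThenReuseSketch
{
    // Hypothetical connection string; note MARS is NOT enabled, matching the original report.
    const string ConnStr = "Server=localhost;Database=Northwind;Integrated Security=true";

    static async Task Main()
    {
        for (var i = 0; i < 10; i++)
        {
            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand("SELECT * FROM Employees", conn))
            using (var cts = new CancellationTokenSource())
            {
                await conn.OpenAsync();
                var reading = cmd.ExecuteReaderAsync(cts.Token);
                cts.Cancel();                        // cancel while the query is in flight
                try { (await reading).Dispose(); }   // expected to throw; swallowed for the repro
                catch (Exception) { }
            }   // the (possibly corrupted) connection is returned to the pool here
        }

        // A later query picks up a pooled connection; in affected driver versions this is
        // roughly where "The MARS TDS header contained errors" can surface.
        using (var conn = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand("SELECT COUNT(*) FROM Customers", conn))
        {
            conn.Open();
            Console.WriteLine(cmd.ExecuteScalar());
        }
    }
}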

I've been digging into this and wanted to summarize my findings so far, in case someone else is looking, too.

The error occurs because _nonTransactedOpenResultCount in TdsParser.cs goes below 0 in certain situations. _nonTransactedOpenResultCount gets sent in the MARS header data and values < 0 are not valid.

This error was introduced to System.Data.SqlClient on .NET Core in this PR: https://github.com/dotnet/corefx/pull/26595, which removed stateObj._syncOverAsync = true; from TryProcessDone() in TdsParser.cs.

The issue also exists against .NET Framework if you set this context switch to false (it’s true by default): Switch.Microsoft.Data.SqlClient.MakeReadAsyncBlocking
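
For anyone experimenting on .NET Framework, a compatibility switch like that is normally flipped at process startup via AppContext; a minimal sketch, using the switch name quoted above:

// Sketch: turn the blocking-read compatibility switch off before any SqlClient work,
// which reproduces the non-blocking async read path described above.
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.MakeReadAsyncBlocking", false);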

The comment associated with the line above ("// Can't retry TryProcessDone") tells me that TryProcessDone might be completing more than once when it should not be, since the end of that method is where _nonTransactedOpenResultCount gets decremented.

We could simply revert the change from the above PR, but that would reintroduce what it was trying to fix. We could also make sure _nonTransactedOpenResultCount simply never gets decremented below 0, but that doesn't seem right offhand. Unless we can figure out a better solution, we may just have to revert the PR.

Repeatedly running the sample application overnight (about 1,500 runs) with the package from #490 did not yield any MARS TDS header exceptions.

@Wraith2 do you know if this is still an issue on .NET Core 3.0? Will upgrading solve the problem?

It is. I just finished upgrading our entire codebase to .NET Core 3 and EF Core 3 and pushed it out to QA and, lo and behold, got the error in the new Microsoft.Data.SqlClient. I was rather disappointed.

EDIT: After a few days of observing the EF Core 3/Microsoft.Data.SqlClient changes on our QA systems, we're seeing this error a lot, averaging twice a day so far. Previously, we never saw it on the QA system (because load was so low, I presume) and only saw it in production maybe once every 2 months or so. The only saving grace with the new code is that it seems to be able to self-recover without having to restart the app/reboot.

To give some more details about our environment:

  • This is a single console app targeting netcoreapp3.0, running on Windows Server 2019, that makes heavy use of async/await to run many "worker" type jobs, i.e. grabbing messages from a message queue and processing them in parallel, which often involves work against our Azure SQL DB instance.
  • So far we have not seen this happen on our web heads, which are running IIS on Windows Server 2019 proxying requests to an ASP.NET Core 3 app.

EDIT 2: After deploying the EF Core 3 upgrade to production, we started getting tens of thousands of these exceptions per hour. Luckily, it was contained to our worker tier, which is pretty resilient, so we didn't have to do much to mitigate; it just made our monitoring a mess. I had to deploy @brunolau's wonderful solution and now we're only seeing tens per hour before it resets the connection pool.

It’s really curious that this became a much bigger problem with Microsoft.Data.SqlClient and EF Core 3.

We confirm that the upgrade to System.Data.SqlClient 4.8.2 solves (for us at least) the TDS/MARS errors.

We have considered backporting this fix to v1.1.x as well due to its high demand; this version will be available sooner than GA v2.0.

Sounds good to me, but to be fair, it may not be completely gone. This is only 5 days of logs now where we haven’t seen the issue, whereas it was occurring thousands of times per hour on 3.0/1.0, so definitely a positive sign.

However, when we were on EF Core 2.2 and the old System.Data.SqlClient, we saw this issue maybe once every month or two. So that may still be the case; who knows, only time will tell.

Yes. That test alone reproduced the TDS error we're looking for 100% of the time against my local SQL instance.

This repository only provides updates to Microsoft.Data.SqlClient; however, System.Data.SqlClient was also updated with this fix in v4.8.2. Please use v4.8.2+ to get this fix.

@Wraith2 @cheenamalhotra

Here you go… https://github.com/KondzioSSJ4/Repro-for-SQLClient-Issue-85 A sample of how to reproduce this bug (The incoming tabular data stream (TDS) protocol stream is incorrect. The TDS headers contained errors.) in a test; after 100 tries I get 51 failures.

P.S. I ran it on Windows 10 Pro v. 1909

var cancellationTokenSource = new CancellationTokenSource(timeout);
cancellationTokenSource.CancelAfter(timeout);

I don't think you should need to do both of those things. You initialize the CTS with a timeout and then set that timeout again; that's just going to set two timers as far as I know.
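
A quick sketch of the simplification being suggested; either form on its own arms a single cancellation timer:

// Either of these alone is sufficient; doing both just schedules two timers.
var timeout = TimeSpan.FromSeconds(30);

var viaConstructor = new CancellationTokenSource(timeout);   // timer armed in the constructor

var viaCancelAfter = new CancellationTokenSource();
viaCancelAfter.CancelAfter(timeout);                         // timer armed explicitly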

If you can produce that as an isolated repro against the sample Northwind database it'd be helpful; as you can imagine, it's hard to identify behaviour based on only a stack trace.

If you encounter this problem you can end up with a connection pool that contains only poisoned connections, which effectively prevents you from connecting to the database again. The way to triage that is to clear the pool; there is sample code provided further up in this thread which does that, and a minimal sketch follows below. The nuclear option to clear the pool is to tear down the entire client process.

To be clear, it doesn't require a SQL instance restart; the error is entirely in the client protocol handling and does not cause a problem on the server side.
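
A minimal sketch of that triage step (SqlConnection.ClearPool and ClearAllPools are standard SqlClient APIs; the helper name here is made up):

using System.Data.SqlClient;

internal static class PoolTriage
{
    // Sketch: after catching the MARS header SqlException, drop the poisoned pool
    // so subsequent opens create fresh physical connections.
    public static void Reset(SqlConnection failedConnection)
    {
        failedConnection.Close();                    // hand the broken connection back
        SqlConnection.ClearPool(failedConnection);   // discard pooled connections for this connection string
        // SqlConnection.ClearAllPools();            // heavier fallback: clear every pool in the process
    }
}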

@cheenamalhotra After the release, please let me know; I've been longing to use it. Thanks.

I completed running all the EF Core tests on the "release/3.0" tag (including the BreakTDSStream test mentioned above) with a direct project import of Microsoft.Data.SqlClient (netcore), and all tests pass with the PR #490 changes; I also confirmed the failure with the "master" branch.

We will proceed with the PR merge for the next release!

@ajcvickers any chance of getting someone on the EF side to see if they can assist with disentangling what EF is doing in the repro we've got so far, so we can investigate this as the driver issue it is?

While debugging this issue, I just happened to change my branch today to @Wraith2’s PR https://github.com/dotnet/SqlClient/pull/248 and the PR seems to fix this issue. Reverting my branch to master brings back failures.

The use case that fails gets its trigger from concurrent runs of CancellationTokenSource().Cancel(), where Attention signals are sent to SQL Server, and the PR does seem to work on the same paths by locking Send() to avoid race conditions.

The PR still needs some work, as some test cases seem to fail, but it also seems to be making a big difference toward fixing this issue.

@Wraith2 would you like to give it a try and confirm the same? As mentioned above, this error is reproducible with the test posted above and the release/3.0 branch of EF Core. I'd also appreciate it if you could address the pipeline failures!

No problem!

@ItWorksOnMyMachine

Please upgrade the System.Data.SqlClient driver referenced in your application to v4.8.2, which contains this fix (for .NET Core).

The package referenced by EF Core has not been updated to the latest Microsoft.Data.SqlClient version, hence you will see the issue. There's an open PR to address this: https://github.com/dotnet/efcore/pull/20377, currently on hold.

You can reference this NuGet package directly in the meantime!

Even if it's backported to System, it's a good idea to look into moving to the Microsoft version, as the System version will only be updated for severe reasons while the version in this repo will get perf updates, features, etc.

Put the file in a folder and then add a NuGet config file with a package source pointing at that folder: https://docs.microsoft.com/en-us/nuget/reference/nuget-config-file#package-source-sections

Actually, what is the best option to avoid this blocking issue? Because we're really in a bad situation here…

Thanks @boxofyellow

We’ve been eagerly waiting for a repro to address this issue, I’ll give this a try soon!

Reporting back that since updating to EF Core 3.1 and SqlClient 1.1, we haven't had any more occurrences either. But it's still probably worth investigating, to confirm that it is fixed and not just less likely to happen, with the risk of regressing with future unrelated changes.

@cheenamalhotra I agree with @mscrivo. Closing this seems premature. Not observing it for a few days is not the same as proving that the issue is resolved (especially because the issue is intermittent and hard to reproduce). When we were on SqlClient 1.0.0, we didn't observe it for 3 months until it suddenly became an issue.

@AfsanehR, @David-Engel we are now hitting this exception with an equivalent stack trace with Microsoft.Data.SqlClient while running our EF Core tests: https://github.com/aspnet/EntityFrameworkCore/issues/16218.

It is hard to isolate a repro (e.g. my understanding is that it only occurs in a complete test run, but not if we run specific tests individually) but it is possible that what we have is enough to help make progress with the investigation.

Any chance that this can be prioritized for 1.0? It seems many customers have reported it already.

cc @smitpatel, @ajcvickers

This issue started occurring for us on 8/12, which is well after these fixes were backported to previous versions. It started occurring in a current release that wasn't previously having the issue. It appears the company did VMware upgrades; could that be causing this?

Hi, we are facing the same issue. I have a .NET Core 2.1 application using Dapper 2.0.30 as the ORM and a SQL database to store the data. My code is running on Windows servers; occasionally (once a week) we get a System.Data.SqlClient.SqlException with the message "The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors", and it gets auto-resolved without restarting the servers. Please let me know if you need any additional information.

@cheenamalhotra Rockstar! Thank you, this solved it!

@lioa123

Is it more probable that this issue will appear during heavy load on the application server, which performs an unusual amount of DB calls?

No, this is not related to heavy workload, as it was reproducible with a test case from the EF Core tests, which was later adopted in the SqlClient tests too.

Does this error occur only when using EF, or can it also occur when using ADO.NET?

It was reproducible with plain ADO.NET/SqlClient as well, so it can occur in any application, irrespective of whether EF Core is in use.

The fix applies to all networking strategies, native and managed, equally.

This sounds great. Will the fix be backported to https://www.nuget.org/packages/System.Data.SqlClient? Otherwise, I can look into switching over to Microsoft.Data.SqlClient.

Oh my, @Wraith2, you are right; the missing 'no' makes all the difference… I apologize for the confusion. So, to be clear: 2,500 clean runs.

@cheenamalhotra I ran a 300 rpm load test for 30 minutes and didn't find any errors. It's great.

@jaredhshuai @michaeltnguyen @ErikEJ and everyone else following this thread:

Thanks for all the support you've provided in diagnosing and getting the fix ready for this issue. We released v1.1.2 today, which contains the fix; please give it a try and let us know whether it fixed your problem by upvoting!

We will keep the issue open for accepting feedback!

The fact that it's working with the master branch and not this one is confusing. However, I did make it work in the past. I'll give it a try!

We shall consider backporting it to System.Data.SqlClient (netcoreapp3.1) if it proves to be promising and we gather confirmation from users from different domains.

cc @David-Engel for final confirmation.

I pulled down the NuGet package listed in the PR ☝️ and it looks like it solved the problem for the Microsoft package. I'll run a batch of runs overnight to see if anything else shakes out. Is there any chance we can expect to see a similar fix for the System package? We might take this as the catalyst to switch over to the Microsoft package, but that would dramatically slow down rolling the change out into production.

@clement911 You can use my nasty execution strategy workaround mentioned in the comments above. It still does its job even in EF Core 3 (with some slight modifications).

I ran @Wraith2's test in #85 on the master branch and it no longer reproduces this issue. Unfortunately, it appears that the issue is still happening in the wild, so we're back at square one. =/

That’s scary!

+1 Perhaps we should continue this conversation in private until we fully understand the impact of this bug…

<strike> We had the same problem as well. We recorded thousands of these errors with no real solution. To point out, we aren't on Azure SQL Server and we aren't using EF. As it turns out, the cause of our problem was actually the web server.

We were able to correlate this to traffic because around 10:30 - 11:00 am (on random days, omitting weekends because our traffic dramatically drops off during the weekend) we'd get these errors. As it turns out, we reviewed our GA reports and compared them to our error-handling data and, lo and behold, they matched fairly well (not perfectly though).

Our production web server was running with about 8 GB of RAM, but this wasn't enough to compensate for the amount of traffic we were getting.

We doubled the memory and, boom, all the issues went away; no more MARS errors.

Again, this may not have any relation to anyone who's posted so far, but I found this thread via a Google search so it might help someone who's going through this right now.

Hope it helps someone,

Cheers. </strike>

See new post for clarification.

We ran into this issue using Entity Framework Core 2.2.4. Also using Azure SQL.

Our code that produced this problem was roughly as follows:

var queryable = dbContext.Set<SomeEntity>()
  .Where(e => e.SomeBit == false);

if (command.Filter != null)
    queryable = queryable.Where(command.Filter);

var subQuery = queryable
  .GroupBy(e => e.ExternalReference, (key, elements) => new
  {
    ExternalReference = key,
    Created = elements.Max(e => e.Created)
  })
  .OrderByDescending(q => q.Created)
  .Skip(skip)
  .Take(query.Size);

queryable = dbContext.Set<SomeEntity>()
  .Join(subQuery,
    e => new { e.ExternalReference, e.Created },
    e => new { e.ExternalReference, e.Created },
    (e, _) => e)
  .OrderByDescending(q => q.Created);
  
var result = await queryable.ToListAsync();

The query that is produced by EF Core is:

SELECT [r].[Id], [r].[Created], [r].[Deleted], [r].[ExternalReference], [r].[SomeId]
FROM [SomeEntity] AS [r]
INNER JOIN (
    SELECT [r0].[ExternalReference], MAX([r0].[Created]) AS [Created]
    FROM [SomeEntity] AS [r0]
    WHERE ([r0].[SomeBit] = 0) AND ([r0].[SomeId] = @__someIdId_0)
    GROUP BY [r0].[ExternalReference]
    ORDER BY [Created]
    OFFSET @__p_1 ROWS FETCH NEXT @__p_2 ROWS ONLY
) AS [t] ON ([r].[ExternalReference] = [t].[ExternalReference]) AND ([r].[Created] = [t].[Created])

The idea here is that we take the newest item for every external reference and page these results.

This works without any problems UNTIL I hit the last, incomplete page (fetch 20, get 19). Then I either get a query timeout or the "MARS TDS header contained errors" error.

The solution somewhat puzzled me, but it works…

Where we join our DbSet with the subquery, we get the DbSet again:

...
queryable = dbContext.Set<SomeEntity>()
  .Join(subQuery,
    e => new { e.ExternalReference, e.Created },
...

When we reuse the DbSet that is acquired on the first line, this works without any issues:

...
queryable = queryable
  .Join(subQuery,
    e => new { e.ExternalReference, e.Created },
...

In the end this is now our working code:

var baseQueryable = dbContext.Set<SomeEntity>();
var queryable = baseQueryable.Where(e => e.SomeBit == false);

if (command.Filter != null)
    queryable = queryable.Where(command.Filter);

var subQuery = queryable
  .GroupBy(e => e.ExternalReference, (key, elements) => new
  {
    ExternalReference = key,
    Created = elements.Max(e => e.Created)
  })
  .OrderByDescending(q => q.Created)
  .Skip(skip)
  .Take(query.Size);

queryable = baseQueryable
  .Join(subQuery,
    e => new { e.ExternalReference, e.Created },
    e => new { e.ExternalReference, e.Created },
    (e, _) => e)
  .OrderByDescending(q => q.Created);
  
var result = await queryable.ToListAsync();

What I find weird here is that the same query is produced and that it only causes problems on the last page of data.

Hope this helps in finding the problem.

I managed to temporarily auto-mitigate the issue by creating a custom Entity Framework execution strategy that, once it encounters the MARS error, closes the connection, clears the pool and reopens it. Since I introduced this workaround, I've been running error-free for 3 days.

To sum it up: once the error is encountered, closing and reopening the erroring connection plus clearing the connection pool seems to re-establish working order.