npgsql: Intermittent "Unknown message code: X" exceptions. affects 5.X and 6.x

We’ve been getting an intermittent exception “Unkown message code: X” where X ranges from 0-255. This was happening before we upgraded Npgsql.EntityFrameworkCore.PostgreSQL to 6.0.3 and is still continuing now. There doesn’t seem to be a pattern as to what causes it, it comes from various queries across our application.

Version info

PostgreSQL 13.4
Multiplexing=true
"Npgsql.EntityFrameworkCore.PostgreSQL" Version="6.0.3"
"Npgsql.EntityFrameworkCore.PostgreSQL.NetTopologySuite" Version="6.0.3"
"Npgsql.EntityFrameworkCore.PostgreSQL.NodaTime" Version="6.0.3"

Example stack trace

Npgsql.NpgsqlException (0x80004005): Unknown message code: 105
   at Npgsql.Util.PGUtil.ValidateBackendMessageCode(BackendMessageCode code)
   at Npgsql.Internal.NpgsqlConnector.ReadMessage(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   ...(tracing)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken)
   ...(tracing)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass33_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
   at Microsoft.EntityFrameworkCore.EntityFrameworkQueryableExtensions.ToListAsync[TSource](IQueryable`1 source, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.EntityFrameworkQueryableExtensions.ToListAsync[TSource](IQueryable`1 source, CancellationToken cancellationToken)
   ...(API controller)
   at lambda_method4720(Closure , Object )
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.AwaitableObjectResultExecutor.Execute(IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Logged|12_1(ControllerActionInvoker invoker)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeInnerFilterAsync>g__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeNextExceptionFilterAsync>g__Awaited|26_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)

Sample of error codes:

image

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 21 (9 by maintainers)

Commits related to this issue

Most upvoted comments

@tomwollert @mintsoft @vonzshik

I’ve had a look at this issue using TomWollert’s reproduction code above and it seems that the issue stems from the reuse of a Sslstream after an IOException has been thrown on the Read() call. Reading the MSDN docs it states that;

SslStream assumes that a timeout along with any other IOException when one is thrown from the inner stream will be treated as fatal by its caller. Reusing a SslStream instance after a timeout will return garbage. An application should Close the SslStream and throw an exception in these cases.

I believe the read buffer data TomWollert shows above [23,3,3,1,141,…] Is the garbage data referred to in the MSDN article. This also lines up with the versions mentioned above as the query cancellation code added between versions 4.1.10 to 5.0.0 cancels the ongoing query and attempts to reuse the underlying stream in both Insecure And Secure Connections.

This can be seen in NpgsqlReadBuffer.cs where we continue back around the loop to attempt to read more data from the stream.

Update: I’ve done some more testing around this and it seem that this issue only effects .Net Framework applications, as .Net5 , .Net6 and .Net Core 3 deal with the IOException gracefully on the SslStream, Even thought the MSDN docs have the same message on all these versions of .Net .

@mintsoft @tomwollert sorry guys, was busy with personal stuff.

@AlisterGreg Yeah, you’re absolutely correct - the issue definitely was due to us attempting to save a connection after a timeout. This also explains why I haven’t ever been able to reproduce this error, since it can only reproduced if:

  • You use a secure connection
  • You use .NET Framework
  • You send synchronous queries

Just some context for the error. It can be only reproduce with Npgsql 5.0 and above because that’s when we’ve introduced cancellation (in a sense that we attempt to cancel a query after a timeout). I can guess that it can’t be reproduced on .NET Core 3 because at that point the runtime started supporting socket async operation cancellation (on .NET Framework you can’t cancel Socket.RecieveAsync because there is no overload accepting a CancellationToken), and to support the cancellation SslStream has been rewritten (otherwise that would have lead to the exact same problem). Graceful handling of a TimeoutException must have been just a byproduct of this.

As a workaround, you can add CancellationTimeout=-1 to a connection string - ~that way, while the long running query wouldn’t be cancelled on the database, you also wouldn’t get the Unknown message code exception~ nevermind, we always cancel the query no matter the CancellationTimeout value, so that workaround should work perfectly.

I’ve submitted a PR fixing this error, hopefully this shouldn’t be a problem anymore.

@tomwollert the exact error code (3, 23, 41, etc) usually don’t matter that much, since the error itself indicates that the protocol sync has been lost between Npgsql and PostgreSQL, and at that point anything is possible. I’m going to try and see whether I can reproduce it on PostgreSQL 10…

@vonzshik That sucks 😕 I’ll try to get a standalone console app that reproduces it. I’m assuming it’s some dependency.

Replicating message code 3:

var connectionString = "{server/username/password};CommandTimeout=5;Connection Idle Lifetime=50;Keepalive=5";
            var commandTimeoutTask = Task.Run(() =>
            {
                try
                {
                    using (var connection = new NpgsqlConnection(connectionString))
                    {
                        var stopwatch = Stopwatch.StartNew();
                        connection.Open();

                        using (var command = new NpgsqlCommand("select pg_sleep(10)", connection))
                            command.ExecuteNonQuery();
                    }
                }
                catch (Exception e)
                {
                    Console.WriteLine(e.Message);
                }
            });

            var tryingToGetBrokenConnection = Task.Run(() =>
            {
                while (true)
                {
                    using (var connection = new NpgsqlConnection(connectionString))
                    {
                        connection.Open();
                        using (var command = new NpgsqlCommand("select 1", connection))
                            command.ExecuteNonQuery();
                    }
                }
            });
            commandTimeoutTask.Wait();
            tryingToGetBrokenConnection.Wait(TimeSpan.FromSeconds(5));

Though I guess this will also fail to reproduce as the 23 on one triggers the 3 on the other statement.

K, “23” is very straight forward:

var connection = new NpgsqlConnection("{server/username/password};CommandTimeout=30;Connection Idle Lifetime=30;Keepalive=5");

            var stopwatch = Stopwatch.StartNew();
            connection.Open();
            using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
            {

                using (var command = new NpgsqlCommand("select pg_sleep(40)", connection))
                {
                    command.ExecuteNonQuery();
                }

                transaction.Commit();
            }

            Console.WriteLine(stopwatch.Elapsed.TotalMilliseconds + " ms have passed");

This will throw the 23.