npgsql: Suspected race condition in NpgsqlConnector.EndUserAction

Steps to reproduce

Unfortunately, so far I have not been able to identify steps to reproduce this problem. We only know that most often the problem occurs when there are certain problems on the database server

The issue

From time to time I see an error with the following content:

Exception type: System.Threading.SemaphoreFullException
Exception message: Adding the specified count to the semaphore would cause it to exceed its maximum count.
Stack trace: 
at System.Threading.SemaphoreSlim.Release(Int32 releaseCount)
   at Npgsql.NpgsqlConnector.EndUserAction()
   at Npgsql.NpgsqlDataReader.Cleanup(Boolean async, Boolean connectionClosing)
   at Npgsql.NpgsqlDataReader.Close(Boolean connectionClosing, Boolean async)
   at Npgsql.NpgsqlConnector.CloseOngoingOperations(Boolean async)
   at Npgsql.NpgsqlConnection.<Close>g__CloseOngoingOperationAndFinish|76_0(NpgsqlConnection connection, NpgsqlConnector connector, Boolean async)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.CloseAsync()
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.CloseAsync()
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.CleanupCommandAsync(DbCommand command, IRelationalConnection connection)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(DbContext _, Boolean result, CancellationToken cancellationToken)
   at Npgsql.EntityFrameworkCore.PostgreSQL.Storage.Internal.NpgsqlExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

After that, using dotnet-counters, I see an increased number of busy connections. If you look at pg_stat_activity, we see that there are actually no “busy” connections, there are Idle connections that are no longer used for Keep Alive.

Further technical details

Npgsql version: 4.1.6 Microsoft.EntityFrameworkCore version: 3.1.10 PostgreSQL version: PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit Operating system: Linux centos-release-7-5.1804.el7.centos.x86_64

Other details about my project setup:

Keepalive enabled! ConnectionString: User ID=user;Password=password;Host=host;Port=5432;Database=dbname;Keepalive=1;Pooling=true;Minimum Pool Size=69;Maximum Pool Size=70;Command Timeout=300;Application Name=AppWithProblems;Max Auto Prepare=150;

I looked at the code, but the only place I have doubts about is NpgsqlConnector.EndUserAction

if (_isKeepAliveEnabled)
{
    lock (this)
    {
       if (IsReady || !IsConnected)
         return;

        var keepAlive = Settings.KeepAlive * 1000;
        _keepAliveTimer!.Change(keepAlive, keepAlive);

         Log.Trace("End user action", Id);
         _currentCommand = null;
         _userLock!.Release();
         State = ConnectorState.Ready;
    }
}

Specifically the following code:

 if (IsReady || !IsConnected)
   return;

Here, two comparisons are made with a variable that can potentially change at the moment when the first comparison has already been made, and the second has not yet. (Both properties (IsReady and IsConnected) compare the State property to ConnectorState enum elements. State property declared as:

internal ConnectorState State
        {
            get => (ConnectorState)_state;
            set
            {
                var newState = (int)value;
                if (newState == _state)
                    return;
                Interlocked.Exchange(ref _state, newState);
            }
        }

What do you think about that? Could there be a race condition here, or am I wrong?

Would a fix like this help here ?:

//NpgsqlConnector.EndUserAction:
//...
if (_isKeepAliveEnabled)
{
    lock (this)
    {
        var state = State;
        if (state == ConnectorState.Ready || 
           state == ConnectorState.Closed ||
           state == ConnectorState.Connecting ||
           state == ConnectorState.Broken) return;
//...

About this issue

Original URL
State: closed
Created 3 years ago
Comments: 18 (18 by maintainers)

Most upvoted comments

@neyromant just something that came to my mind.

at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.CleanupCommandAsync(DbCommand command, IRelationalConnection connection)
at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)

From your stack trace, which is peculiar, as it should happen only if there was an exception while executing a reader (refs). Which leads me to believe that we get the exception which npgsql doesn’t handle that well, and it leaves the connection is a bad state. And after that, we get another exception on the cleanup, which swallows the original one. I think, you should enable the efcore logging, so we can get a clean picture of the original exception.

vonzshik on Feb 12, 2021

I did attempt to replicate the issue, but so far no problems(

vonzshik on Feb 5, 2021

@roji thanks for the detailed answer. Ok, now I have disabled Keepalive, I need to watch for a few days, and I will write the results. I will also check the applications for concurrency problems in more detail. I will not be able to try the new driver version earlier than in a couple of weeks.

neyromant on Feb 2, 2021