npgsql: Connection is closed after command timeout

I’m trying to get the code from #400 working again in Npgsql 3 (dev branch) and it now fails really badly: after the first timeout the connection is closed, so I cannot roll back to the savepoint and continue! Note that I haven’t modified the connection timeout - it’s still the default 15 seconds (though I don’t think it would even matter once the connection is open).

Code:

private static void TestTimeoutAndRetryIssue(NpgsqlConnection connection)
{
    using (var transaction = connection.BeginTransaction())
    {
        using (var command = new NpgsqlCommand("SELECT :value FROM pg_sleep(:seconds)", connection))
        {
            command.Parameters.AddWithValue(":value", "Initial");
            command.Parameters.AddWithValue(":seconds", 5);

            var result = ExecuteScalarWithRetry(transaction, command, 5, 1);
        }
    }
}

private static object ExecuteScalarWithRetry(NpgsqlTransaction transaction, NpgsqlCommand command, int delaySec, int initialTimeoutSec)
{
    transaction.Save("ExecuteScalarWithRetry");

    command.CommandTimeout = initialTimeoutSec;
    command.Parameters[":value"].Value = "Attempt 1";

    try
    {
        return command.ExecuteScalar();
    }
    catch (NpgsqlException ex)
    {
        Console.WriteLine("Exception in ExecuteScalarWithRetry: " + ex);
        Console.WriteLine("Connection state = " + command.Connection.State); // Closed!
    }

    return null; // rolling back to the savepoint and retrying would happen here; omitted in this repro
}

Exception:

Npgsql.NpgsqlException (0x80004005): Exception while reading from stream ---> System.IO.IOException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at Npgsql.ReadBuffer.Ensure(Int32 count, Boolean dontBreakOnTimeouts) in E:\Source\npgsql\src\Npgsql\ReadBuffer.cs:line 115
   at Npgsql.ReadBuffer.Ensure(Int32 count, Boolean dontBreakOnTimeouts) in E:\Source\npgsql\src\Npgsql\ReadBuffer.cs:line 133
   at Npgsql.NpgsqlConnector.DoReadMessage(DataRowLoadingMode dataRowLoadingMode, Boolean isPrependedMessage) in E:\Source\npgsql\src\Npgsql\NpgsqlConnector.cs:line 884
   at Npgsql.NpgsqlConnector.ReadMessageWithPrepended(DataRowLoadingMode dataRowLoadingMode) in E:\Source\npgsql\src\Npgsql\NpgsqlConnector.cs:line 856
   at Npgsql.NpgsqlConnector.ReadMessage(DataRowLoadingMode dataRowLoadingMode) in E:\Source\npgsql\src\Npgsql\NpgsqlConnector.cs:line 836
   at Npgsql.NpgsqlConnector.ReadExpectingT in E:\Source\npgsql\src\Npgsql\NpgsqlConnector.cs:line 1111
   at Npgsql.NpgsqlDataReader.NextResultInternal() in E:\Source\npgsql\src\Npgsql\NpgsqlDataReader.cs:line 409
   at Npgsql.NpgsqlDataReader.NextResult() in E:\Source\npgsql\src\Npgsql\NpgsqlDataReader.cs:line 320
   at Npgsql.NpgsqlCommand.<Execute>d__81.MoveNext() in E:\Source\npgsql\src\Npgsql\NpgsqlCommand.cs:line 769
   --- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Npgsql.NpgsqlCommand.<ExecuteScalar>d__98.MoveNext() in E:\Source\npgsql\src\Npgsql\NpgsqlCommand.cs:line 1066
   --- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Npgsql.NpgsqlCommand.ExecuteScalar() in E:\Source\npgsql\src\Npgsql\NpgsqlCommand.cs:line 1041
   at NpgsqlTest.Program.ExecuteScalarWithRetry(NpgsqlTransaction transaction, NpgsqlCommand command, Int32 delaySec, Int32 initialTimeoutSec)

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Our timeouts are not network related. They’re definitely due to the query taking too long on the server (so a server-side timeout worked well for us). Sometimes it’s just due to the load on the server at the time, so it makes sense to re-try. Other times something really weird is going on inside Postgres and we’ve just never been able to track down why a query that usually returns in 100 ms sometimes takes 5 minutes.

Wow, I can feel your pain. For cases where the server is overloaded, it may make sense to set the timeout to a high value (or even disable timeouts altogether) - if the server is loaded you may as well wait for the query to complete, no matter how long it takes, rather than cancelling and rerunning it (which offers no advantage and actually repeats a lot of work).

However, a query taking 5 minutes is probably something else. One frequent cause is some other connection holding a long-lived transaction which already updated a row you’re trying to update. PostgreSQL locks rows when they are modified in a transaction, so your 5-minute query is possibly waiting for the other transaction to complete. This would also explain why you can’t reproduce it - it’s a true concurrency issue. Long-lived transactions are in general a big problem, especially if you have many of them touching the same data.
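
If you want to check whether that’s what’s happening, something along these lines will show who is blocking whom (just a sketch - pg_blocking_pids() needs PostgreSQL 9.6 or later, and connection is assumed to be an open NpgsqlConnection):

// Sketch: list sessions that are currently blocked and the PIDs blocking them.
using (var cmd = new NpgsqlCommand(
    @"SELECT pid, array_to_string(pg_blocking_pids(pid), ',') AS blocked_by, query
      FROM pg_stat_activity
      WHERE cardinality(pg_blocking_pids(pid)) > 0", connection))
using (var reader = cmd.ExecuteReader())
    while (reader.Read())
        Console.WriteLine("pid {0} blocked by {1}: {2}", reader[0], reader[1], reader[2]);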

While I’d really recommend attempting to solve the root cause, I do appreciate that it can be hard to track down issues like this, and hacky workarounds like timeout+retry are sometimes necessary.

If you do end up setting statement_timeout yourself, keep in mind that you have to reset it after the query - it’s a session parameter that persists until changed. When Npgsql managed statement_timeout for you, it would check whether a timeout change was needed before each command.
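
As a rough illustration, manually scoping statement_timeout to a single command could look something like this (the 5000 ms value and the table name are just placeholders; inside an explicit transaction, SET LOCAL statement_timeout reverts automatically at commit or rollback):

// Sketch: cap this one query at 5 seconds on the server, then restore the default.
using (var set = new NpgsqlCommand("SET statement_timeout = 5000", connection))
    set.ExecuteNonQuery();
try
{
    using (var cmd = new NpgsqlCommand("SELECT count(*) FROM my_table", connection))
        cmd.ExecuteScalar();   // the server cancels it with error 57014 if it exceeds 5 seconds
}
finally
{
    using (var reset = new NpgsqlCommand("RESET statement_timeout", connection))
        reset.ExecuteNonQuery();
}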

I think it definitely makes sense to cancel the query when a timeout occurs. After all, if the Execute* method throws an exception there is no way the user can read the results. If the statement modified data the user would probably also have to roll back to a savepoint to get the DB to a known state (which is what we do).

I agree.
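
For context, the savepoint-based retry pattern we’re talking about would look roughly like this (a sketch only; it assumes the connection stays usable after the timeout, which is exactly what this issue asks for, and uses NpgsqlTransaction’s Save/Rollback savepoint overloads):

object ExecuteScalarWithRetry(NpgsqlTransaction transaction, NpgsqlCommand command, int maxAttempts)
{
    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        transaction.Save("retry_point");
        try
        {
            return command.ExecuteScalar();
        }
        catch (NpgsqlException) when (attempt < maxAttempts)
        {
            // Roll back to the savepoint to undo any partial work from the timed-out
            // attempt, while keeping the enclosing transaction (and connection) alive.
            transaction.Rollback("retry_point");
        }
    }
    throw new InvalidOperationException("maxAttempts must be at least 1");
}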

When a socket timeout occurs, an exception is thrown to the user. At this point the query results are still pending, while control has been transferred back to the user. The user may execute a new statement, and Npgsql has to internally track the state of things and be aware that the previous statement’s results must be consumed before the new statement’s results can be read.

I think I see what you mean. Even if the query was cancelled, some results may still arrive afterwards? Is there a way to detect the end of the results for the cancelled query?

Yes, query cancellation is a best-effort asynchronous operation, which can cancel the current query at any point in time (or even not at all), so Npgsql must be prepared to consume any results from the cancelled query. Detecting the end of the results of the cancelled query isn’t hard - every executed command is ultimately terminated by a Sync message, which produces a ReadyForQuery response that signals the end of the query. We’d have to track how many cancelled queries are still in-flight, and read that many ReadyForQuery responses. There are some complications but it’s doable.

If so, Npgsql could block until it receives them. That is, it could catch the socket timeout exception, send a cancellation to the server and wait for any remaining results for this query from the server before throwing a timeout exception to the user. This way it wouldn’t have to keep track of any special state for the next query.

Unfortunately that isn’t possible. In case of an actual network problem, this would make Npgsql block forever on the remaining results, defeating the whole purpose of the timeout. Whatever we do, an exception must be raised when the timeout occurs.

The larger issue here is that there are two very different scenarios being handled via the same timeout concept:

  1. Network issue, dropped connection. If this occurs, it’s possible that nothing will ever be received from the socket. Sending a cancellation has no effect (network isn’t working). Socket timeouts are the only way to deal with this.
  2. Query taking too long or deadlocking. In this case cancellation works well. statement_timeout can be used to deal with this.

To summarize, the solution here would be:

  • When a socket timeout occurs, keep track of how many ReadyForQuerys are pending on the connector, send a PostgreSQL cancellation and throw an exception. Note the Simple query optimization introduced in 3.2, which means that a single NpgsqlCommand may produce multiple CommandCompletes, but always just one ReadyForQuery.
  • When we execute a new command, we first check whether ReadyForQuerys are pending, and if so, we attempt to consume them before (or after?) sending the new query. This may time out again (e.g. an ongoing network issue). A rough sketch of this bookkeeping follows below.

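Very roughly, the bookkeeping could look like this (an illustrative sketch only, not Npgsql’s actual connector code; all names here are hypothetical):

using System;

// Hypothetical sketch of the connector-side bookkeeping described above.
class ConnectorSketch
{
    int _pendingReadyForQuery;

    // Called when a read on the socket times out.
    public void OnSocketTimeout()
    {
        _pendingReadyForQuery++;        // the cancelled query still owes us one ReadyForQuery
        SendCancellationRequest();      // best-effort; has no effect if the network is down
        throw new TimeoutException("Command timed out; cancellation sent to the server");
    }

    // Called before sending a new command on this connector.
    public void DrainPendingResults()
    {
        while (_pendingReadyForQuery > 0)
        {
            ReadUntilReadyForQuery();   // may itself time out if the network issue persists
            _pendingReadyForQuery--;
        }
    }

    void SendCancellationRequest() { /* open a second connection and send a CancelRequest */ }
    void ReadUntilReadyForQuery()  { /* consume and discard messages up to ReadyForQuery */ }
}
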
I’m not sure I’m going to have time to implement this in the very near future, there are many other things going on at the moment. But I’ll make an effort.

Better (really) late than never 🤣

Wow, fixed almost 4 years later! Thank you.

That’s (mostly) right. Since 3.1.0, CommandTimeout is a client-side socket timeout only - Npgsql will simply throw an exception and close the socket, but this doesn’t affect the query running at the server. This issue is about changing the timeout behavior to send a server cancellation when the timeout occurs, but that won’t happen before 3.3.

In the meantime, it’s definitely possible to use statement_timeout to have PostgreSQL abort your long-running query, as you say. If you go down this route you may want to disable the client-side socket timeout by setting CommandTimeout to 0, or at least set it to something longer than whatever statement_timeout is set to. Note that the client-side socket timeout also protects you against network drops which may leave you reading forever, something statement_timeout doesn’t do.
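
For example (illustrative values; remember that CommandTimeout is in seconds while statement_timeout is in milliseconds):

// Option 1: rely on statement_timeout alone and disable the client-side socket timeout.
command.CommandTimeout = 0;

// Option 2: keep a socket timeout as a safety net against network drops,
// but make it longer than the server-side statement_timeout (e.g. 60000 ms).
command.CommandTimeout = 120;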

The only inaccurate thing in what you say is the last sentence… Npgsql’s CommandTimeout (30 seconds by default) does cause a read timeout at the client socket level, but this doesn’t (necessarily) affect your query on the server side.

Hope this clarifies things, otherwise feel free to ask for more details.