npgsql: NRE when calling Complete() in NpgsqlBinaryImporter
The issue
I am running some load tests with Postgres (Program.cs) and import a lot of data using my project PostgreSQLCopyHelper, which is a thin wrapper over the Binary Importer of Npgsql.
For some imports I am seeing a rare Race Condition, that I cannot reliably reproduce. I have used no parallelism to avoid any race conditions, but I still get rare NullReferenceException in the NpgsqlBinaryImporter.Cleanup() method.
This race condition is known to occur, when I read the TODO in the NpgsqlConnector right (here): “There’s probably a race condition as the COPY operation may finish on its own during the next few lines”.
Now when the Complete Method has an exception and the Cleanup method throws the NullReferenceException, I lose all information about the original exception in the COPY method as it isn’t logged as far as I can see.
Would it be useful to log the `original exception in the NpgsqlBinaryImport.Complete method?
And would you accept a PR?
Something along the Lines:
/// <summary>
/// Completes the import operation. The writer is unusable after this operation.
/// </summary>
public ulong Complete()
{
CheckReady();
if (InMiddleOfRow)
{
Cancel();
throw new InvalidOperationException("Binary importer closed in the middle of a row, cancelling import.");
}
try
{
WriteTrailer();
_buf.Flush();
_buf.EndCopyMode();
_connector.SendMessage(CopyDoneMessage.Instance);
var cmdComplete = Expect<CommandCompleteMessage>(_connector.ReadMessage());
Expect<ReadyForQueryMessage>(_connector.ReadMessage());
_state = ImporterState.Committed;
return cmdComplete.Rows;
}
catch(Exception e)
{
Log.Error("COPY Complete failed", e);
// An exception here will have already broken the connection etc.
Cleanup();
throw;
}
}
Stacktrace
System.NullReferenceException: Object reference not set to an instance of an object.
at Npgsql.NpgsqlBinaryImporter.Cleanup() in C:\projects\npgsql\src\Npgsql\NpgsqlBinaryImporter.cs:line 370
at Npgsql.NpgsqlBinaryImporter.Complete() in C:\projects\npgsql\src\Npgsql\NpgsqlBinaryImporter.cs:line 311
at PostgreSQLCopyHelper.PostgreSQLCopyHelper`1.SaveAll(NpgsqlConnection connection, IEnumerable`1 entities)
at TimescaleExperiment.Sql.Client.LocalWeatherDataBatchProcessor.InternalWrite(IList`1 measurements) in D:\github\GermanWeatherDataExample\GermanWeatherData\TimescaleDB\Source\TimescaleExperiment\Sql\Client\LocalWeatherDataBatchProcessor.cs:line 81
at TimescaleExperiment.Sql.Client.LocalWeatherDataBatchProcessor.Write(IList`1 measurements) in D:\github\GermanWeatherDataExample\GermanWeatherData\TimescaleDB\Source\TimescaleExperiment\Sql\Client\LocalWeatherDataBatchProcessor.cs:line 53
Further technical details
Npgsql version: 4.0.0 PostgreSQL version: 10.6 Operating system: Windows 10
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 18 (9 by maintainers)
I think the NRE guard is good enough. An early design decision was to handle any I/O exceptions and break the connector in one, low-level place (i.e.
NpgsqlReaderBuffer.Ensure()), meaning that high-level layers (such as copy) must be ready to deal with already-broken connectors.We should backport this to 4.0.6, no?
I was able to reproduce this issue without any concurrency.
The
NullReferenceExceptionis raised in theSimpleFrontendMessage.Write(line 80) if the following conditions are met:NpgsqlConnectionStringBuilder.CommandTimeoutparameter is greater than 0.NpgsqlBinaryImporter.Completeline 304 (Expect<CommandCompleteMessage>(_connector.ReadMessage());) is greater, than theNpgsqlConnectionStringBuilder.CommandTimeoutparameter value.Npgsql 4.0.5 PostgreSQL 11.2 Ubuntu 18.04.2
@akusmin this kind of error has been reported several times in the past, and has usually been a result of an application concurrency error, where the same connection is being used at the same time by multiple threads. I’m not saying that there isn’t an Npgsql-side issue, but we need to see some sort of repro in order to work on it. Usually this kind of thing should be reproducible by running a tight loop that does the same action over and over (possibly from multiple threads) until the error is triggered.
OK.
I can recommend trying to reproduce in a small code sample (console app preferably), this is usually possible by doing an endless loop and repeating the action, and if indeed there’s a race condition in there it pops up sooner or later. This also allows us to reproduce the issue on our side and find out the root cause.
Closing this for now as suggested, but please report back with any results.