npgsql: NRE when calling Complete() in NpgsqlBinaryImporter

The issue

I am running some load tests with Postgres (Program.cs) and importing a lot of data using my project PostgreSQLCopyHelper, which is a thin wrapper over Npgsql's binary importer.

For some imports I am seeing a rare race condition that I cannot reliably reproduce. I avoided any parallelism to rule out race conditions on my side, but I still get a rare NullReferenceException in the NpgsqlBinaryImporter.Cleanup() method.

If I read the TODO in the NpgsqlConnector correctly (here), this race condition is known to be possible: “There’s probably a race condition as the COPY operation may finish on its own during the next few lines”.

Now when the Complete method throws and the Cleanup method then throws the NullReferenceException, I lose all information about the original exception from the COPY operation, because as far as I can see it isn’t logged anywhere.

Would it be useful to log the original exception in the NpgsqlBinaryImporter.Complete method?

And would you accept a PR?

Something along these lines:

/// <summary>
/// Completes the import operation. The writer is unusable after this operation.
/// </summary>
public ulong Complete()
{
    CheckReady();

    if (InMiddleOfRow)
    {
        Cancel();
        throw new InvalidOperationException("Binary importer closed in the middle of a row, cancelling import.");
    }

    try
    {
        WriteTrailer();
        _buf.Flush();
        _buf.EndCopyMode();

        _connector.SendMessage(CopyDoneMessage.Instance);
        var cmdComplete = Expect<CommandCompleteMessage>(_connector.ReadMessage());
        Expect<ReadyForQueryMessage>(_connector.ReadMessage());
        _state = ImporterState.Committed;
        return cmdComplete.Rows;
    }
    catch (Exception e)
    {
        Log.Error("COPY Complete failed", e);

        // An exception here will have already broken the connection etc.
        Cleanup();
        throw;
    }
}

Stacktrace

System.NullReferenceException: Object reference not set to an instance of an object.
   at Npgsql.NpgsqlBinaryImporter.Cleanup() in C:\projects\npgsql\src\Npgsql\NpgsqlBinaryImporter.cs:line 370
   at Npgsql.NpgsqlBinaryImporter.Complete() in C:\projects\npgsql\src\Npgsql\NpgsqlBinaryImporter.cs:line 311
   at PostgreSQLCopyHelper.PostgreSQLCopyHelper`1.SaveAll(NpgsqlConnection connection, IEnumerable`1 entities)
   at TimescaleExperiment.Sql.Client.LocalWeatherDataBatchProcessor.InternalWrite(IList`1 measurements) in D:\github\GermanWeatherDataExample\GermanWeatherData\TimescaleDB\Source\TimescaleExperiment\Sql\Client\LocalWeatherDataBatchProcessor.cs:line 81
   at TimescaleExperiment.Sql.Client.LocalWeatherDataBatchProcessor.Write(IList`1 measurements) in D:\github\GermanWeatherDataExample\GermanWeatherData\TimescaleDB\Source\TimescaleExperiment\Sql\Client\LocalWeatherDataBatchProcessor.cs:line 53

Further technical details

Npgsql version: 4.0.0
PostgreSQL version: 10.6
Operating system: Windows 10

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

Not sure what we want to do here, but one option (#2398) is to just guard against the NRE and surface the underlying exception.

I think the NRE guard is good enough. An early design decision was to handle any I/O exception and break the connector in a single low-level place (i.e. NpgsqlReaderBuffer.Ensure()), meaning that higher-level layers (such as COPY) must be ready to deal with already-broken connectors.
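
For illustration, such a guard could look roughly like the sketch below. This is not Npgsql's actual Cleanup code; the idea is simply to tolerate a connector whose state has already been torn down, so that the exception thrown inside Complete() is the one that surfaces instead of a secondary NullReferenceException. The member names other than _connector and _buf are illustrative.

// Hypothetical sketch of an NRE guard in the cleanup path (not Npgsql's actual code).
// If the connector is already broken, skip the per-connector teardown instead of
// dereferencing null and masking the original COPY exception.
void Cleanup()
{
    var connector = _connector; // may already be null on a broken connector
    if (connector != null)
    {
        connector.CurrentCopyOperation = null; // illustrative member
        connector.EndUserAction();             // illustrative member
    }

    _connector = null;
    _buf = null;
    _isDisposed = true;
}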

We should backport this to 4.0.6, no?

I was able to reproduce this issue without any concurrency.

var csb = new NpgsqlConnectionStringBuilder
{
    Host = "...",
    Username = "...",
    Password = "...",
    Database = "...",
    CommandTimeout = 3 // <--
};
using (var conn = new NpgsqlConnection(csb.ConnectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    {
        const int rowCount = 1000000;
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = $"CREATE TABLE public.master AS SELECT * FROM generate_series(1, {rowCount}) id";
            cmd.ExecuteNonQuery();
            cmd.CommandText = "ALTER TABLE public.master ADD CONSTRAINT master_pk PRIMARY KEY (id)";
            cmd.ExecuteNonQuery();
            cmd.CommandText = "CREATE TABLE public.detail (master_id integer NOT NULL, CONSTRAINT detail_fk FOREIGN KEY(master_id) REFERENCES master(id))";
            cmd.ExecuteNonQuery();
        }
        using (var writer = conn.BeginBinaryImport("COPY public.detail FROM STDIN BINARY"))
        {
            for (var i = 1; i <= rowCount; ++i)
            {
                writer.StartRow();
                writer.Write(i);
            }
            writer.Complete(); // <-- NullReferenceException
        }
        //tx.Commit();
    }
}

The NullReferenceException is raised in the SimpleFrontendMessage.Write (line 80) if the following conditions are met:

  1. The NpgsqlConnectionStringBuilder.CommandTimeout parameter is greater than 0.
  2. Any command is executed before importing data into the table.
  3. The time spent in NpgsqlBinaryImporter.Complete at line 304 (Expect<CommandCompleteMessage>(_connector.ReadMessage());) is greater than the NpgsqlConnectionStringBuilder.CommandTimeout parameter value.
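
Until this is fixed, one application-side workaround (just a sketch, not something suggested by the maintainers here) is to raise the command timeout well above the expected COPY duration, or to disable it entirely; CommandTimeout = 0 means "wait indefinitely" in Npgsql:

// Workaround sketch: prevent the read timeout from firing while Complete()
// waits for the CommandComplete message at the end of the COPY.
var csb = new NpgsqlConnectionStringBuilder
{
    Host = "...",
    Username = "...",
    Password = "...",
    Database = "...",
    CommandTimeout = 0 // 0 = no timeout; alternatively, use a value larger than the longest expected COPY
};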

Npgsql 4.0.5
PostgreSQL 11.2
Ubuntu 18.04.2

@akusmin this kind of error has been reported several times in the past, and has usually been a result of an application concurrency error, where the same connection is being used at the same time by multiple threads. I’m not saying that there isn’t an Npgsql-side issue, but we need to see some sort of repro in order to work on it. Usually this kind of thing should be reproducible by running a tight loop that does the same action over and over (possibly from multiple threads) until the error is triggered.

OK.

I can recommend trying to reproduce this in a small code sample (preferably a console app); this is usually possible by running an endless loop that repeats the action, and if there is indeed a race condition in there, it will pop up sooner or later. This also allows us to reproduce the issue on our side and find the root cause.
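
For example, a minimal console-app loop along these lines would repeat the import until the error shows up. The connection string is a placeholder, and it assumes the public.detail table from the repro above already exists:

// Hypothetical tight-loop repro harness: repeat the binary import until the
// NullReferenceException (or the underlying COPY failure) pops up.
var connectionString = "Host=...;Username=...;Password=...;Database=..."; // placeholder
using (var conn = new NpgsqlConnection(connectionString))
{
    conn.Open();
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            using (var writer = conn.BeginBinaryImport("COPY public.detail FROM STDIN BINARY"))
            {
                for (var i = 1; i <= 1000000; ++i)
                {
                    writer.StartRow();
                    writer.Write(i);
                }
                writer.Complete();
            }
            Console.WriteLine($"Attempt {attempt} succeeded");
        }
        catch (Exception e)
        {
            Console.WriteLine($"Attempt {attempt} failed: {e}");
            break;
        }
    }
}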

Closing this for now as suggested, but please report back with any results.