typeorm: Broken Postgres clients are released back into the pool, but should be removed

Issue type:

[x] bug report

Database system/driver:

[x] postgres

It may impact other drivers if they have similar semantics/expectations as pg.Pool.

TypeORM version:

[x] latest

Explanation of the problem

Currently in TypeORM, if a client suffers an unrecoverable error - for example, if the underlying connection goes away, such as during a DB failover - there is no protection in place to stop that client being added back into the pg.Pool. This broken client will then be handed out in future even though it’ll never be able to execute a successful query.

Although pg.Pool itself does listen for error events on clients within the pool - and actively removes any which do emit errors - it doesn’t catch everything. It is considered the responsibility of the user of the pool to release known broken clients by calling client.release(true). The truthy argument tells the pool to destroy the connection instead of adding it back to the pool

https://node-postgres.com/api/pool#releaseCallback

The releaseCallback releases an acquired client back to the pool. If you pass a truthy value in the err position to the callback, instead of releasing the client to the pool, the pool will be instructed to disconnect and destroy this client, leaving a space within itself for a new client.

If a client’s connection does break, it’s very difficult to debug, as a handful of queries will begin failing with an error like Error: Client has encountered a connection error and is not queryable while others will continue executing fine.

There is further discussion about the impact of this in the node-postgres repo: https://github.com/brianc/node-postgres/issues/1942

Steps to reproduce or a small repository showing the problem

  • Set pool size to 1
  • Run a small loop which repeatedly performs a query, e.g.
while (true) {
  try {
    await getManager().query('select pg_sleep(4)')
    console.log('success')
  } catch (err) {
    console.log('error', err)
  }

  console.log('done')
  await sleep(1000)
}
  • Wait for the query to run at least once. This should allocate a connection from the pool. Subsequent invocations of the query should use the same connection.
  • Kill the connection while query is running. On a Mac or Linux, this is easily done by finding and killing the process, e.g. ps aux | grep postgres | grep SELECT
  • The loop continue to throw errors, as the broken connection has been returned to the pool and continues to be handed out. This shouldn’t happen: we should have told pg.Pool the connection is broken so it can be removed from the pool, so the next query should get a new, unbroken, connection

I believe the reason this hasn’t been noticed before (at least not that I could see) is because it’s really only likely to happen if the actual database connection breaks. The majority of QueryFailedErrors are caused by dodgy SQL etc, none of which will render a client unusable. And, usually, if your database is killing connections, you’ve got other problems to think about 😅

We only noticed it because we run PgBouncer in between TypeORM and our Postgres server. When we redeployed PgBouncer, it would kill some of the active client connections in the pool, but because pg.Pool never found out about it, those connections remained in the pool indefinitely, causing a steady stream of errors even though everything else was fine.

Fix

I have a working fix here: https://github.com/loyaltylion/typeorm/commit/6bd52e04ffa5ba229874eecaac9c78b1628eb1ae

If this fix looks suitable, I’d be happy to create a PR to get this merged. It only applies to postgres, but could be extended to other drivers if we think they’d benefit.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 20
  • Comments: 25 (2 by maintainers)

Commits related to this issue

Most upvoted comments

any new updates?

Any help now is greatly appreciated.

What worked for us is to check for broken connections at the beginning of every invocation. This will run a simple query on every client of the pool, and if that fails, will reconnect to the database.

Call this method from inside the event handler function.

const reconnectToDatabase = async () => {
  const connection = getConnection();
  const driver = connection.driver as any;
  for (const client of driver.master._clients) {
    try {
      await client.query('SELECT 1');
    } catch (error) {
      console.info('Reconnecting ...');
      await getConnection().driver.disconnect();
      await getConnection().driver.connect();
      break;
    }
  }
}

(FYI, @brianc - maybe that helps for node-postgres side of the story 😃 )

I’ve opened a PR #7792 which passes the error back to the postgres pool when releasing the connections.

However, for reasons listed above it’s pretty difficult to test this beyond manually doing things. We also don’t really do mocks.

Anyone else want to test it, too?

Any updates on this one?

Any updates on this?

I am not entirely convinced that ‘pg’ is the right layer for fixing this.

One possible workaround for this issue is to have Typeorm pass any connection-level errors back to ‘pg’, so that the broken connections can be removed from the pool. This behavior is not universally correct, as it would also close connections that hit for example a unique constraint, which can be the wrong thing to do if your code relies heavily on ignoring such errors at a high rate. Maybe there’s a reliable way to tell ‘connection interrupted’ type of errors from more everyday database errors, but I haven’t looked into it any deeper.

Here’s a link to my workaround for this particular issue.

It successfully cleans up dead connections after performing a database failover on AWS Aurora Postgres, which hung around forever without the workaround.

This didn’t work out for us. The error is still being thrown and not caught by the try-catch block in the fix mentioned above. Despite a lot of digging through code and GitHub issues, we’re still on it to figure out what exactly is going on, but I’m having a hunch that this change here could be causing the issue:

https://github.com/brianc/node-postgres/issues/1324

Knex is having a similar, maybe even the same, issue: https://github.com/knex/knex/issues/3523

BTW, this is happening in our AWS Lambda environment, using Node 10 & 12. @seriousManual also mentioned a serverless environment. @592da Same for you?

Head over to node-postgres for hot updates on the topic: https://github.com/brianc/node-postgres/issues/2112

@michaelseibt please update - having this issue over production @pleerock this issue is serious https://github.com/typeorm/typeorm/issues/2623#issuecomment-557819167 #5387 #5223 #5112

Please advise.