knex: Connection terminated unexpectedly (Postgresql / AWS Lambda)

Environment

Knex version: 0.20.8
Database + version: Postgres 9.5.14
OS: AWS Lambda - Node 12.x

Recently, I’ve been seeing an uptick in a “connection terminated unexpectedly” error. I can’t find a discernible pattern at this point, but unlike #3523, it doesn’t seem to occur after long periods of activity; I see instances that are minutes apart in some cases.

I added some logging to the Knex afterCreate method and am seeing two errors:

  1. The connection terminated unexpectedly error
  2. Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed – this seems to happen on the first or second invocation after seeing the connection terminated unexpectedly error.

I’ve tried a number of configurations with Knex pooling:

  1. { pool: { min: 1, max: 1 } }
  2. { pool: { min: 0 } }
  3. { pool: { min: 1 } }

I’ve also tried setting { pool: { idleTimeoutMillis: 300000 } } as per #3523.

Additional information:

  • I’m defining my Knex instance outside of my AWS handlers, so the pool should be reused per container (a sketch of the overall setup follows the snippet below).
  • I’m using the following for my afterCreate function – I’m not sure if there’s a better way to get at the underlying error?
function afterCreate(connection, done) {
  // log any error emitted by the underlying pg connection
  connection.on('error', (err) => {
    console.log('POSTGRES CONNECTION ERROR');
    console.error(err);
  });
  done(null, connection);
}
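
Roughly, the overall setup looks like this (the connection string, table name, and handler body are placeholders, not my actual code):

// db.js – the knex instance is created once per Lambda container, outside any handler
const knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL,  // placeholder; the real connection config differs
  pool: { min: 1, max: 1, afterCreate }, // afterCreate is the hook shown above
});

function afterCreate(connection, done) {
  // same hook as above: surface errors from the raw pg connection
  connection.on('error', (err) => console.error('POSTGRES CONNECTION ERROR', err));
  done(null, connection);
}

module.exports = knex;

// handler.js – the pool above is reused for as long as the container stays warm
const db = require('./db');

exports.handler = async () => {
  const rows = await db('some_table').select('*'); // placeholder query
  return { statusCode: 200, body: JSON.stringify(rows) };
};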

Appreciate any help! This is driving me nuts 😃

UPDATE: the only change I’ve made is migrating the Lambda function from Node 8 to Node 12, since AWS is removing support for Node 8. Unfortunately, it doesn’t look like I can go back. My other functions in production are on Node 8 and not seeing any issues 🤔🤔🤔

ISSUE FOUND: for whatever reason, reverting back to Node 8.x fixes the issue. Still not certain on the underlying cause.

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 15
  • Comments: 71 (10 by maintainers)

Most upvoted comments

Any solution? Urgent for us. Thanks.

disconnect and reconnect a knex instance without destroying and re-creating

await knex.destroy();
knex.init();

This should drop the old connections and reinitialize the pool. It is a mystery why pg doesn’t get any events about broken connections during the sleep.
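
For Lambda specifically, a rough sketch of using that reset when a broken connection surfaces (this assumes knex is the instance created at container start, and reuses the destroy()/init() calls suggested above; whether init() is available under that name depends on the knex version):

// retry helper: reset the pool once if a query hits a dead connection
async function queryWithReset(runQuery) {
  try {
    return await runQuery();
  } catch (err) {
    if (!/Connection terminated unexpectedly/.test(err.message)) throw err;
    await knex.destroy(); // drop every pooled connection
    knex.init();          // re-create the pool with the original config
    return runQuery();    // retry once against fresh connections
  }
}

// usage inside a handler:
// const rows = await queryWithReset(() => knex('events').select('*'));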

We might also add an option to pool to force reaping cycle to run, which could remove all the old connections from pool: https://github.com/Vincit/tarn.js/issues/56

After digging into this more, it seems it is not possible to use prepared statements with any kind of transaction pooling on Postgres. The two main types of pooling are session pooling and transaction pooling. Transaction pooling is generally the most performant, and it is what RDS Proxy uses. node-postgres (pg) does not seem compatible with any kind of transaction pooling when using prepared statements, and I believe Knex always uses prepared statements. The discussion I’ve started is probably better moved over to the pg repo. I’ll leave some links here:

@elhigu : when I said “query”, I meant “query a predicate method that returns synchronously”. I’m actually in the process of implementing an experimental branch in knex for this. If the experiment works, then the pg team can integrate the concept / expose a public API for it.

(Unfortunately, I’m encountering a minor roadblock at the moment. It appears that a handful of Postgres unit tests are attempting to connect to the wrong port. I’m not sure why this fix would suddenly uncover that issue)

I was able to stop ‘Connection terminated unexpectedly’ on Lambda Node.js 12.x/Knex 0.19.5/Aurora PostgreSQL by adding these knex constructor options:

acquireConnectionTimeout: 5000,
pool: {
  min: 0,
  max: 10,
  createTimeoutMillis: 8000,
  acquireTimeoutMillis: 8000,
  idleTimeoutMillis: 8000,
  reapIntervalMillis: 1000,
  createRetryIntervalMillis: 100,
  propagateCreateError: false
}
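
For completeness, those options sit partly at the top level and partly under pool; a sketch of the full constructor call (the connection string is a placeholder):

const knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder
  acquireConnectionTimeout: 5000,       // knex-level cap on waiting for a connection
  pool: {
    min: 0,                      // allow the pool to drain while the Lambda is frozen
    max: 10,
    createTimeoutMillis: 8000,
    acquireTimeoutMillis: 8000,
    idleTimeoutMillis: 8000,     // evict idle connections quickly
    reapIntervalMillis: 1000,    // run the reaper every second
    createRetryIntervalMillis: 100,
    propagateCreateError: false, // but see the later comment about removing this option
  },
});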

@EmilIvanov @elhigu

Forgot to update: yes, setting idleTimeoutMillis lower than the NLB timeout, combined with min: 0, works on the newer version of Knex.

My problem was that I was using a very old Knex that still uses generic-pool instead of tarn, and idleTimeoutMillis actually belongs to tarn. So I just updated my Knex, and it works.
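
For reference, a sketch of that combination (the 350-second figure is the usual NLB idle timeout; adjust it to whatever proxy or load balancer sits in front of your database, and the connection string is a placeholder):

const knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder
  pool: {
    min: 0,                    // let the pool drop to zero while the function is idle
    idleTimeoutMillis: 300000, // 5 minutes, comfortably below the ~350 s NLB idle timeout
  },
});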

OK, removing propagateCreateError brought the Connection terminated unexpectedly errors back in Lambda production. My hacky fix is to do a test query on every Lambda invocation:

const knexTestQuery = async () => {
  try {
    // do a test query to make sure the knex connection pool is ready to go after lambda unfreeze
    await knex('events').select('*').first();
  } catch (err) {
    // log but otherwise ignore errors here
    logger.warn('test query error', err.message);
  }
};

I believe the issue is that when the Lambda goes from the frozen to the thawed state and then tries to do a query, it has a chance of picking up a bad client from the pg pool (note: the knex instance is only created at Lambda function allocation time, but a function may be frozen and thawed multiple times in the lifecycle of a function instance). Doing this test query will catch any bad clients and cause the pool to heal itself before my actual query runs. I have the same issue with ioredis, but ioredis provides APIs for connecting and disconnecting, so on every Lambda invocation I start with redis.connect() and at the end I do redis.disconnect(). I couldn’t figure out how to disconnect and reconnect a knex instance without destroying and re-creating it, so I’m doing the hacky test-query method instead.
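
In handler form, the workaround looks roughly like this (the table name and response shape are placeholders; knex and knexTestQuery are the ones defined above):

exports.handler = async (event) => {
  // warm-up probe: flushes any clients that died while the container was frozen
  await knexTestQuery();

  // the real query now runs against a pool that has already healed itself
  const rows = await knex('events').select('*');
  return { statusCode: 200, body: JSON.stringify(rows) };
};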

@briandamaged not yet. I need to push some “volume” through the lambda in question in order to see if the issue is triggered. I have to run some testing tomorrow that should do the trick. Apologies for the delay.

If anyone else wants to give it a shot, it would be great to have some additional confirmation on whether it does or does not work 👍

Yes! I should be back at my computer in a few hours.

🙏

On Feb 27, 2020, 10:50 AM -0700, Brian Lauber wrote:

@jamesdixon: I’ve created an experimental Knex build here: https://github.com/briandamaged/knex/tree/experiment/aws-disconnections Would it be possible for you to try to recreate the issue using this build? If it works, then this should help us figure out a more permanent solution. Thx! cc: @brianc @elhigu

@elhigu I’m not sure what you’re referring to specifically? There has definitely been a behavior change between Node 8 and Node 10 / 12 even if the underlying internals in Knex/pg have remained the same.

I mean that I remembered there have been unresolved reports of knex not working properly with AWS Lambda before this issue. But now that I searched for them, it looks like they were about transactions hanging: https://github.com/knex/knex/issues/2445

So I stand corrected. It might not be related to this issue after all.

@briandamaged

@elhigu : It has it built-in, but the base implementation still relies upon certain Events being fired. These events attach a __knex__disposed field to the connection object to indicate that the connection has been closed. This field is then checked by the pool before handing off the connection.

Correct. That’s why I suggested that adding a delay before the query is made could give the Node event loop time to recognize that the connection has already been ended / is broken while it is still in the pool.

Though I don’t know how Node’s TCP socket recognizes this. If it happens via the TCP socket keepalive feature, it looks like the delay would need to be huge before it noticed: https://stackoverflow.com/questions/18614176/why-nodejs-keepalive-does-not-seem-to-work-as-expected

So I suppose, if this works with Node 8, there must also be some other way to notice it.

So, I believe this is the reason that some end-users are still seeing this issue in knex. If the Events are not fired/handled quickly enough, then a connection can slip through the cracks. Fortunately, knex also provides a hook so that each Dialect can add its own connection validation logic. For Client_PG, this method currently just returns true. If the pg team adds a public method for checking the state of a Client, then we can update the Client_PG’s hook to query this state. Alternatively, we could have knex open the hood on the pg Client object and figure things out on its own. But, this will couple knex to the current pg implementation, which will be problematic in the future.
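
To illustrate the kind of check being discussed, here is a rough sketch of what a PG-specific validation hook might look like, assuming the hook mentioned above is the client’s validateConnection method. The _ending/_connectionError flags and the underlying stream are pg internals, not a public API, which is exactly the coupling concern raised above, and the require path may vary by knex version:

const Client_PG = require('knex/lib/dialects/postgres'); // internal path; may differ by version

// hypothetical subclass – peeks at pg internals to reject obviously dead clients
class Client_PG_Validating extends Client_PG {
  validateConnection(connection) {
    if (connection._ending || connection._connectionError) return false; // pg internal flags
    const stream = connection.connection && connection.connection.stream;
    if (!stream || stream.destroyed || !stream.writable) return false;   // socket is gone
    return true;
  }
}

// usage: pass the subclass as the client
// const knex = require('knex')({ client: Client_PG_Validating, connection: /* ... */ });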

@brianc has always been really helpful with providing the necessary APIs to be able to prevent that. If the problem really is recognizable inside the pg driver, I’m sure we will get some official way to check it 😃

I just wish there were a way to check that from the connection object, without needing to issue an additional query. Adding a dummy query before getting a connection from the pool would not be suitable default behaviour since it adds too much overhead, though that could be overridden in the pool configuration when knex is used with Lambda.

Also, Tarn.js doesn’t currently support async validate functions (I started writing support for it, but it is not complete yet; it required pretty big changes to make that possible).

Wow – GitHub has certainly been on the struggle buggy lately! HTTP 500 responses all over the place!

Anyhoo – the underlying issue seems to be that both the knex and pg pool implementations are relying upon Events to detect/evict closed connections. But, if these Events are not handled in time, then the pools can hand off closed connections.

@kibertoad mentioned that there were changes to the EventEmitter class between node 8 and node 10. I don’t know the details of these changes, but they could potentially impact the order in which Events get processed. If so, then this could also explain why some end-users are not observing the issue in node 8.

Another quick update on this front:

Currently, it’s looking like both "pg" and "knex" have the same bug within their respective pool implementations. Specifically: neither pool implementation appears to be re-validating the connection before handing it off to the caller. This provides an opportunity for connections to be closed silently while they are sitting idle within the pool.

So, some near-term steps:

  1. The "pg" team is currently trying to verify a proposed fix. If that works, then
  2. They’ll refactor the fix and make some of the details available in a public-facing API. (ie: rather than having to check the internal details of some deeply-nested objects)
  3. Afterwards, "knex" can be upgraded to leverage this public-facing API and fix the issue within its PG connection pooling.

Howdy! I’ll try to investigate this in the next day or so. (Sry – I’m currently assisting w/ a large merge request. I want to get that resolved before making additional changes to the code)