knex: Connection terminated unexpectedly (Postgresql / AWS Lambda)
Environment
- Knex version: 0.20.8
- Database + version: Postgres 9.5.14
- OS: AWS Lambda - Node 12.x
Recently, I’ve been seeing upticks in a `connection terminated unexpectedly` error. I can’t find a discernible pattern at this point, but unlike #3523, it doesn’t seem to require a long idle period; I see instances that are minutes apart in some cases.
I added some logging to the Knex `afterCreate` method and am seeing two errors:
- The `connection terminated unexpectedly` error
- `Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed` - this seems to happen on the first or second invocation after seeing the `connection terminated unexpectedly` error
I’ve tried a number of configurations with Knex pooling:
- `{ pool: { min: 1, max: 1 } }`
- `{ pool: { min: 0 } }`
- `{ pool: { min: 1 } }`
I’ve also tried setting { pool: { idleTimeoutMillis: 300000 } } as per #3523.
Additional information:
- I’m defining my Knex instance outside of my AWS handlers, so the pool should be reused per container.
- I’m using the following for my `afterCreate` function – I’m not sure if there’s a better way to get at the underlying error?
// Pool hook passed to knex as `pool.afterCreate`; `connection` here is the
// underlying pg client created for the pool.
function afterCreate(connection, done) {
  // Log any error emitted by the connection so we can see what actually kills it.
  connection.on('error', (err) => {
    console.log('POSTGRES CONNECTION ERROR');
    console.error(err);
  });
  done();
}
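For context, here is roughly how that hook and the pool settings fit together in my setup. This is a sketch rather than my exact file: the knex instance lives at module scope so each Lambda container reuses the same pool, and the connection values / env var names are placeholders.

```js
// db.js - created once per Lambda container (module scope), reused across
// invocations. Env var names are placeholders for illustration.
const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
  },
  pool: {
    min: 1,
    max: 1,
    idleTimeoutMillis: 300000, // one of the settings tried, per #3523
    afterCreate,               // the hook shown above
  },
});

module.exports = knex;
```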
Appreciate any help! This is driving me nuts 😃
UPDATE: the only change that I’ve made is migrating the lambda function to Node 12 from Node 8 as AWS is removing support for 8. Unfortunately, it doesn’t look like I can go back. My other functions in production are on Node 8 and not seeing any issues 🤔🤔🤔
ISSUE FOUND: for whatever reason, reverting back to Node 8.x fixes the issue. Still not certain of the underlying cause.
Commits related to this issue
- Work-around for connection disposal bug in pooling logic... ... See this conversation thread for more info: https://github.com/knex/knex/issues/3636#issuecomment-592005391 — committed to briandamaged/knex by briandamaged 4 years ago
- Mark missing test cases as "pending" rather than "passed" (#3695) * Missing test cases are marked as "pending" rather than "passed" * Work-around for connection disposal bug in pooling logic... S... — committed to knex/knex by briandamaged 4 years ago
Any solution? This is urgent for us. Thanks.
should drop old connections and reinitialize the pool. It is a mystery why `pg` doesn’t get any events about the broken connections during the sleep.

We might also add an option to the pool to force a reaping cycle to run, which could remove all the old connections from the pool: https://github.com/Vincit/tarn.js/issues/56
After digging into this more, it does not seem possible to use prepared statements with any kind of transaction pooling on Postgres. The two main types of pooling are session pooling and transaction pooling; transaction pooling is generally the more performant of the two, and it is what RDS Proxy uses.
`node-postgres` / `pg` does not seem compatible with any kind of transaction pooling if using prepared statements, and I believe Knex always uses prepared statements. This discussion that I’ve started is probably better moved over to the pg repo. I’ll leave some links here:

@elhigu: when I said “query”, I meant “query a predicate method that returns synchronously”. I’m actually in the process of implementing an experimental branch in `knex` for this. If the experiment works, then the `pg` team can integrate the concept / expose a public API for it.

(Unfortunately, I’m encountering a minor roadblock at the moment. It appears that a handful of Postgres unit tests are attempting to connect to the wrong port. I’m not sure why this fix would suddenly uncover that issue.)
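For anyone following along, here is a minimal `pg`-level sketch of the distinction being discussed, assuming a database reachable through the usual `PG*` environment variables. An unnamed parameterized query is parsed and discarded per statement, while a *named* prepared statement stays attached to the backend session that prepared it, which is the part that transaction-pooling proxies generally cannot route.

```js
const { Client } = require('pg');

async function demo() {
  const client = new Client(); // picks up PGHOST, PGUSER, PGDATABASE, etc.
  await client.connect();

  // Unnamed parameterized query: prepared and discarded per statement.
  await client.query({ text: 'SELECT $1::int AS n', values: [1] });

  // Named prepared statement: parsed once, reusable by name, but tied to the
  // specific backend session that prepared it.
  await client.query({ name: 'pick_n', text: 'SELECT $1::int AS n', values: [2] });
  await client.query({ name: 'pick_n', values: [3] }); // reuse by name

  await client.end();
}

demo().catch(console.error);
```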
I was able to stop ‘Connection terminated unexpectedly’ on Lambda Node.js 12.x / Knex 0.19.5 / Aurora PostgreSQL by adding these `knex` constructor options:

@EmilIvanov @elhigu
Forgot to update: yes, setting `idleTimeoutMillis` < NLB timeout plus `min: 0` works on the newer version of Knex. My problem was that I was using a very old Knex which still used `generic-pool` instead of `tarn`, and `idleTimeoutMillis` actually belongs to `tarn`. So I just updated my Knex, and it works.

OK, removing `propagateCreateError` brought the `Connection terminated unexpectedly` errors back in Lambda production. My hacky fix is to do a test query on every lambda invocation (a sketch of that is below).

I believe the issue is that when a lambda goes from a frozen to a thawed state and then tries to do a query, it has a chance of picking up a bad client from the pg pool (note: the knex instance is only created at lambda function allocation time, but a function may be frozen and thawed multiple times in the lifecycle of a function instance). Doing this test query will catch any bad clients and cause the pool to heal itself before my actual query runs. I have the same issue with `ioredis`, but `ioredis` provides APIs for connecting and disconnecting, so on every lambda invocation I start with `redis.connect()` and at the end I do `redis.disconnect()`. I couldn’t figure out how to disconnect and reconnect a knex instance without destroying and re-creating it, so I’m doing the hacky test query method instead.

@elhigu thanks, I got the advice from this blog post: https://deniapps.com/blog/setup-aws-lambda-to-use-amazon-rds-proxy
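Here is roughly what that probe-query workaround looks like in a handler. It is only a sketch: the module path and table name are placeholders, and swallowing the probe’s error is my own choice, on the theory that the real query then gets a fresh client.

```js
// handler.js - run a cheap probe query per invocation so a connection that
// died while the container was frozen gets evicted before the real work runs.
const knex = require('./db'); // module-level knex instance (placeholder path)

exports.handler = async (event) => {
  try {
    // If the pool hands back a dead client, this query takes the hit and the
    // pool should dispose of that client.
    await knex.raw('SELECT 1');
  } catch (err) {
    console.warn('probe query failed; expecting the pool to recycle the connection', err);
  }

  // Actual work for the invocation (placeholder query).
  const rows = await knex('users').select('id').limit(10);
  return { statusCode: 200, body: JSON.stringify(rows) };
};
```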
@briandamaged not yet. I need to push some “volume” through the lambda in question in order to see if the issue is triggered. I have to run some testing tomorrow that should do the trick. Apologies for the delay.
If anyone else wants to give it a shot, it would be great to have some additional confirmation on whether it does or does not work 👍
Yes! I should be back at my computer in a few hours. 🙏
I mean that I remembered there have been unresolved reports of knex not working properly with AWS Lambda before this issue. But now that I’ve searched for them, it looks like they were about transactions hanging: https://github.com/knex/knex/issues/2445
So I stand corrected. It might not be related to this issue after all.
@briandamaged
Correct. That’s why I suggested that adding a delay before the query is made could give the node event loop time to recognize that the connection has already ended / is broken while it is still sitting in the pool.

Though I don’t know how node’s TCP socket recognizes this. If it happens via the TCP socket’s keepalive feature, it looks like the delay would need to be huge before it noticed anything: https://stackoverflow.com/questions/18614176/why-nodejs-keepalive-does-not-seem-to-work-as-expected

So I suppose, if this works with Node 8, there must also be some other way to notice it.
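For what it’s worth, recent versions of `pg` expose client-level keepalive options, and knex passes the connection object through to the pg client, so enabling it looks something like the sketch below. Whether the probes fire soon enough to matter for a thawed Lambda is exactly the open question above, so treat this as an experiment rather than a fix; the connection settings are placeholders.

```js
const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,          // placeholder settings
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
    keepAlive: true,                     // enable TCP keepalive on the socket
    keepAliveInitialDelayMillis: 10000,  // start probing after 10s of idle
  },
});
```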
@brianc has always been really helpful with providing the necessary APIs to prevent this kind of thing. If the problem really is recognizable inside the pg driver, I’m sure we’ll get some official way to check for it 😃

I just wish there were a way to check that from the connection object, without needing to issue an additional query. Adding a dummy query before handing a connection out of the pool would not be suitable default behaviour, since it adds too much overhead. Though that could be overridden in the pool configuration when knex is used with Lambda.

Also, Tarn.js doesn’t currently support async validate functions (I started writing support for it, but it is not complete yet; it needed pretty big changes to make that possible).
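To make the “check the connection object without a query” idea concrete, here is a standalone tarn sketch using its synchronous `validate` option. The socket inspection (`client.connection.stream`) relies on node-postgres internals, so it is an assumption for illustration only, not a supported API.

```js
const { Pool } = require('tarn');
const { Client } = require('pg');

const pool = new Pool({
  create: async () => {
    const client = new Client(); // PG* env vars assumed
    await client.connect();
    return client;
  },
  // Synchronous validate (tarn does not support async validate yet): look at
  // the client's underlying socket instead of issuing a query.
  validate: (client) => {
    const stream = client.connection && client.connection.stream; // pg internal
    return Boolean(stream && !stream.destroyed && stream.writable);
  },
  destroy: (client) => {
    client.end();
  },
  min: 0,
  max: 1,
  idleTimeoutMillis: 30000,
});

// Usage: acquire, run the work, always release back to the pool.
async function withClient(work) {
  const client = await pool.acquire().promise;
  try {
    return await work(client);
  } finally {
    pool.release(client);
  }
}
```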
Wow – GitHub has certainly been on the struggle buggy lately! HTTP 500 responses all over the place!
Anyhoo – the underlying issue seems to be that both the `knex` and `pg` pool implementations are relying upon Events to detect/evict closed connections. But, if these Events are not handled in time, then the pools can hand off closed connections.

@kibertoad mentioned that there were changes to the `EventEmitter` class between Node 8 and Node 10. I don’t know the details of these changes, but they could potentially impact the order in which Events get processed. If so, then this could also explain why some end-users are not observing the issue in Node 8.
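To make the Events point concrete, here is a rough illustration of the detection side. The `__brokenBecause` flag is a name made up for this sketch; the actual workaround lives in the commit linked earlier in the thread.

```js
// Hypothetical sketch: remember *why* a pooled pg client broke so that a
// later check can evict it instead of handing it to a caller.
function afterCreate(connection, done) {
  connection.on('error', (err) => {
    connection.__brokenBecause = err;
  });
  connection.on('end', () => {
    connection.__brokenBecause = new Error('connection ended');
  });
  done();
}

// If the 'error' / 'end' event has not been processed by the time the pool
// checks, the dead client still looks healthy - which is the race described above.
function looksHealthy(connection) {
  return !connection.__brokenBecause;
}
```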
Another quick update on this front:

Currently, it’s looking like both `pg` and `knex` have the same bug within their respective pool implementations. Specifically: neither pool implementation appears to be re-validating the connection before handing it off to the caller. This provides an opportunity for connections to be closed silently while they are sitting idle within the pool.

So, some near-term steps:
"pg"team is currently trying to verify a proposed fix. If that works, then"knex"can be upgraded to leverage this public-facing API and fix the issue within its PG connection pooling.Howdy! I’ll try to investigate this in the next day or so. (Sry – I’m currently assisting w/ a large merge request. I want to get that resolved before making additional changes to the code)