knex: Bug: Connection terminated unexpectedly (postgres)

Environment

Knex version: 0.20.1
Database + version: postgres:11.5
OS: Alpine Linux

Additional information:

  • Database: I’ve confirmed the bug to happen on postgres versions 9.5 and newer. The database is running in a container, currently postgres:11.5.
  • Knex environment: Knex is currently running on node v12, though I’ve confirmed the bug to happen on v8 and v10 too. It’s running in a container, currently node:12.13.0-alpine. The current pg version is 7.12.1; though I’ve seen this happen with previous versions, I haven’t documented this bit.
  • Knex configuration object:
({
  client: 'pg',
  connection: 'postgres://user:password@server:5432/db',
  pool: { min: 2, max: 10 }
})

Bug

I am getting a Connection terminated unexpectedly error in production after long periods of inactivity (think 1+ hours) during which no queries are made, causing the first query after that period to fail. Any query made after that first failing query will succeed.

This was first reported in issue #3258. In that issue, it was reported on knex 0.16.5, so this has been going on at least since then. It was also stated that downgrading to 0.13.0 fixes the issue, which was met with the suggestion that maybe newer knex versions simply log an error that previous ones didn’t. I’ve confirmed this is not the case: downgrading to knex 0.13.0 does in fact fix the issue, with queries succeeding after long periods of inactivity.

A typical error message will look like this:

  • message: select "user".* from "user" where "email" ilike $1 - Connection terminated unexpectedly
  • stack:
    Error: Connection terminated unexpectedly
        at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:252:9)
        at Object.onceWrapper (events.js:299:28)
        at Connection.emit (events.js:210:5)
        at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:77:10)
        at Socket.emit (events.js:210:5)
        at TCP.<anonymous> (net.js:658:12)

node-postgres #1324 might help provide additional insight on the matter. #3046 might also be related.

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 47
  • Comments: 61 (13 by maintainers)

Most upvoted comments

I have good news! After further investigating the issue, I believe I can point towards a solution. I’ll leave that for last, as there’s a bit to unpack.

Unfortunately, I haven’t been able to reproduce the error locally, even with the same setup and compose configuration as the remote server. It is quite strange, as the only difference I can see between the two environments is that I’m running docker-compose locally, while the server is running a docker swarm. To reiterate, I’ve set up a repo that reproduces this error on a remote production server, but (somehow) not locally. I will post the logs below (last); for the purpose of understanding them, I’ve made the repo available here. As you can see there, I’m making a first successful query right after the process starts, a second query after 20 minutes (which fails), and a third query immediately after (which succeeds).
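
For reference, here is a minimal sketch of the kind of script the repo runs (the connection string, table name and helper names here are illustrative placeholders, not the repo’s actual contents):

// Repro sketch: query once at startup, again after 20 minutes of idleness
// (this is the one that fails in the affected environment), then once more
// immediately afterwards (which succeeds on a fresh connection).
const knex = require('knex')({
  client: 'pg',
  connection: 'postgres://user:password@server:5432/db',
  pool: { min: 2, max: 10 },
});

const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function attempt(label) {
  try {
    const rows = await knex('user').where('email', 'example@example.com');
    console.log(label, 'Result', rows);
  } catch (err) {
    console.error(label, err.message);
  }
}

(async () => {
  await attempt('1st query:');   // succeeds
  await wait(20 * 60 * 1000);    // stay idle for 20 minutes
  await attempt('2nd query:');   // fails with "Connection terminated unexpectedly"
  await attempt('3rd query:');   // succeeds again
  await knex.destroy();
})();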

Some additional info, just for the sake of future reference:

  • It seems the idle time needed for the connection error to happen after querying is somewhere between 15 and 20 minutes. Queries after 15 minutes of inactivity will succeed, while after 20 they will fail.
  • knex 0.13.0 doesn’t show this issue, while every version after it, including 0.14.0, does. However, on 0.14.0 a new connection won’t be acquired on the next query, so subsequent queries fail indefinitely; this is solved in 0.15.0.

And now, for the useful bit: it turns out pg doesn’t emit the connection error until after the query has been made, but before it has resolved. This is the sequence of events (a rough sketch follows the list):

  1. knex public api query is made.
  2. knex’s dialects/postgres/index.js _query is called with a pg connection, which calls pg’s connection.query.
  3. pg’s connection emits an 'error' event, which is received by connection.on('error', cb) at knex’s dialects/postgres/index.js.
  4. pg’s connection.query calls its callback with the same error that had just been emitted, hence knex’s dialects/postgres/index.js _query rejects with said error.
  5. pg’s connection emits an 'end' event, which is received by connection.on('end', cb) at knex’s dialects/postgres/index.js.
  6. knex releases connection to pool.
  7. The next time knex receives a query to be made, it will acquire a new pg connection (acquireRawConnection) and succeed.
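
To make that sequence concrete, here is a rough model of steps 2 to 5. This is a simplified sketch of the shape described above and in the snippets further down, not the actual knex source:

const { Client } = require('pg'); // pg ~7.x, as in this report

// "acquireRawConnection": create the pg connection and attach listeners that
// mark the connection as broken so pool validation can discard it later.
async function acquireRawConnection(connectionString) {
  const connection = new Client({ connectionString });
  await connection.connect();
  connection.on('error', (err) => {
    connection.__knex__disposed = err;
  });
  connection.on('end', (err) => {
    connection.__knex__disposed = err || 'Connection ended unexpectedly';
  });
  return connection;
}

// "_query": delegate to pg's connection.query. If the server dropped the
// socket while the connection sat idle, pg only notices now: it emits
// 'error'/'end' and invokes this callback with the same error, so the query
// rejects even though the next query (on a fresh connection) will succeed.
function _query(connection, sql, bindings) {
  return new Promise((resolve, reject) => {
    connection.query(sql, bindings, (err, result) => {
      if (err) return reject(err);
      resolve(result);
    });
  });
}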

So essentially, it could be argued that maybe there’s an issue with the containerized production environment/setup, and/or with pg, which does not report a connection termination until the time a query is made, though I don’t know whether reporting it earlier would be feasible. But regardless, there’s a (very much fixable) issue with knex: if a connection errors out in the window between when a query is issued and when it resolves, the query will fail. I would imagine this would be the case not just with postgres, but with other dialects too.

Hence I would suggest that, in this event, a new connection be acquired and the query retried. It might make sense to pass the error through (make the query fail with it) when the connection had been acquired just before the query was made. So, to recap, I’d propose that if a query errors out and the connection was acquired just for that query, it rejects with the error; otherwise, knex checks whether the connection has failed, and if so, acquires a new connection and retries the query. A hypothetical sketch of this logic follows.
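
Something along these lines; the pool API and the freshlyAcquired flag are purely illustrative (not knex internals), and _query refers to the sketch above:

// Retry once only when the connection broke mid-query AND it was not
// acquired specifically for this query; otherwise surface the error as-is.
async function queryWithRetry(pool, connection, freshlyAcquired, sql, bindings) {
  try {
    return await _query(connection, sql, bindings);
  } catch (err) {
    const connectionBroke = Boolean(connection.__knex__disposed);
    if (!connectionBroke || freshlyAcquired) {
      throw err; // genuine query error, or the connection was brand new
    }
    console.warn('Connection broke mid-query, retrying once:', err.message);
    pool.release(connection);            // broken connection gets discarded on validation
    const fresh = await pool.acquire();  // acquire a replacement
    try {
      return await _query(fresh, sql, bindings);
    } finally {
      pool.release(fresh);
    }
  }
}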

Last, the logs. These are for v0.20.1, but I have logs for every minor version starting on 0.13.0.

  • First query, right after the process starts:
2019-11-09T09:22:02.057558944Z Elapsed: 0 ms
2019-11-09T09:22:02.058104661Z 2019-11-09T09:22:02.057Z knex:client acquired connection from pool: __knexUid1
2019-11-09T09:22:02.059072595Z {
2019-11-09T09:22:02.059140593Z method: 'select',
2019-11-09T09:22:02.059150803Z options: {},
2019-11-09T09:22:02.059156884Z timeout: false,
2019-11-09T09:22:02.059162835Z cancelOnTimeout: false,
2019-11-09T09:22:02.059169678Z bindings: [ 'example@example.com' ],
2019-11-09T09:22:02.059176932Z __knexQueryUid: '0ab6fbe6-f4d0-497a-87b0-057033964950',
2019-11-09T09:22:02.059183233Z sql: 'select * from "user" where "email" = ?'
2019-11-09T09:22:02.059191860Z }
2019-11-09T09:22:02.059483679Z 2019-11-09T09:22:02.059Z knex:query select * from "user" where "email" = ? trx2
2019-11-09T09:22:02.059543883Z 2019-11-09T09:22:02.059Z knex:bindings [ 'example@example.com' ] trx2
2019-11-09T09:22:02.063050947Z 2019-11-09T09:22:02.062Z knex:client releasing connection to pool: __knexUid1
2019-11-09T09:22:02.063380849Z Result []
  • Second query, after 20 minutes:
2019-11-09T09:42:02.067012525Z Elapsed: 1200000 ms
2019-11-09T09:42:02.071066942Z 2019-11-09T09:42:02.069Z knex:client acquired connection from pool: __knexUid1
2019-11-09T09:42:02.071878431Z {
2019-11-09T09:42:02.071909690Z method: 'select',
2019-11-09T09:42:02.071916663Z options: {},
2019-11-09T09:42:02.071921983Z timeout: false,
2019-11-09T09:42:02.071926913Z cancelOnTimeout: false,
2019-11-09T09:42:02.071931722Z bindings: [ 'example@example.com' ],
2019-11-09T09:42:02.071936431Z __knexQueryUid: '229e5c4a-9ed3-4d82-a25e-b048e7cb9c0d',
2019-11-09T09:42:02.071941269Z sql: 'select * from "user" where "email" = ?'
2019-11-09T09:42:02.071946579Z }
2019-11-09T09:42:02.072055515Z 2019-11-09T09:42:02.071Z knex:query select * from "user" where "email" = ? trx2
2019-11-09T09:42:02.072125507Z 2019-11-09T09:42:02.072Z knex:bindings [ 'example@example.com' ] trx2
2019-11-09T09:42:02.081042312Z 2019-11-09T09:42:02.080Z knex:client releasing connection to pool: __knexUid1
2019-11-09T09:42:02.082578957Z Error: Connection terminated unexpectedly
2019-11-09T09:42:02.082619874Z at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:252:9)
2019-11-09T09:42:02.082627337Z at Object.onceWrapper (events.js:299:28)
2019-11-09T09:42:02.082633188Z at Connection.emit (events.js:210:5)
2019-11-09T09:42:02.082638899Z at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:77:10)
2019-11-09T09:42:02.082645932Z at Socket.emit (events.js:210:5)
2019-11-09T09:42:02.082650612Z at TCP.<anonymous> (net.js:658:12)
  • Third query, right after the 2nd:
2019-11-09T09:42:02.084190783Z Elapsed: 0 ms
2019-11-09T09:42:02.084868650Z Connection Error: Connection ended unexpectedly
2019-11-09T09:42:02.097410334Z 2019-11-09T09:42:02.097Z knex:client acquired connection from pool: __knexUid3
2019-11-09T09:42:02.098051381Z {
2019-11-09T09:42:02.098071179Z method: 'select',
2019-11-09T09:42:02.098076118Z options: {},
2019-11-09T09:42:02.098080576Z timeout: false,
2019-11-09T09:42:02.098085165Z cancelOnTimeout: false,
2019-11-09T09:42:02.098089503Z bindings: [ 'example@example.com' ],
2019-11-09T09:42:02.098094572Z __knexQueryUid: '6e702fd9-b0d5-4c8e-9ee7-0b6404fec6d6',
2019-11-09T09:42:02.098099522Z sql: 'select * from "user" where "email" = ?'
2019-11-09T09:42:02.098104000Z }
2019-11-09T09:42:02.098367957Z 2019-11-09T09:42:02.098Z knex:query select * from "user" where "email" = ? undefined
2019-11-09T09:42:02.098583893Z 2019-11-09T09:42:02.098Z knex:bindings [ 'example@example.com' ] undefined
2019-11-09T09:42:02.103675485Z 2019-11-09T09:42:02.103Z knex:client releasing connection to pool: __knexUid3
2019-11-09T09:42:02.104059078Z Result []

I got this problem today.

If I use pooling and let the app sit for a couple of hours, then make a request, I get “Connection Error: Connection ended unexpectedly”. After that first failing request, the second request works.

@lynxtaa I changed my pool settings to the following and the error has not occurred in a couple days, so I guess this fixes the problem. I’ll comment again if it resurfaces. Thanks.

pool: {
    min: 0,
    max: 5,
    acquireTimeoutMillis: 60000,
    idleTimeoutMillis: 600000,
}

Any update on this? I see this issue has been open for a year.

I’m facing the same problem while running a Node.js app in Google Cloud Run.

Queries are failing in production when the service has been left idle for a long time. The error message is

select ... - Connection terminated unexpectedly

Here are my connection pool settings for Knex.

pool: {
    min: 1,
    max: 5,
    acquireTimeoutMillis: 60000,
    idleTimeoutMillis: 600000,
}

I think this might be the same issue: https://github.com/knex/knex/issues/5186

Very much still an issue; it seems to be happening on all M1 Macs, and the suggested workarounds aren’t resolving it.

@rizkyramadhan

I might have found something; please correct me if I’m wrong, because this is my first time digging into the code:

In the acquireRawConnection() method of the postgres/index.js dialect, we register an event handler for the “end” event that I think is incorrect, because we mark the connection as disposed with an error that never exists; I think this causes the “idle” error we currently see (correct me here if I’m wrong and I’m not taking into account other use cases).

dialects/postgres/index.js, line 78

connection.on('end', (err) => {
    connection.__knex__disposed = err || 'Connection ended unexpectedly';
});

So, with that piece of code in mind, and given this snippet from the documentation

pg.Client documentation (https://node-postgres.com/api/client)

client.on('end') => void
//When the client disconnects from the PostgreSQL server it will emit an end event once.

We see that Connection ended unexpectedly will always be the message, given that client.on('end') does not pass any error or callback. In our knex code (first snippet), we expect an err parameter that never exists, so the error message we all see always comes from this place: since err never exists in the first place, it will always fall back to Connection ended unexpectedly.
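
A quick way to check that (this assumes pg ~7.x and a database reachable through a hypothetical DATABASE_URL environment variable):

const { Client } = require('pg');

const client = new Client({ connectionString: process.env.DATABASE_URL });

client.on('end', (err) => {
  // pg emits 'end' with no arguments, so err is always undefined here,
  // which is why knex's handler always falls back to the string.
  console.log('end event argument:', err); // -> undefined
});

client.connect()
  .then(() => client.end()); // even a clean disconnect emits 'end' with no arguments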

Having all of that in mind, and reading the pg docs, this event is emitted only when the client disconnects. What could be happening here is that we leave the connection “open” in the pool while it is closed on the Postgres side, so when we try to communicate over that connection using the pg driver we receive that error.

Aside from this, we should also extend the connection validation method inside the pool to check whether the connection still exists and is valid. As far as I saw while digging through the code (correct me if this is on purpose), the validate method for the pool config inside client.js is quite minimal: it doesn’t use the connection object to check whether it’s valid.

[screenshot: the default pool validate method in client.js]

Maybe the “end” event handling commented on in the first part of this comment, in conjunction with extending validate to something like the following, could do the job:

validateConnection(connection) {
  return connection && connection.connected;
}

Currently, I’m trying to replicate it using the repository posted by @rafamel, but without luck (I’ve waited with idle connections for up to 20 minutes but nothing happens) (see https://github.com/knex/knex/issues/3523#issuecomment-864632061)

Edit motive: multiple typos (English is not my first language) and snippet styling to javascript

I’m in the same boat as everyone here: the first request fails, but after that request everything works as expected.

It seems like min: 0 gives us a solution, given that tarn.js will remove every idle instance: the minimum number of instances kept in the pool is determined by the minKeep and maxDestroy values computed inside the check() method on the Pool object (https://github.com/Vincit/tarn.js/blob/master/src/Pool.ts#L247).

So, passing 0 means the Pool object won’t retain instances and you won’t have idle connections. This is more of a band-aid than an actual solution, but it works in the end, since the check() function in tarn.js’s Pool uses this minimum size to decide how many instances to keep alive, and with a minimum size of 0 it keeps none.

I’ve solved it by setting the minimum pool size to zero: pool: { min: 0, max: 10 }

It might be that I didn’t look carefully enough, but I think it would be very good if this information regarding the handling of broken vs. errored connections had a prominent place in knex’s documentation, so devs know what to expect. On a side note, I also couldn’t locate the default values for the configuration options in the docs; having those would also be helpful.

Certainly that would be good to have 👍 Pull requests for that would be welcome too 😃

Hi, thank you for taking the time 😉 In general, I can see where you’re coming from, however:

Actually, a connection error (or any other surprise error) might happen at any time, and the app should be written to handle unexpected errors causing queries to fail. Automatic retrying will not resolve it, except in some very specific cases where the connection error gets fixed automatically within some time frame and transactions are not used.

I’m failing to get the point here. When a connection error occurs, knex will discard it and acquire a new one, hence there’s an expectation that connection errors (not other kinds of query errors, but connection errors specifically) will be handled by knex when it’s possible to reconnect. This case fits there, with the only difference that the connection happens to emit the error after the query has been made (but before it’s been resolved). Hence my proposal to log the connection error, discard the query error, get a new connection, and retry the query: not for every case, but only for this particular case in which the connection has emitted an error after querying but before resolving. This seems the most consistent to me with the current behavior otherwise, given the above expectation.

Retrying would also cause other problems, like a delay before the app is able to notice that your DB connection is down (and after noticing, you would still have to have that error management for an unexpectedly failed query).

In most cases, I’d assume the dialect’s connection will emit a connection error outside an ongoing query, in which case the current behavior applies, so no harm done. As pointed out, this would only apply when the connection error is emitted during a query; hence, both because the number of times that would happen by random chance is statistically very low, and because the delay would be realistically unnoticeable (it’s not something that accumulates, but rather a singularly occurring event), I’d say the downside would be more than justified.

One can probably work around this issue by setting the minimum pool size to 0 and the idle timeout to something like 5 minutes. That way connections will be discarded from the pool before that happens.

It might be sensible to set defaults that avoid this issue or, at least, document it in some way. But regardless, I am somewhat confused by the current defaults. I’m assuming that when you say set idle timeout to be like 5 minutes or so you’re referring to tarn.js’s idleTimeoutMillis, to be set in the pool object within knex’s configuration. Unless knex is passing a different default value (is it?), idleTimeoutMillis seems to default to 30000 ms (30 seconds), which should have avoided this error altogether. Any insight on this?

Thank you very much. Those findings are really useful. Looks like things are happening as I expected: the error is either not happening, or not being registered, before the connection is taken from the pool and data is actually sent over it.

One can probably work around this issue by setting the minimum pool size to 0 and the idle timeout to something like 5 minutes. That way connections will be discarded from the pool before that happens.

But regardless, there’s a (very much fixable) issue with knex: if a connection errors out in the window between when a query is issued and when it resolves, the query will fail.

Actually, a connection error (or any other surprise error) might happen at any time, and the app should be written to handle unexpected errors causing queries to fail. Automatic retrying will not resolve it, except in some very specific cases where the connection error gets fixed automatically within some time frame and transactions are not used.

So knex is intentionally not trying to retry failed queries automatically. Retrying would also cause other problems, like a delay before the app is able to notice that your DB connection is down (and after noticing, you would still have to have that error management for an unexpectedly failed query).

Still an issue. I suggest that the docs be changed to reflect the suggestions given above.

My stack trace:

{"level":50,"time":1614497262294,"pid":1,"hostname":"843348d00255","reqId":194,"errno":-110,"code":"ETIMEDOUT","syscall":"read","stack":"Error: read ETIMEDOUT\n    at TCP.onStreamRead (internal/stream_base_commons.js:209:20)","type":"Error","msg":"select...REDACTED_SQL_QUERY) - read ETIMEDOUT"}

@NelsonFrancisco I’ve set the min pool size to 0 and did a test again this morning; it seems to work. I don’t see that error message in the log anymore.

I have explained why knex does not retry on connection errors or on any other errors so many times to so many people that I should probably just find a link to some older discussion 😬

I’m failing to get the point here. When a connection error occurs, knex will discard it and acquire a new one, hence there’s an expectation that connection errors (not other kinds of query errors, but connection errors specifically) will be handled by knex when it’s possible to reconnect.

If there is a connection error while the connection is in the pool, we know it is not used by anyone, so it is owned by knex and we can do whatever we want with it (actually we don’t discard it until someone tries to get it from the pool and the validation phase sees that the connection was broken while it was in the pool).

Actually, now that I think of it, if I recall correctly even the pool does not retry on connection errors. So if a connection broke while it was in the pool, it will be discarded and a fresh connection returned instead; but if that fresh connection fails because of a connection error, it will not be retried, and the error is passed to the user. This is for the same reason: we don’t want knex to hide connection errors which happen during queries.

Hence my proposal to log the connection error, discard the query error, get a new connection, and retry the query: not for every case, but only for this particular case in which the connection has emitted an error after querying but before resolving. This seems the most consistent to me with the current behavior otherwise, given the above expectation.

It doesn’t make sense to make the code more complicated and add a single retry for this single case. The expectation is that when a connection is obtained from the pool it is usable; if a connection error happens afterwards, then it is passed to the user. We don’t want to mask errors: you can still have them coming at any time, which is not even related to this issue, and then you would need to retry for that case too (because it would be hard to distinguish from the other case). Also, as I mentioned above, you would still need error handling for the case where the connection error lasts longer than just one query, or where it happens during a transaction.

For this particular case I would love to find and fix the root cause of the problem, but until then setting pool min to 0 should do just fine. It is a pretty uncommon case where the database goes unused for hours and you would still want the pool populated with connections during that idle time.

In most cases, I’d assume the dialect’s connection will emit a connection error outside an ongoing query, in which case the current behavior applies, so no harm done. As pointed out, this would only apply when the connection error is emitted during a query; hence, both because the number of times that would happen by random chance is statistically very low, and because the delay would be realistically unnoticeable (it’s not something that accumulates, but rather a singularly occurring event), I’d say the downside would be more than justified.

I really didn’t understand what you tried to say with that. The only benefit I see in having that retrying logic is that it would work as a hack to kind of fix something that does not seem to be a knex bug. Preventing the pool from keeping idle connections around for long periods of time is a much better workaround for this (it happens only in some very specific environments).

Unless knex is passing a different default value (is it?), idleTimeoutMillis seems to default to 30000 ms (30 seconds), which should have avoided this error altogether. Any insight on this?

Pool min size 0 was the key part of that workaround. If the pool’s min size is bigger than 0, the idle timeout does nothing once the pool has shrunk to its minimum size.
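
Putting the workaround together, a configuration along these lines (the values are illustrative) keeps idle connections from sitting in the pool long enough to go stale:

const knex = require('knex')({
  client: 'pg',
  connection: 'postgres://user:password@server:5432/db',
  pool: {
    min: 0,                    // the key part: let the pool drain completely when idle
    max: 10,
    idleTimeoutMillis: 300000, // discard idle connections after ~5 minutes
  },
});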