pg-boss: Unexpected node shutdown when the database is abruptly shut down
Hi there! We use this awesome library as our jobs provider with Wasp. 😄 We recently had a user notice that when they deleted their DB on Fly.io, the app came crashing down. Note: Wasp does make use of boss.on('error', error => console.error(error)) as recommended.
Strangely, when I tried to reproduce locally by just shutting Postgres down when the app was running with cron jobs, I got the following error but it did not die:
Server(stderr): error: terminating connection due to administrator command
So it seems like it noticed and handled it gracefully.
But then when I deployed it to Fly.io and killed the DB, I got the following (different) error:
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]Error: Connection terminated unexpectedly
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Connection.<anonymous> (/app/server/node_modules/pg/lib/client.js:131:73)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Object.onceWrapper (node:events:627:28)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Connection.emit (node:events:513:28)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Socket.<anonymous> (/app/server/node_modules/pg/lib/connection.js:112:12)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Socket.emit (node:events:525:35)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at endReadableNT (node:internal/streams/readable:1359:12)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] client: Client {
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] _events: [Object: null prototype] { error: [Function (anonymous)] },
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] _eventsCount: 1,
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] _maxListeners: undefined,
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] connectionParameters:
...
This caused the node process to die, and it only happens when pg-boss is used and jobs are enabled (Express-only apps are fine). Is it possible that, even though pg-boss handles the error event (https://github.com/timgit/pg-boss/blob/master/src/db.js#L15) and we do as well, a pg error is somehow escaping and bubbling up to the top level when the DB is killed?
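For context, here is a minimal sketch (not from the original report; the pool setup and connection string are placeholders) of the two separate paths a pg connection error can take, which is one way an error can slip past an 'error' listener like the one in pg-boss's db.js:

```js
const { Pool } = require('pg')

// Illustrative pool; the connection string is a placeholder.
const pool = new Pool({ connectionString: process.env.DATABASE_URL })

// A listener like this catches errors that pg emits on idle clients
// (for example, the backend going away between queries).
pool.on('error', error => console.error('idle client error', error))

// An error that hits a client while a query is in flight surfaces as a
// rejected promise instead, so it must be handled at the call site or it
// escapes the 'error' listener entirely.
pool.query('SELECT 1').catch(error => console.error('query error', error))
```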
Happy to try to do anything needed to narrow in on this, just let me know! Thanks!
About this issue
- State: closed
- Created a year ago
- Comments: 17 (5 by maintainers)
Thanks for the support!
Following is an example of adding retries based on ECONNREFUSED. It’s simple, and given the number of maintenance jobs, it fails pretty fast, so adjust as you think appropriate.
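(The snippet originally attached to this comment was not preserved in this copy of the thread. Below is a hedged reconstruction of the idea; the threshold of 10, the counter reset, and the decision to exit are illustrative assumptions, not the maintainer's exact code.)

```js
const PgBoss = require('pg-boss')

async function main () {
  // Connection string is a placeholder.
  const boss = new PgBoss(process.env.DATABASE_URL)

  let connectionRefusedCount = 0

  boss.on('error', async error => {
    console.error(error)

    if (error.code === 'ECONNREFUSED') {
      connectionRefusedCount++

      // Threshold is arbitrary; with several maintenance jobs retrying,
      // the count grows quickly, so tune it for your deployment.
      if (connectionRefusedCount >= 10) {
        await boss.stop() // stop workers and maintenance before giving up
        process.exit(1)
      }
    } else {
      connectionRefusedCount = 0 // a different kind of error resets the streak
    }
  })

  await boss.start()
}

main().catch(console.error)
```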
Hey @timgit, I was just starting to think about it more and I'm wondering what we can do as a stopgap in this scenario. Do you think the best case is for me to try to catch/decode the type of pg error and shut pg-boss down if it is critical (sort of what you replied in your first comment)? Or, assuming we can even fix the pg error leak, would just letting it keep retrying on this type of error be the ideal state (maybe not)? Do we need to change the internal pg-boss DB status of the pool to something other than opened in this case, maybe?

Per the untrapped exception in Timekeeper.getCronTime in the call to this.db.executeSql, just to confirm: given the error handling they propose setting up, we would expect a call to this.pool.query that hits an 'ECONNREFUSED' to emit an error we can handle instead of blowing up? And if it did, then the pg-boss side would be all good? Trying to think how I can best distill the problem to them using only pg, if possible, in an example app. Thanks so much!

BTW, you should hopefully see Wasp as a backer now, as a small thanks for all the great work you are doing for the community with this OSS. 🚀
Thanks for the sample project. I’ll try to repro.
This looks related: https://github.com/brianc/node-postgres/issues/2439
Thanks for the fix and new release with this included, @timgit! We at Wasp really appreciate it! 🐝 🚀 👏🏻
This direction makes sense to me, @timgit. Thanks again! We will try this out in Wasp and see how it goes. 👍🏻 Appreciate it
Regarding https://github.com/timgit/pg-boss/issues/365#issuecomment-1414692855, I added the suggested handling but node still crashed, so that's not the magic solution. 😦
I've been able to repro node crashing just by stopping Postgres locally. However, I can see pg-boss logging errors for a while before the crash, since the workers haven't been told to stop trying. Eventually it does crash, but from an unhandled error internal to pg.
For example, if you change your example to the following, node doesn’t crash because all workers have been stopped.
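(The modified example itself isn't reproduced in this copy of the thread; roughly, the idea is an error handler that shuts pg-boss down so nothing keeps polling the dead connection. A hedged sketch, assuming boss is the started PgBoss instance from the earlier sketch:)

```js
// boss is the started PgBoss instance (see the earlier sketch).
boss.on('error', async error => {
  console.error(error)

  // Stop all workers and internal maintenance so nothing keeps polling the
  // dead connection; node then stays up, but pg-boss is fully shut down.
  await boss.stop()
})
```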
However, I wouldn't recommend an error handler this severe, since not all error events have the same severity. If you add a condition that looks specifically for these fatal errors from pg, you could exit conditionally, which may resolve this issue on Fly.
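A hedged sketch of what such a conditional handler could look like. Which errors count as "fatal" is an assumption here: ECONNREFUSED plus the "Connection terminated unexpectedly" message from the stack trace above; boss is again the started PgBoss instance.

```js
// Assumption: only connection-level failures should trigger a shutdown.
const isFatalDbError = error =>
  error.code === 'ECONNREFUSED' ||
  /Connection terminated unexpectedly/i.test(error.message || '')

boss.on('error', async error => {
  console.error(error)

  if (isFatalDbError(error)) {
    await boss.stop()
    // Exit so the platform (e.g. Fly.io) restarts the app and reconnects
    // once the database is reachable again.
    process.exit(1)
  }
})
```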