pg-boss: Unexpected node shutdown when the database is abruptly shut down
Hi there! We use this awesome library as our jobs provider with Wasp. 😄 We recently had a user notice that when they deleted their DB on Fly.io, the app came crashing down. Note: Wasp does make use of boss.on('error', error => console.error(error)) as recommended.
Strangely, when I tried to reproduce locally by just shutting Postgres down when the app was running with cron jobs, I got the following error but it did not die:
Server(stderr): error: terminating connection due to administrator command
So it seems like it noticed and handled it gracefully.
But then when I deployed it to Fly.io and killed the DB, I got the following (different) error:
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]Error: Connection terminated unexpectedly
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Connection.<anonymous> (/app/server/node_modules/pg/lib/client.js:131:73)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Object.onceWrapper (node:events:627:28)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Connection.emit (node:events:513:28)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Socket.<anonymous> (/app/server/node_modules/pg/lib/connection.js:112:12)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at Socket.emit (node:events:525:35)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at endReadableNT (node:internal/streams/readable:1359:12)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] client: Client {
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] _events: [Object: null prototype] { error: [Function (anonymous)] },
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] _eventsCount: 1,
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] _maxListeners: undefined,
2023-02-01T19:58:32Z app[f07d5e6d] mia [info] connectionParameters:
...
This caused the node process to die, and it only happens when pg-boss is used and jobs are enabled (Express-only apps are fine). Is it possible that, even though pg-boss handles the error event (https://github.com/timgit/pg-boss/blob/master/src/db.js#L15) and we do as well, a pg error is somehow escaping and bubbling up to the top level when the DB is killed?
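For context, here is a minimal sketch (not from the original report; the pool setup and connection string are placeholders) of the two separate paths a pg connection error can take, which is one way an error can slip past an 'error' listener like the one in pg-boss's db.js:

```js
const { Pool } = require('pg')

// Illustrative pool; the connection string is a placeholder.
const pool = new Pool({ connectionString: process.env.DATABASE_URL })

// A listener like this catches errors that pg emits on idle clients
// (for example, the backend going away between queries).
pool.on('error', error => console.error('idle client error', error))

// An error that hits a client while a query is in flight surfaces as a
// rejected promise instead, so it must be handled at the call site or it
// escapes the 'error' listener entirely.
pool.query('SELECT 1').catch(error => console.error('query error', error))
```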
Happy to try to do anything needed to narrow in on this, just let me know! Thanks!
About this issue
- State: closed
- Created a year ago
- Comments: 17 (5 by maintainers)
Thanks for the support!
Following is an example of adding retries based on ECONNREFUSED. It’s simple, and given the number of maintenance jobs, it fails pretty fast, so adjust as you think appropriate.
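(The snippet originally attached to this comment was not preserved in this copy of the thread. Below is a hedged reconstruction of the idea; the threshold of 10, the counter reset, and the decision to exit are illustrative assumptions, not the maintainer's exact code.)

```js
const PgBoss = require('pg-boss')

async function main () {
  // Connection string is a placeholder.
  const boss = new PgBoss(process.env.DATABASE_URL)

  let connectionRefusedCount = 0

  boss.on('error', async error => {
    console.error(error)

    if (error.code === 'ECONNREFUSED') {
      connectionRefusedCount++

      // Threshold is arbitrary; with several maintenance jobs retrying,
      // the count grows quickly, so tune it for your deployment.
      if (connectionRefusedCount >= 10) {
        await boss.stop() // stop workers and maintenance before giving up
        process.exit(1)
      }
    } else {
      connectionRefusedCount = 0 // a different kind of error resets the streak
    }
  })

  await boss.start()
}

main().catch(console.error)
```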
Hey @timgit, I was just starting to think about it more and I'm wondering what we can do as a stopgap in this scenario. Do you think the best case is for me to try to catch/decode the type of pg error and shut pg-boss down if it is critical (sort of what you replied in your first comment)? Or, assuming we can even fix the pg error leak, would just letting it keep retrying on this type of error be the ideal state (maybe not)? Do we need to change the internal pg-boss DB status of the pool to something other than opened in this case, maybe?

Per the untrapped exception in Timekeeper.getCronTime in the call to this.db.executeSql, just to confirm: given the error handling they propose setting up, we would expect a call to this.pool.query that hits an 'ECONNREFUSED' to emit an error we can handle instead of blowing up? And if it did, then the pg-boss side would be all good? Trying to think how I can best distill the problem to them using only pg, if possible, in an example app. Thanks so much!

BTW, you should hopefully see Wasp as a backer now, as a small thanks for all the great work you are doing for the community with this OSS. 🚀
Thanks for the sample project. I’ll try to repro.
This looks related: https://github.com/brianc/node-postgres/issues/2439
Thanks for the fix and new release with this included, @timgit! We at Wasp really appreciate it! 🐝 🚀 👏🏻
This direction makes sense to me, @timgit. Thanks again! We will try this out in Wasp and see how it goes. 👍🏻 Appreciate it
Regarding https://github.com/timgit/pg-boss/issues/365#issuecomment-1414692855, I added the suggested handling but node still crashed, so that's not the magic solution. 😦
I've been able to repro node crashing just by stopping Postgres locally. However, I can see pg-boss logging errors for a while before the crash, since the workers haven't been told to stop trying. Eventually it does crash, but from an unhandled error internal to pg.
For example, if you change your example to the following, node doesn’t crash because all workers have been stopped.
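(The modified example itself isn't reproduced in this copy of the thread; roughly, the idea is an error handler that shuts pg-boss down so nothing keeps polling the dead connection. A hedged sketch, assuming boss is the started PgBoss instance from the earlier sketch:)

```js
// boss is the started PgBoss instance (see the earlier sketch).
boss.on('error', async error => {
  console.error(error)

  // Stop all workers and internal maintenance so nothing keeps polling the
  // dead connection; node then stays up, but pg-boss is fully shut down.
  await boss.stop()
})
```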
However, I wouldn't recommend an error handler this severe, since not all error events have the same severity. If you add a condition that looks specifically for these fatal errors from pg, you could exit conditionally, which may resolve this issue on Fly.
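A hedged sketch of what such a conditional handler could look like. Which errors count as "fatal" is an assumption here: ECONNREFUSED plus the "Connection terminated unexpectedly" message from the stack trace above; boss is again the started PgBoss instance.

```js
// Assumption: only connection-level failures should trigger a shutdown.
const isFatalDbError = error =>
  error.code === 'ECONNREFUSED' ||
  /Connection terminated unexpectedly/i.test(error.message || '')

boss.on('error', async error => {
  console.error(error)

  if (isFatalDbError(error)) {
    await boss.stop()
    // Exit so the platform (e.g. Fly.io) restarts the app and reconnects
    // once the database is reachable again.
    process.exit(1)
  }
})
```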