bullmq: Parent job does not execute when child job fails

I have a FlowProducer that runs a parent job after a set of child jobs has been executed. There are about 10k children being run concurrently. I’ve set the option removeOnFail to true on both the children and the parent, but it seems that if a child fails, execution of the parent just hangs.

Could this be a bug or is it a configuration issue on my side?

Here is what the code that creates the jobs looks like:

export const initializeCollectionStats = async () => {
  const flowProducer = new FlowProducer({ connection: redisClient, sharedConnection: true });
  const totalSupply = 10000;
  const tokenJobs = [];
  // firstTokenID, contractAddress, QUEUE_NAME and PROJECT_INIT_QUEUE are defined elsewhere
  for (let i = firstTokenID; i <= totalSupply; i++) {
    tokenJobs.push({
      name: 'child',
      data: {
        hello: 'world'
      },
      queueName: QUEUE_NAME, // must match the queue the worker listens on
      opts: {
        removeOnFail: true,
        attempts: 2,
        timeout: 3000
      }
    });
  }

  const initJob = await flowProducer.add({
    name: 'initProject',
    data: {
      contractAddress
    },
    queueName: PROJECT_INIT_QUEUE,
    children: tokenJobs,
    opts: {
      attempts: 1
    }
  });

  return initJob;
};

and the worker that runs the child jobs looks like this:

const worker = new Worker(QUEUE_NAME, async (job) => {
    const { hello } = job.data;
    try {
        const metadata = await doSomethingAsync(hello);
        return 'done';
    } catch (e) {
        logger.error(e);
        return null; // note: returning here marks the job as completed, not failed
    }
}, { connection: redisClient, concurrency: 300, sharedConnection: true });

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 28 (10 by maintainers)

Most upvoted comments

This is in my pending list, this week I can work on this feature 👀

hi @Leobaillard, we currently have failParentOnFailure option https://docs.bullmq.io/guide/flows/fail-parent, could it address your case?
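For reference, failParentOnFailure is set per child in the flow definition. A minimal sketch of what that could look like (queue names and job names here are placeholders, not taken from the original report):

```javascript
// Hedged sketch: a flow definition whose children propagate their failure
// to the parent via failParentOnFailure, instead of leaving the parent
// waiting forever. With a FlowProducer this object would be submitted as
// `await flowProducer.add(flow)`.
const flow = {
  name: 'initProject',
  queueName: 'project-init',
  children: [
    {
      name: 'child',
      queueName: 'token-queue',
      data: { hello: 'world' },
      opts: {
        attempts: 2,
        // If this child exhausts its attempts, fail the parent too:
        failParentOnFailure: true
      }
    }
  ]
};
```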

Hi! Thanks for your quick answer!

It does, in part.

There are still use cases where it would be nice to allow child jobs to fail when they are not critical to the success (or partial success) of the parent. Users would then be expected to handle this “partial” state in their business logic. This is useful for keeping a history of failed child jobs while still letting the parent job execute.

Some sort of job report can then be generated by the app, listing the successful and failed child tasks.

Hi, is there any way to solve this problem? I have the same issue: if a child job fails or stops, the parent job does not run.

Currently by design, until all child jobs have been completed the parent job will not be processed.

I assume there’s an underlying architectural problem, because it’s not entirely clear why we need to disconnect a child and then find it again as per #2092, instead of the parent having a flag to continue on child failures and we use some API to check children manually and decide what state the Parent should end up in. We’re going to end up doing that last leg regardless but it’s a bit convoluted.

Anyway workaround for now seem to be:

  • Wrapping child jobs in try-catch-rethrow and on the final retry making the job pass but with some userland failed status as the return value
  • In the parent getting all the return values and deciding how to handle any failures

Arguably that’s actually quite a good way to go about it regardless, as it’s more obvious than an incantation of several flags, and “completed” doesn’t have to mean “succeeded”.
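The first bullet above can be sketched as a small wrapper around a processor function. This is a hypothetical helper, not from BullMQ itself: the 'job-failed' marker string and wrapProcessor name are made up, and the check assumes job.attemptsMade counts attempts completed before the current one.

```javascript
// Userland "failed" marker returned instead of throwing on the final attempt.
const FAILED_MARKER = 'job-failed';

// Wraps a processor so intermediate failures still trigger BullMQ retries,
// but the final attempt "completes" with a failure marker, letting the
// parent job run.
function wrapProcessor(processor) {
  return async (job) => {
    try {
      return await processor(job);
    } catch (err) {
      const maxAttempts = job.opts.attempts ?? 1;
      // Not the last attempt yet: rethrow so BullMQ schedules a retry.
      if (job.attemptsMade + 1 < maxAttempts) throw err;
      // Last attempt: complete with a userland failure status.
      return { status: FAILED_MARKER, message: String(err) };
    }
  };
}
```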

The workaround you suggest is precisely what we do.

Then we have some helper utils to check for the special “succeeded but actually failed” return value, and allow the parent to continue or fail based on things like the percentage of child jobs that truly succeeded or failed.
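Such a helper might look like the following sketch (the 'job-failed' marker string and the 80% threshold are illustrative; getChildrenValues is the BullMQ method that returns a map of child job keys to their return values):

```javascript
// Given the map returned by job.getChildrenValues(), compute the fraction
// of children that truly succeeded, treating the userland failure marker
// as a failure even though BullMQ counts the job as completed.
function childSuccessRate(childrenValues) {
  const results = Object.values(childrenValues);
  if (results.length === 0) return 1;
  const failed = results.filter((r) => r && r.status === 'job-failed').length;
  return (results.length - failed) / results.length;
}

// Inside the parent processor (sketch):
// const values = await job.getChildrenValues();
// if (childSuccessRate(values) < 0.8) {
//   throw new Error('Too many child jobs failed');
// }
```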

hey @theDanielJLewis, we have a PR for it: https://github.com/taskforcesh/bullmq/pull/1953 You can also take a look at that one; @manast and I are evaluating this new feature

I’m in the same boat as everyone else.

It seems odd that the only two options right now for when a child fails are:

  1. The parent does nothing and is never even added to the queue.
  2. Use failParentOnFailure and the parent will always fail when a child fails, even if there were successful children to process.

I wish we had, either as the default or a different option, something like continueParentOnFailure that would allow children to fail but still add the parent to the queue and let the parent process separately.

Adding this option to children would allow us to let some children actually fail the parent, while letting others not affect the parent.

But doing nothing—not even appearing in the queue—just seems strange and not an obvious result.

@Slind14 what is your use-case?

We would like to keep a history of recently failed jobs within Redis. Otherwise removeOnFail would work, I guess.