kafkajs: v1.16.0 KafkaJSNumberOfRetriesExceeded change bug/question

Describe the bug
Following the changes in v1.16.0, we’ve experienced a breaking change in our application relating to https://github.com/tulios/kafkajs/pull/1274. I’m opening this issue to:

  1. check my understanding of what is happening
  2. check whether the v1.16.0 approach is likely to change
  3. determine what changes I may need to make in our application

1.15.0 / old behaviour

At v1.15.0 and below, KafkaJSNumberOfRetriesExceeded was classed as an error that would cause the consumer to restart. Potentially, this could lead to an infinite restart loop; however, my understanding was that the restartOnFailure option was available to let users customise that behaviour to the needs of the application, per the docs here: https://kafka.js.org/docs/configuration#restartonfailure.

This is what we have done in our application. In our restartOnFailure, we specifically watch for KafkaJSNumberOfRetriesExceeded and only allow the consumer to restart a limited number of times on this error before we gracefully shut down and exit.
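For reference, our setup looks roughly like the sketch below (a minimal, simplified version; the client id, broker list, group id and MAX_RESTARTS limit are placeholders rather than our real values):

```js
const { Kafka } = require('kafkajs')

// Simplified sketch of our v1.15.0-era configuration: allow a limited number
// of restarts caused by KafkaJSNumberOfRetriesExceeded, then stop restarting
// so the application can shut down gracefully. MAX_RESTARTS is illustrative.
const MAX_RESTARTS = 5
let restartCount = 0

const kafka = new Kafka({
  clientId: 'my-app',          // placeholder
  brokers: ['localhost:9092'], // placeholder
  retry: {
    restartOnFailure: async (error) => {
      if (error.name === 'KafkaJSNumberOfRetriesExceeded') {
        restartCount += 1
        return restartCount <= MAX_RESTARTS // restart only a limited number of times
      }
      return true // restart for any other failure
    },
  },
})

const consumer = kafka.consumer({ groupId: 'my-group' }) // placeholder group id
```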


1.16.0 / new behaviour

In v1.16.0 this no longer works: KafkaJSNumberOfRetriesExceeded leads directly to a consumer crash and does not trigger the restartOnFailure method. Based on this, I have the following questions:

  1. Is it definitely intended that restartOnFailure no longer allows customisation of behaviour for KafkaJSNumberOfRetriesExceeded, and will this change be retained? We don’t want to modify our consumers to work with v1.16.0 if it might change again in the near future!
  2. Further to (1), is the documentation for restartOnFailure still valid? https://kafka.js.org/docs/configuration#restartonfailure I believe it says specifically that consumers will restart on KafkaJSNumberOfRetriesExceeded, but that is no longer the case 🤔
  3. I’d assumed that previously, KafkaJSNumberOfRetriesExceeded led to a restart because a restart would potentially resolve whatever was causing the retries to fail. Is that not the case? What is the benefit of a restart compared to a retry?
  4. Depending on the answer to (3) above, is it the case that any logic relating to restarting because of KafkaJSNumberOfRetriesExceeded would now be expected to live in a consumer crash listener? In other words, we’d need to “manually” restart rather than using the consumer’s built-in restart/restartOnFailure behaviour (a rough sketch follows this list).
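To make question (4) concrete, I imagine the “manual” approach would look something like the sketch below. This is purely illustrative: handleEachMessage and MAX_CRASH_RESTARTS are made-up names, and I’m assuming the consumer.events.CRASH payload exposes the error as described in the instrumentation docs.

```js
// Illustrative only: restarting "manually" from a crash listener instead of
// relying on restartOnFailure. Assumes `consumer` was created as in the
// earlier sketch.
const MAX_CRASH_RESTARTS = 5
let crashRestarts = 0

consumer.on(consumer.events.CRASH, async ({ payload: { error } }) => {
  // In v1.16.0 the crash error appears to be a KafkaJSNonRetriableError that
  // wraps the original error, so check originalError as well (hypothetical
  // handling, based on what we see in our crash output).
  const cause = (error && error.originalError) || error
  if (!cause || cause.name !== 'KafkaJSNumberOfRetriesExceeded') return

  if (crashRestarts >= MAX_CRASH_RESTARTS) {
    process.exit(1) // give up and let the orchestrator restart the whole process
  }

  crashRestarts += 1
  await consumer.connect()
  await consumer.run({ eachMessage: handleEachMessage }) // handleEachMessage is hypothetical
})
```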

Thanks for the awesome library and I hope the above makes sense 🙏


Environment:

  • OS: macOS
  • KafkaJS version 1.16.0
  • NodeJS version 16

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 5
  • Comments: 17 (1 by maintainers)


Most upvoted comments

Merged the fix. It’ll go out in v2.0.0, which I’m planning to ship this week, if all goes well.

Thanks - I’m making a ticket in our backlog to do some testing on this and will try to get it played ASAP.

Thanks for the detailed information. No, I would say that’s not intended behavior, since a KafkaJSConnectionError should be a retriable error, so I would expect the consumer to restart in that case.

Looking at the message from the crash listener, I wonder if this is a case of nested retriers somewhere. The error that gets thrown is a KafkaJSNonRetriableError, which is expected when a retrier encounters an error that’s not retriable, but the originalError is a KafkaJSNumberOfRetriesExceeded, not the KafkaJSConnectionError I would have expected to see.

https://github.com/tulios/kafkajs/blob/1876abcb5effd6183c1fbaa50dee9cef269b67a5/src/retry/index.js#L43-L56

@Nevon Hi - I finally had a chance to test this for you. I agree with @uwburn - I am seeing consistent behaviour with your changes 👍 Tested over 10 times. Thanks for your help!

I am experiencing this problem as well with 1.16.

My setup is like this:

  • Single node Kafka cluster on minikube
  • Few containers in minikube running consumers
  • I kill the broker on purpose to see if the error can be handled; I expect to hit restartOnFailure (as it did in 1.15)
  • Some consumers hit restartOnFailure (and keep going through it if I return true), but some (most of them) never do, as a non-retriable error is raised instead, confirming an intermittent behavior

Testing https://github.com/tulios/kafkajs/tree/fix-retries-1299, I see consistent behavior: all consumers hit restartOnFailure, so the condition can be handled by my code.

Should we expect a fix in 1.16.1, or is it going straight into 1.17?

We also have a backlog item to test this on our k8s environment. We will probably need to leave it running a couple of days to confirm the fix.

Either npm install tulios/kafkajs#fix-retries-1299 --save or yarn add tulios/kafkajs#fix-retries-1299 will do it.

Could those affected by this try https://github.com/tulios/kafkajs/tree/fix-retries-1299 and let me know if it fixes things? Since it’s very difficult to reproduce this in a reliable way, I’m not 100% confident that this solves it, but based on the logs I believe it will.