kafkajs: v1.16.0 KafkaJSNumberOfRetriesExceeded change bug/question
Describe the bug Following changes in v1.16.0 we’ve experienced a breaking change within our application relating to https://github.com/tulios/kafkajs/pull/1274. I’m opening this issue to:
- check my understanding of what is happening
- check whether the v1.16.0 approach is likely to change
- determine what changes I may need to make in our application
1.15.0 / old behaviour
At v1.15.0 and below, KafkaJSNumberOfRetriesExceeded was classed as being an error that would cause the consumer to restart. Potentially, this could lead to an infinite restart loop. However, my understanding was that the restartOnFailure option was available to allow users to customise that behaviour to the needs of the application, per the docs here: https://kafka.js.org/docs/configuration#restartonfailure.
This is what we have done in our application. In our restartOnFailure, we specifically watch for KafkaJSNumberOfRetriesExceeded and we only allow the consumer to restart a specific number of times based on this error before we then gracefully shutdown and exit.
1.16.0 / new behaviour
In v1.16.0 this no longer works. KafkaJSNumberOfRetriesExceeded leads directly to a consumer crash and does not trigger the restartOnFailure method. Based on this I have the following questions:
- Is it definitely intended that
restartOnFailureno longer allows customisation of behaviour forKafkaJSNumberOfRetriesExceededand will this change be retained? Don’t want to modify our consumers to work with v1.16.0 if it might change again in the near future! - Further to (1), is the the documentation for
restartOnFailurestill valid? https://kafka.js.org/docs/configuration#restartonfailure I believe it’s saying specifically that consumers will restart onKafkaJSNumberOfRetriesExceededbut that is no longer the case 🤔 - I’d assumed that previously,
KafkaJSNumberOfRetriesExceededwould lead to a restart because a restart would potentially resolve whatever was causing the retries to fail. Is that not the case? What is the benefit of a restart compared to retry? - Depending on the answer to (3) above, is it the case that any logic relating to restarting because of
KafkaJSNumberOfRetriesExceededwould now be expected to be in a consumer crash listener? i.e. we’d need to “manually” restart rather than using the consumer’s built in restart/restartOnFailure behaviour.
Thanks for the awesome library and I hope the above makes sense 🙏
Environment:
- OS: MACOS
- KafkaJS version 1.16.0
- NodeJS version 16
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 5
- Comments: 17 (1 by maintainers)
Merged the fix. It’ll go out in v2.0.0, which I’m planning to ship this week, if all goes well.
Thanks - am making a ticket in our backlog to do some testing on this and will try to get it played ASAP
Thanks for the detailed information. No, I would say that’s not intended behavior, since a KafkaJSConnectionError should be a retriable error, so I would expect that the consumer restarts in that case.
Looking at the message from the crash listener, I wonder if this is a case of nested retriers somewhere, because the error that gets thrown is a
KafkaJSNonRetriableError, which is expected when a retrier encounters an error that’s not retriable, but theoriginalErroris aKafkaJSNumberOfRetriesExceedederror, not aKafkaJSConnectionError- which is what I would have expected to see.https://github.com/tulios/kafkajs/blob/1876abcb5effd6183c1fbaa50dee9cef269b67a5/src/retry/index.js#L43-L56
@Nevon Hi - finally had a chance to test this for you. I agree with @uwburn - I am seeing consistent behaviour with your changes 👍 Tested over 10 times. Thanks for your help!
I am experiencing this problem as well with 1.16.
My setup is like this:
restartOnFailure(as it did in 1.15)restartOnFailure(and keep going through it if i returntrue), some (most of them) never do as non-retriable error is raised, confirming an intermittent behaviorTesting https://github.com/tulios/kafkajs/tree/fix-retries-1299 i see a consistent behavior, all consumers hit
restartOnFailure, so that the condition can be handled by my code.Should we expect a fix in 1.16.1 or is it going straight in 1.17?
We also have a backlog item to test this on our k8s environment. We will probably need to leave it running a couple days to confirm the fix.
npm install tulios/kafkajs#fix-retries-1299 --save/yarn add tulios/kafkajs#fix-retries-1299will do it.Could those affected by this try https://github.com/tulios/kafkajs/tree/fix-retries-1299 and let me know if it fixes things? Since it’s very difficult to reproduce this in a reliable way, I’m not 100% confident that this solves it, but based on the logs I believe it will.