kafkajs: v1.16.0 KafkaJSNumberOfRetriesExceeded change bug/question

Describe the bug
Following the changes in v1.16.0, we’ve experienced a breaking change in our application relating to https://github.com/tulios/kafkajs/pull/1274. I’m opening this issue to:

  1. check my understanding of what is happening
  2. check whether the v1.16.0 approach is likely to change
  3. determine what changes I may need to make in our application

1.15.0 / old behaviour

At v1.15.0 and below, KafkaJSNumberOfRetriesExceeded was classed as an error that would cause the consumer to restart. Potentially, this could lead to an infinite restart loop; however, my understanding was that the restartOnFailure option was available to let users customise that behaviour to the needs of the application, per the docs here: https://kafka.js.org/docs/configuration#restartonfailure.

This is what we have done in our application. In our restartOnFailure, we specifically watch for KafkaJSNumberOfRetriesExceeded and only allow the consumer to restart a limited number of times on this error before we gracefully shut down and exit.
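For reference, our setup looks roughly like the sketch below (a minimal, simplified version; the client id, broker list, group id and MAX_RESTARTS limit are placeholders rather than our real values):

```js
const { Kafka } = require('kafkajs')

// Simplified sketch of our v1.15.0-era configuration: allow a limited number
// of restarts caused by KafkaJSNumberOfRetriesExceeded, then stop restarting
// so the application can shut down gracefully. MAX_RESTARTS is illustrative.
const MAX_RESTARTS = 5
let restartCount = 0

const kafka = new Kafka({
  clientId: 'my-app',          // placeholder
  brokers: ['localhost:9092'], // placeholder
  retry: {
    restartOnFailure: async (error) => {
      if (error.name === 'KafkaJSNumberOfRetriesExceeded') {
        restartCount += 1
        return restartCount <= MAX_RESTARTS // restart only a limited number of times
      }
      return true // restart for any other failure
    },
  },
})

const consumer = kafka.consumer({ groupId: 'my-group' }) // placeholder group id
```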


1.16.0 / new behaviour

In v1.16.0 this no longer works: KafkaJSNumberOfRetriesExceeded leads directly to a consumer crash and does not trigger the restartOnFailure method. Based on this, I have the following questions:

  1. Is it definitely intended that restartOnFailure no longer allows customisation of behaviour for KafkaJSNumberOfRetriesExceeded, and will this change be retained? We don’t want to modify our consumers to work with v1.16.0 if it might change again in the near future!
  2. Further to (1), is the documentation for restartOnFailure still valid? https://kafka.js.org/docs/configuration#restartonfailure I believe it says specifically that consumers will restart on KafkaJSNumberOfRetriesExceeded, but that is no longer the case 🤔
  3. I’d assumed that previously, KafkaJSNumberOfRetriesExceeded led to a restart because a restart would potentially resolve whatever was causing the retries to fail. Is that not the case? What is the benefit of a restart compared to a retry?
  4. Depending on the answer to (3) above, is it the case that any logic relating to restarting because of KafkaJSNumberOfRetriesExceeded would now be expected to live in a consumer crash listener? In other words, we’d need to “manually” restart rather than using the consumer’s built-in restart/restartOnFailure behaviour (a rough sketch follows this list).
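To make question (4) concrete, I imagine the “manual” approach would look something like the sketch below. This is purely illustrative: handleEachMessage and MAX_CRASH_RESTARTS are made-up names, and I’m assuming the consumer.events.CRASH payload exposes the error as described in the instrumentation docs.

```js
// Illustrative only: restarting "manually" from a crash listener instead of
// relying on restartOnFailure. Assumes `consumer` was created as in the
// earlier sketch.
const MAX_CRASH_RESTARTS = 5
let crashRestarts = 0

consumer.on(consumer.events.CRASH, async ({ payload: { error } }) => {
  // In v1.16.0 the crash error appears to be a KafkaJSNonRetriableError that
  // wraps the original error, so check originalError as well (hypothetical
  // handling, based on what we see in our crash output).
  const cause = (error && error.originalError) || error
  if (!cause || cause.name !== 'KafkaJSNumberOfRetriesExceeded') return

  if (crashRestarts >= MAX_CRASH_RESTARTS) {
    process.exit(1) // give up and let the orchestrator restart the whole process
  }

  crashRestarts += 1
  await consumer.connect()
  await consumer.run({ eachMessage: handleEachMessage }) // handleEachMessage is hypothetical
})
```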

Thanks for the awesome library and I hope the above makes sense 🙏


Environment:

  • OS: macOS
  • KafkaJS version 1.16.0
  • NodeJS version 16

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 5
  • Comments: 17 (1 by maintainers)


Most upvoted comments

Merged the fix. It’ll go out in v2.0.0, which I’m planning to ship this week, if all goes well.

Thanks - I’m making a ticket in our backlog to do some testing on this and will try to get it played ASAP.

Thanks for the detailed information. No, I would say that’s not intended behavior, since a KafkaJSConnectionError should be a retriable error, so I would expect the consumer to restart in that case.

Looking at the message from the crash listener, I wonder if this is a case of nested retriers somewhere. The error that gets thrown is a KafkaJSNonRetriableError, which is expected when a retrier encounters an error that’s not retriable, but the originalError is a KafkaJSNumberOfRetriesExceeded, not the KafkaJSConnectionError I would have expected to see.

https://github.com/tulios/kafkajs/blob/1876abcb5effd6183c1fbaa50dee9cef269b67a5/src/retry/index.js#L43-L56

@Nevon Hi - I finally had a chance to test this for you. I agree with @uwburn - I am seeing consistent behaviour with your changes 👍 Tested over 10 times. Thanks for your help!

I am experiencing this problem as well with 1.16.

My setup is like this:

  • Single node Kafka cluster on minikube
  • Few containers in minikube running consumers
  • I kill the broker on purpose to see if the error can be handled; I expect to hit restartOnFailure (as it did in 1.15)
  • Some consumers hit restartOnFailure (and keep going through it if I return true), but some (most of them) never do, as a non-retriable error is raised instead, confirming an intermittent behavior

Testing https://github.com/tulios/kafkajs/tree/fix-retries-1299, I see consistent behavior: all consumers hit restartOnFailure, so the condition can be handled by my code.

Should we expect a fix in 1.16.1, or is it going straight into 1.17?

We also have a backlog item to test this on our k8s environment. We will probably need to leave it running a couple of days to confirm the fix.

Either npm install tulios/kafkajs#fix-retries-1299 --save or yarn add tulios/kafkajs#fix-retries-1299 will do it.

Could those affected by this try https://github.com/tulios/kafkajs/tree/fix-retries-1299 and let me know if it fixes things? Since it’s very difficult to reproduce this in a reliable way, I’m not 100% confident that this solves it, but based on the logs I believe it will.