confluent-kafka-go: Kafka consumer gets stuck after exceeding max.poll.interval.ms
Description
When the consumer does not receive a message for 5 minutes (the default max.poll.interval.ms of 300000 ms), it comes to a halt without exiting the program. The consumer process hangs and does not consume any more messages.
The following error message is logged:
MAXPOLL|rdkafka#consumer-1| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 255ms (adjust max.poll.interval.ms for long-running message processing): leaving group
I see that ErrMaxPollExceeded is defined here, but I am unable to find where it is raised.
If such an error is raised, why does the program not exit?
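For context, here is a minimal poll loop where one would expect kafka.ErrMaxPollExceeded to surface as a kafka.Error event. This is a sketch, not the reporter's code; it reuses the client configuration from the checklist below, and the topic name is a placeholder:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":  "my.kafka.host",
		"group.id":           "my.group.id",
		"auto.offset.reset":  "earliest",
		"enable.auto.commit": false,
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	// Topic name is a placeholder.
	if err := c.SubscribeTopics([]string{"my-topic"}, nil); err != nil {
		panic(err)
	}

	for {
		ev := c.Poll(100) // poll every 100ms, well inside max.poll.interval.ms
		switch e := ev.(type) {
		case *kafka.Message:
			fmt.Printf("message on %v\n", e.TopicPartition)
		case kafka.Error:
			// In principle ErrMaxPollExceeded should show up here as an
			// error event, which is what the question above is probing.
			if e.Code() == kafka.ErrMaxPollExceeded {
				fmt.Println("max.poll.interval.ms exceeded:", e)
			}
		}
	}
}
```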
Checklist
Please provide the following information:
- confluent-kafka-python and librdkafka version (confluent_kafka.version(master) and confluent_kafka.libversion(1.0.0)):
- Apache Kafka broker version: v1.1.0
- Client configuration: { "bootstrap.servers": "my.kafka.host", "group.id": "my.group.id", "auto.offset.reset": "earliest", "enable.auto.commit": false }
- Operating system:
- Provide client logs (with 'debug': '..' as necessary)
- Provide broker log excerpts
- Critical issue
About this issue
- Original URL
- State: open
- Created 5 years ago
- Reactions: 9
- Comments: 37 (8 by maintainers)
Commits related to this issue
- Reverted to the confluent-kafka-go/kafka/v1 due to https://github.com/confluentinc/confluent-kafka-go/issues/344 — committed to alrusov/kafka by alrusov a year ago
Add us to the set of people who are definitely seeing this problem despite calling Poll far more frequently than max.poll.interval.ms. In our case, we implemented a heartbeat that is emitted from the loop that contains Poll() and have a separate thread that alerts if the heartbeat stops. We are seeing the max poll interval exceeded and getting kicked out of the consumer group even though the heartbeat is continuous. Additionally, we are also checking for Kafka errors in the poll results, specifically looking for the kafka.ErrMaxPollExceeded error code.
So we are definitely calling Poll every 100ms and emitting a heartbeat from that loop. We are definitely NOT receiving the MaxPollExceeded error in the poll results, even when we are kicked out of the consumer group for apparently exceeding the max poll interval. The implication is that there is a failure between calling consumer.Poll() in the Go package and the Go package actually calling poll() in librdkafka. Not only that, but we are using a logger that emits JSON, and the only evidence we have of the error occurring is the log message emitted to stdout via the client, which is NOT wrapped in JSON. So we are calling Poll(), but we never receive the error that can only be received via poll(), and that error says we are not calling poll() even though we know we are. There is clearly a bug inside the Go consumer.Poll(), before the actual call to librdkafka's poll() function, that is not generating any useful output to the caller.
One hypothesis we are about to test is that this is caused by linking dynamically to librdkafka when doing a musl build in an Alpine container, which might explain why no one at Confluent seems able to reproduce this behaviour when so many of us are seeing it.
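A rough sketch of the heartbeat/watchdog pattern described above, assuming the v1 Go client; the intervals, logging, and function name are illustrative, not the commenter's actual code:

```go
package consumerwatch

import (
	"log"
	"sync/atomic"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// pollWithWatchdog polls in a tight loop and runs a separate goroutine that
// alerts if the loop stops refreshing its heartbeat timestamp.
func pollWithWatchdog(c *kafka.Consumer) {
	var lastPoll atomic.Int64
	lastPoll.Store(time.Now().UnixNano())

	go func() {
		for range time.Tick(10 * time.Second) {
			if time.Since(time.Unix(0, lastPoll.Load())) > time.Minute {
				log.Println("ALERT: poll loop heartbeat stalled")
			}
		}
	}()

	for {
		ev := c.Poll(100) // roughly every 100ms, far below max.poll.interval.ms
		lastPoll.Store(time.Now().UnixNano())

		switch e := ev.(type) {
		case *kafka.Message:
			log.Printf("message on %v", e.TopicPartition)
		case kafka.Error:
			// The reporters above say this code never surfaces, even when
			// the MAXPOLL log line appears on stdout.
			if e.Code() == kafka.ErrMaxPollExceeded {
				log.Println("max poll interval exceeded:", e)
			}
		}
	}
}
```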
I am having this issue too. Is there any way to fix this?
Have also found this issue happening in production on 1.8.2. It is usually identified when lag starts randomly spiking and pods need to be restarted. What is strange is that in dev/qa environments, which see less traffic and definitely have the potential for longer gaps between messages, I never see this particular error. Only in high-throughput environments. Also, when it does get stuck, it seems like there are no active members, so no clients are assigned to partitions. Curious if this is the same behavior others are seeing?
We've run into this issue as well - the consumer gets hung when it's working with a topic/partition with a huge backlog. We could work around this by handling the RevokedPartitions event in the rebalance callback, along the lines of the sketch below. And FWIW, we also run the consumer on arm64 machines with dynamically linked librdkafka.
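The workaround described above might look roughly like this. This is a reconstruction under the v1 API, not the commenter's original snippet; the function name is illustrative:

```go
package consumerexample

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// subscribeWithRebalanceCb subscribes with a rebalance callback that handles
// AssignedPartitions and RevokedPartitions explicitly, which is the shape of
// the workaround described above.
func subscribeWithRebalanceCb(c *kafka.Consumer, topics []string) error {
	return c.SubscribeTopics(topics, func(consumer *kafka.Consumer, ev kafka.Event) error {
		switch e := ev.(type) {
		case kafka.AssignedPartitions:
			log.Printf("assigned: %v", e.Partitions)
			return consumer.Assign(e.Partitions)
		case kafka.RevokedPartitions:
			log.Printf("revoked: %v", e.Partitions)
			return consumer.Unassign()
		}
		return nil
	})
}
```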
Hello @edenhill, I'm running into a similar issue as the original poster. I'm using a -1 timeout but calling in an infinite loop, e.g. roughly the pattern sketched below.
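A minimal sketch of that loop, assuming the v1 Go client; the logging is illustrative:

```go
package consumerexample

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// readLoop blocks indefinitely on each ReadMessage(-1) call, the pattern
// described above.
func readLoop(c *kafka.Consumer) {
	for {
		msg, err := c.ReadMessage(-1) // -1: wait forever for the next message
		if err != nil {
			// Errors (kafka.Error values) come back on this path rather
			// than as separate events.
			log.Printf("read error: %v", err)
			continue
		}
		log.Printf("message on %v: %s", msg.TopicPartition, string(msg.Value))
	}
}
```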
My producer stopped writing messages for a few minutes and I logged this:
Subsequently my producer was back up, but the consumer seemed to be hanging on ReadMessage(-1) indefinitely. According to the doc you linked to, I'd expect that my consumer did indeed leave the group, but the subsequent call to ReadMessage() should have made the consumer rejoin the group and continue to see new messages. Is this a configuration issue? Would using a shorter timeout for ReadMessage() resolve this? Or is this a manifestation of https://github.com/edenhill/librdkafka/issues/2266?
librdkafka version: 1.0.0
confluent-kafka-go version: v1.0.0
Client configuration: ["auto.offset.reset": "earliest", "enable.auto.commit": false]
Alright, we are seeing this issue as well, and have tried all of the available solutions in this issue. We consistently get this error even if we don't do anything at all with the message after polling and have a tight loop calling Poll many hundreds of times a second… Clearly something is wrong in the underlying library interaction here between go <> librdkafka 🤔
FWIW, we are running on x86_64 machines inside Debian-based containers. We have metrics emitted every time we call Poll, so we know exactly how often we are calling the method, and we also have a histogram tracking how long we wait in the poll. As mentioned above, we are calling hundreds of times a second and are seeing single-digit-ms latency for the call at the p99 quantile. We also have the max.poll.interval.ms value set to 20 minutes just to confirm, without a shadow of a doubt, that we are calling multiple times within the interval.
The rebalance fix above "works" in that we re-subscribe and start collecting messages again, but it causes a ton of thrashing in our cluster, so it's not ideal.
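For reference, the 20-minute setting described above would be configured roughly like this; the broker address and group id are placeholders:

```go
package consumerexample

import "github.com/confluentinc/confluent-kafka-go/kafka"

// newConsumerWithLongPollInterval builds a consumer with max.poll.interval.ms
// raised to 20 minutes, used here only to rule out slow polling as the cause.
func newConsumerWithLongPollInterval() (*kafka.Consumer, error) {
	return kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":    "broker:9092",
		"group.id":             "my.group.id",
		"max.poll.interval.ms": 1200000, // 20 minutes
	})
}
```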
I got this same issue on 1.8.2. As the issue occurred in production, I don't have debug=cgrp enabled there. This is what I got last night with 1.8.2:
%4|1639508773.124|MAXPOLL|rdkafka#consumer-4| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 196ms (adjust max.poll.interval.ms for long-running message processing): leaving group
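If the issue can be reproduced outside production, enabling librdkafka's consumer-group debug context helps with diagnosis. A sketch of the relevant setting; the other values are placeholders:

```go
package consumerexample

import "github.com/confluentinc/confluent-kafka-go/kafka"

// newDebugConsumer enables the "cgrp" debug context so rebalance and
// max-poll activity shows up in the client logs.
func newDebugConsumer() (*kafka.Consumer, error) {
	return kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "broker:9092",
		"group.id":          "my.group.id",
		"debug":             "cgrp", // e.g. "cgrp,consumer" for more detail
	})
}
```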
Using v1.5.2. Also calling ReadMessage(-1) in an infinite loop, and not seeing the consumer rejoin after it leaves the group. Worked around it by setting the timeout to be less than max.poll.interval.ms instead of -1 (a rough sketch of this workaround is included below), but wondering why it's not rejoining as expected.
I am having this issue with librdkafka 1.5.0, exactly as keyan said. Can anyone help?
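A sketch of the workaround mentioned above: bound the ReadMessage timeout below max.poll.interval.ms instead of passing -1. The 60-second value and logging are illustrative:

```go
package consumerexample

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// readLoopBounded uses a finite ReadMessage timeout so the loop keeps cycling
// (and keeps polling) even when no messages arrive for a long time.
func readLoopBounded(c *kafka.Consumer) {
	for {
		msg, err := c.ReadMessage(60 * time.Second) // well below max.poll.interval.ms
		if err != nil {
			if kerr, ok := err.(kafka.Error); ok && kerr.Code() == kafka.ErrTimedOut {
				continue // no message within the timeout; just read again
			}
			log.Printf("read error: %v", err)
			continue
		}
		log.Printf("message on %v: %s", msg.TopicPartition, string(msg.Value))
	}
}
```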