confluent-kafka-dotnet: Consumer starts with incorrect (much lower) offset after rebalance
Description
About a month ago we updated Confluent.Kafka library from 1.5.3 to the latest 1.7.0 and we started to experience offset issues after rebalancing. Those issues are happening randomly on different consumers, topics and partitions.
During application deployment (application start), after rebalance takes part, it happens that the affected partition is assigned to the consumer, however, the consumer starts consuming on offset thousands lower than the last offset committed - it causes many duplicates to be processed.
For example, when we check __consumer_offsets messages after the incident, we can see following messages for one partition [xyz.eshop,xyz.eshop.async-requests,17]:
Last committed offset Wednesday, August 4, 2021 6:10:01.506 PM - 134384
First committed offset after reassignment Wednesday, August 4, 2021 6:10:27.430 PM - 133632
How to reproduce
We did not find a way yet to reproduce it locally, the problem is new and occurs only occasionally during/after rebalance, probably during reassignment. We are not even sure it is connected with the library update itself. But we would appreciate any hints possible.
Checklist
Please provide the following information:
- A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
- Confluent.Kafka nuget version - 1.7.0
- Apache Kafka version - 1.0.2-cp2
- Client configuration
Number of brokers: 3 Number of partitions: 21 Number of consumers in a consumer group: 4 Consumer configuration:
var consumerConfig = new ConsumerConfig
{
BootstrapServers = servers,
ClientId = Environment.MachineName,
GroupId = groupId,
EnableAutoCommit = true,
EnableAutoOffsetStore = false,
AutoOffsetReset = AutoOffsetReset.Latest
};
-
we have no assignment nor revocation handlers set
-
assignment should happen during poll as a side effect
-
Operating system - debian 9.13, 4.9.0-14-amd64
-
Broker log (affected is consumer group xyz.eshop) controller.log kafka-authorizer.log server.log
-
__consumer_offsets log (affected is [xyz.eshop,xyz.eshop.async-requests,17]) outputoffsets_23.txt
-
Critical issue
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 42 (16 by maintainers)
fix: https://github.com/edenhill/librdkafka/pull/3774
Hi, adding to this thread as we have exactly the same problem using both Kafka nuget v1.6.3 and v1.8.2. We were initially using a consumer with auto-committing every 5000ms (in fact the same config settings as @georgievaja) and experiencing the same problem every time we added or removed another consumer for the given topics.
In an effort to try and diagnose this issue we tried a lot of different combinations of consumer config options, and eventually also wrote a consumer that manually commits after every message is processed, and this still has the same issue of changing the offset after a rebalance.
Our setup that exhibits the issue is: Consumers: 1 (initially) Brokers: 2 Kafka nuget versions: 1.6.3 and 1.8.2 Kafka version: 2.7.0
This is a simplified version of our consumer:
And attached are logs for the time periods relevant to this issue. MSK-logs.csv
Happy to provide any extra details as needed to help diagnose this.
It is still happening in version 1.8.2.
Happened to us again today after a couple of months. It only happened with 1 partition (of 20) - partition 19. It jumped from 198539 back to 197375 as you can see.
New facts that I have, are that it fetched the offset correctly at
2022-01-29 18:18:50:[thrd:main]: Partition xxx [19] start fetching at offset 198539Also, an interesting fact is, that number
197375is the number it fetched during the previous rebalance (1 day back at2022-01-28 15:05:16).[thrd:main]: Partition xxx [19] start fetching at offset 197375We have the same problem.
NuGet Version: 1.7.0 and
PartitionAssignmentStrategy.RoundRobinAfter adding some chaos to our deployment, we introduced frequent rebalancing and can clearly see, that we re-process old messages.
Offset 435723353 At 05:13:12.08 my-topic [3] offset 435723353
Offset jumps back to 435680920 At 05:13:15.237 my-topic [3] offset 435680920
The issue is still there in 1.8.2.
@ErenArslan Can you confirm this?
That’s right, the bug here only revolves around the stored offsets (offsets_store() and enable.auto.offset.store=true)
1.8.1 version didn’t fix that issue 😦 . Any suggestion ?
there were significant changes to the consumer rebalance code between 1.5.3 and 1.7.0 (to implement incremental rebalancing), so this is a bit concerning.
thanks for the detail @zhenxuzj - that is useful, there are quite different code paths when the handler is set and when it isn’t.
librdkafka debug logs from the consumer would be very helpful in knowing what is going on (set Debug config to “all”, or “consumer” may be enough if that is too verbose). these will give high visibility in when offsets were committed, and the result of that + relationship to rebalance workflow etc.
cc: @edenhill . are there any related fixes in v.1.8.0?