confluent-kafka-dotnet: Consumer starts with incorrect (much lower) offset after rebalance

Description

About a month ago we updated the Confluent.Kafka library from 1.5.3 to the latest 1.7.0, and we have been experiencing offset issues after rebalancing ever since. The issues happen randomly across different consumers, topics and partitions.

During application deployment (application start), after a rebalance takes place, the affected partition is assigned to a consumer, but the consumer starts consuming at an offset thousands lower than the last committed offset, causing many duplicates to be processed. For example, when we check the __consumer_offsets messages after an incident, we see the following for one partition [xyz.eshop,xyz.eshop.async-requests,17]:

Last committed offset before reassignment: Wednesday, August 4, 2021 6:10:01.506 PM - 134384

First committed offset after reassignment: Wednesday, August 4, 2021 6:10:27.430 PM - 133632

How to reproduce

We have not yet found a way to reproduce it locally; the problem is new and occurs only occasionally during/after a rebalance, probably during reassignment. We are not even sure it is related to the library update itself, but we would appreciate any hints.

Checklist

Please provide the following information:

  • A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • Confluent.Kafka nuget version - 1.7.0
  • Apache Kafka version - 1.0.2-cp2
  • Client configuration

Number of brokers: 3
Number of partitions: 21
Number of consumers in the consumer group: 4

Consumer configuration:

var consumerConfig = new ConsumerConfig
{
    BootstrapServers = servers,
    ClientId = Environment.MachineName,
    GroupId = groupId,
    EnableAutoCommit = true,
    EnableAutoOffsetStore = false,
    AutoOffsetReset = AutoOffsetReset.Latest
};
  • We have no assignment or revocation handlers set.

  • Assignment should happen as a side effect of polling.

  • Operating system - Debian 9.13, kernel 4.9.0-14-amd64

  • Broker log (affected is consumer group xyz.eshop) controller.log kafka-authorizer.log server.log

  • __consumer_offsets log (affected is [xyz.eshop,xyz.eshop.async-requests,17]) outputoffsets_23.txt

  • Critical issue
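
For reference, with EnableAutoCommit = true and EnableAutoOffsetStore = false (as above), the application is expected to call StoreOffset after each message is processed, and the client's background timer then commits the stored offsets. A minimal sketch of such a loop (the topic name and processing step are hypothetical, not taken from our code):

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

static class StoreOffsetLoop
{
    public static void Run(ConsumerConfig consumerConfig, CancellationToken ct)
    {
        using var consumer = new ConsumerBuilder<string, string>(consumerConfig).Build();
        consumer.Subscribe("xyz.eshop.async-requests"); // hypothetical topic

        try
        {
            while (!ct.IsCancellationRequested)
            {
                var cr = consumer.Consume(ct);

                Process(cr); // application-specific work (placeholder)

                // With EnableAutoOffsetStore = false, marking the offset as
                // processed is the application's job; the auto-commit timer
                // (auto.commit.interval.ms, default 5000 ms) then commits it.
                consumer.StoreOffset(cr);
            }
        }
        finally
        {
            consumer.Close(); // leave the group cleanly
        }
    }

    static void Process(ConsumeResult<string, string> cr) { /* ... */ }
}
```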

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 42 (16 by maintainers)

Most upvoted comments

Hi, adding to this thread as we have exactly the same problem using both Confluent.Kafka NuGet v1.6.3 and v1.8.2. We were initially using a consumer that auto-commits every 5000 ms (in fact the same config settings as @georgievaja) and experienced the same problem every time we added or removed a consumer for the given topics.

In an effort to diagnose this issue we tried many different combinations of consumer config options, and eventually also wrote a consumer that manually commits after every message is processed; it still has the same issue of the offset changing after a rebalance.

Our setup that exhibits the issue is:

Consumers: 1 (initially)
Brokers: 2
Confluent.Kafka NuGet versions: 1.6.3 and 1.8.2
Kafka version: 2.7.0

This is a simplified version of our consumer:

void Consume(KafkaOptions options)
{
    var config = new ConsumerConfig()
    {
        BootstrapServers = options.BootstrapServers,
        GroupId = options.GroupId,
        PartitionAssignmentStrategy = PartitionAssignmentStrategy.CooperativeSticky,
        EnableAutoCommit = false,
        EnableAutoOffsetStore = false,
        AutoOffsetReset = AutoOffsetReset.Latest,
        AllowAutoCreateTopics = false,
    };

    var consumer = new ConsumerBuilder<string, string>(config).Build();
    
    consumer.Subscribe(topicList);

    var token = new CancellationToken();

    while (!token.IsCancellationRequested)
    {
        var cr = consumer.Consume(token);

        //In this case ProcessMessage is simply creating a different kind of message and publishing it again
        ProcessMessage(cr);

        consumer.StoreOffset(cr);
        consumer.Commit(cr);
        Logger.LogInformation(
            "Stored and committed Topic: {Topic}, Partition: {Partition}, Offset: {Offset}", cr.Topic,
            cr.Partition, cr.Offset);
    }
    
    consumer.Close();
    consumer.Dispose();
}

And attached are logs for the time periods relevant to this issue. MSK-logs.csv

  • Between 9:18:38AM and 9:19:00AM the messages were originally created by an external producer and published, then processed successfully and the offsets committed. All these messages were processed by instance 1.
  • I’ve cut the logs between 9:19:00AM and 9:28:05AM as they contain no useful data.
  • At ~9:28:05AM instance 2 containing the same consumer setup and settings was spun up, causing a rebalance.
  • At ~9:28:17AM instance 1 started re-processing and committing the already-processed and already-committed messages.
  • This continues for the remainder of the minute, and the same offsets appear in the log more than once.
  • At no point does instance 2 appear to re-process any of the already-processed messages, although it does process some that are created as a direct result of instance 1 re-processing the older messages (which produces messages in a topic also handled by these consumers).

Happy to provide any extra details as needed to help diagnose this.

It is still happening in version 1.8.2.

Happened to us again today after a couple of months. It only happened with 1 partition (of 20) - partition 19. It jumped from offset 198539 back to 197375, as you can see.

Screenshot (6)

Screenshot (7)

A new fact I have: it fetched the offset correctly at 2022-01-29 18:18:50: [thrd:main]: Partition xxx [19] start fetching at offset 198539

Also interesting: 197375 is the offset it fetched during the previous rebalance (one day earlier, at 2022-01-28 15:05:16).

[thrd:main]: Partition xxx [19] start fetching at offset 197375

We have the same problem.

NuGet version: 1.7.0, with PartitionAssignmentStrategy.RoundRobin

After adding some chaos to our deployment, we introduced frequent rebalancing and can clearly see that we re-process old messages.

image

At 05:13:12.08, offset 435723353: my-topic [3] offset 435723353

At 05:13:15.237, the offset jumps back to 435680920: my-topic [3] offset 435680920

Nov 15, 2021 @ 05:13:05.971	[thrd:main]: Topic my-topic [3]: stored offset -1001, committed offset 435723353: not including in commit
Nov 15, 2021 @ 05:13:05.971	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:07.797	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Fetch topic my-topic [3] at offset 435723353 (v58)
Nov 15, 2021 @ 05:13:09.797	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Fetch topic my-topic [3] at offset 435723353 (v58)
Nov 15, 2021 @ 05:13:10.973	[thrd:main]: Topic my-topic [3]: stored offset -1001, committed offset 435723353: not including in commit
Nov 15, 2021 @ 05:13:10.973	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:11.833	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Fetch topic my-topic [3] at offset 435723353 (v58)
Nov 15, 2021 @ 05:13:12.013	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_pause_resume:2434: new version barrier v59
Nov 15, 2021 @ 05:13:12.013	[thrd:main]: Pause my-topic [3] (v59)
Nov 15, 2021 @ 05:13:12.014	[thrd:main]: my-topic [3] received op PAUSE (v59) in fetch-state active (opv58)
Nov 15, 2021 @ 05:13:12.014	[thrd:main]: Pause my-topic [3]: at offset 435723353 (state active, v59)
Nov 15, 2021 @ 05:13:12.016	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:12.018	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_fetch_stop:2375: new version barrier v60
Nov 15, 2021 @ 05:13:12.018	[thrd:main]: Stop consuming my-topic [3] (v60)
Nov 15, 2021 @ 05:13:12.018	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_pause_resume:2434: new version barrier v61
Nov 15, 2021 @ 05:13:12.018	[thrd:main]: Resume my-topic [3] (v61)
Nov 15, 2021 @ 05:13:12.018	[thrd:main]: Removing (un)desired topic my-topic [3]
Nov 15, 2021 @ 05:13:12.018	[thrd:main]: Removing my-topic [3] from assignment (started=true, pending=false, queried=false, stored offset=INVALID)
Nov 15, 2021 @ 05:13:12.022	[thrd:main]: my-topic [3] received op FETCH_STOP (v60) in fetch-state active (opv59)
Nov 15, 2021 @ 05:13:12.022	[thrd:main]: Stopping fetch for my-topic [3] in state active (v60)
Nov 15, 2021 @ 05:13:12.022	[thrd:main]: Partition my-topic [3] changed fetch state active -> stopping
Nov 15, 2021 @ 05:13:12.022	[thrd:main]: my-topic [3]: offset store terminating
Nov 15, 2021 @ 05:13:12.022	[thrd:main]: Partition my-topic [3] changed fetch state stopping -> stopped
Nov 15, 2021 @ 05:13:12.022	[thrd:main]: my-topic [3] received op PAUSE (v61) in fetch-state stopped (opv60)
Nov 15, 2021 @ 05:13:12.022	[thrd:main]: Not resuming stopped my-topic [3]: at offset 435723353 (state stopped, v61)
Nov 15, 2021 @ 05:13:12.026	[thrd:main]: Group "test" received op PARTITION_LEAVE in state up (join-state wait-unassign-to-complete) for my-topic [3]
Nov 15, 2021 @ 05:13:12.026	[thrd:main]: Group "test": delete my-topic [3]
Nov 15, 2021 @ 05:13:12.057	[thrd:main]:   my-topic [3]
Nov 15, 2021 @ 05:13:12.074	[thrd:main]:  my-topic [3] offset INVALID
Nov 15, 2021 @ 05:13:12.074	[thrd:main]:  my-topic [3] offset INVALID
Nov 15, 2021 @ 05:13:12.076	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_pause_resume:2434: new version barrier v38
Nov 15, 2021 @ 05:13:12.076	[thrd:main]: Resume my-topic [3] (v38)
Nov 15, 2021 @ 05:13:12.076	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:12.076	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:12.077	[thrd:main]: Querying committed offset for pending assigned partition my-topic [3]
Nov 15, 2021 @ 05:13:12.077	[thrd:main]: my-topic [3] received op PAUSE (v38) in fetch-state stopped (opv37)
Nov 15, 2021 @ 05:13:12.077	[thrd:main]: Not resuming my-topic [3]: partition is not paused by library
Nov 15, 2021 @ 05:13:12.086	[thrd:main]: GroupCoordinator/0: OffsetFetchResponse: my-topic [3] offset 435723353, metadata 0 byte(s): NO_ERROR
Nov 15, 2021 @ 05:13:12.086	[thrd:main]: Adding my-topic [3] back to pending list with offset 435723353
Nov 15, 2021 @ 05:13:12.086	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:12.086	[thrd:main]:  my-topic [3] offset 435723353
Nov 15, 2021 @ 05:13:12.086	[thrd:main]: Starting pending assigned partition my-topic [3] at offset 435723353
Nov 15, 2021 @ 05:13:12.086	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_pause_resume:2434: new version barrier v39
Nov 15, 2021 @ 05:13:12.086	[thrd:main]: Resume my-topic [3] (v39)
Nov 15, 2021 @ 05:13:12.086	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_fetch_start:2348: new version barrier v40
Nov 15, 2021 @ 05:13:12.087	[thrd:main]: Start consuming my-topic [3] at offset 435723353 (v40)
Nov 15, 2021 @ 05:13:12.087	[thrd:main]: my-topic [3] received op PAUSE (v39) in fetch-state stopped (opv38)
Nov 15, 2021 @ 05:13:12.087	[thrd:main]: Not resuming my-topic [3]: partition is not paused by library
Nov 15, 2021 @ 05:13:12.087	[thrd:main]: my-topic [3] received op FETCH_START (v40) in fetch-state stopped (opv39)
Nov 15, 2021 @ 05:13:12.087	[thrd:main]: Start fetch for my-topic [3] in state stopped at offset 435723353 (v40)
Nov 15, 2021 @ 05:13:12.087	[thrd:main]: Partition my-topic [3] changed fetch state stopped -> active
Nov 15, 2021 @ 05:13:12.087	[thrd:main]: Partition my-topic [3] start fetching at offset 435723353
Nov 15, 2021 @ 05:13:12.089	[thrd:main]: Group "test" received op PARTITION_JOIN in state up (join-state steady) for my-topic [3]
Nov 15, 2021 @ 05:13:12.089	[thrd:main]: Group "test": add my-topic [3]
Nov 15, 2021 @ 05:13:12.090	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Topic my-topic [3] in state stopped at offset 435723353 (1/100000 msgs, 0/65536 kb queued, opv 58) is not fetchable: not in active fetch state
Nov 15, 2021 @ 05:13:12.090	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Removed my-topic [3] from fetch list (12 entries, opv 58): not in active fetch state
Nov 15, 2021 @ 05:13:12.091	[thrd:sasl_ssl://my-brokern]: Topic my-topic [3]: fetch decide: updating to version 40 (was 34) at offset 435723353 (was 435681193)
Nov 15, 2021 @ 05:13:12.091	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Topic my-topic [3] in state active at offset 435723353 (0/100000 msgs, 0/65536 kb queued, opv 40) is fetchable
Nov 15, 2021 @ 05:13:12.091	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Added my-topic [3] to fetch list (4 entries, opv 40, 0 messages queued): fetchable
Nov 15, 2021 @ 05:13:12.091	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Fetch topic my-topic [3] at offset 435723353 (v40)
Nov 15, 2021 @ 05:13:12.107	[thrd:main]: NEW my-topic [3] 0x7f8eef0d0be0 refcnt 0x7f8eef0d0c70 (at rd_kafka_topic_partition_cnt_update:798)
Nov 15, 2021 @ 05:13:12.108	[thrd:main]: my-topic [3]: delegate to broker sasl_ssl://my-broker:9093/0 (rktp 0x7f8eef0d0be0, term 0, ref 2)
Nov 15, 2021 @ 05:13:12.109	[thrd:main]: my-topic [3]: delegating to broker sasl_ssl://my-broker:9093/0 for partition with 0 messages (0 bytes) queued
Nov 15, 2021 @ 05:13:12.109	[thrd:main]: Migrating topic my-topic [3] 0x7f8eef0d0be0 from (none) to sasl_ssl://my-broker:9093/0 (sending PARTITION_JOIN to sasl_ssl://my-broker:9093/0)
Nov 15, 2021 @ 05:13:12.117	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Topic my-topic [3]: joining broker (rktp 0x7f8eef0d0be0, 0 message(s) queued)
Nov 15, 2021 @ 05:13:13.113	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Fetch topic my-topic [3] at offset 435723353 (v40)
Nov 15, 2021 @ 05:13:13.463	[thrd:main]: Topic my-topic [3]: stored offset 435680920, committed offset 435723353: not including in commit
Nov 15, 2021 @ 05:13:13.463	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:13.469	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_pause_resume:2434: new version barrier v41
Nov 15, 2021 @ 05:13:13.469	[thrd:main]: Pause my-topic [3] (v41)
Nov 15, 2021 @ 05:13:13.470	[thrd:main]: my-topic [3] received op PAUSE (v41) in fetch-state active (opv40)
Nov 15, 2021 @ 05:13:13.470	[thrd:main]: Pause my-topic [3]: at offset 435723353 (state active, v41)
Nov 15, 2021 @ 05:13:13.472	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:13.473	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_fetch_stop:2375: new version barrier v42
Nov 15, 2021 @ 05:13:13.473	[thrd:main]: Stop consuming my-topic [3] (v42)
Nov 15, 2021 @ 05:13:13.473	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_pause_resume:2434: new version barrier v43
Nov 15, 2021 @ 05:13:13.473	[thrd:main]: Resume my-topic [3] (v43)
Nov 15, 2021 @ 05:13:13.473	[thrd:main]: Removing (un)desired topic my-topic [3]
Nov 15, 2021 @ 05:13:13.473	[thrd:main]: Removing my-topic [3] from assignment (started=true, pending=false, queried=false, stored offset=435680920)
Nov 15, 2021 @ 05:13:13.475	[thrd:main]: my-topic [3] received op FETCH_STOP (v42) in fetch-state active (opv41)
Nov 15, 2021 @ 05:13:13.475	[thrd:main]: Stopping fetch for my-topic [3] in state active (v42)
Nov 15, 2021 @ 05:13:13.475	[thrd:main]: Partition my-topic [3] changed fetch state active -> stopping
Nov 15, 2021 @ 05:13:13.475	[thrd:main]: my-topic [3]: offset store terminating
Nov 15, 2021 @ 05:13:13.475	[thrd:main]: Partition my-topic [3] changed fetch state stopping -> stopped
Nov 15, 2021 @ 05:13:13.476	[thrd:main]: my-topic [3] received op PAUSE (v43) in fetch-state stopped (opv42)
Nov 15, 2021 @ 05:13:13.476	[thrd:main]: Not resuming stopped my-topic [3]: at offset 435723353 (state stopped, v43)
Nov 15, 2021 @ 05:13:13.478	[thrd:main]: Group "test" received op PARTITION_LEAVE in state up (join-state wait-unassign-to-complete) for my-topic [3]
Nov 15, 2021 @ 05:13:13.479	[thrd:main]: Group "test": delete my-topic [3]
Nov 15, 2021 @ 05:13:14.098	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Topic my-topic [3] in state stopped at offset 435723353 (1/100000 msgs, 0/65536 kb queued, opv 40) is not fetchable: not in active fetch state
Nov 15, 2021 @ 05:13:14.098	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Removed my-topic [3] from fetch list (6 entries, opv 40): not in active fetch state
Nov 15, 2021 @ 05:13:15.183	[thrd:main]:   my-topic [3]
Nov 15, 2021 @ 05:13:15.200	[thrd:main]:  my-topic [3] offset INVALID
Nov 15, 2021 @ 05:13:15.201	[thrd:main]:  my-topic [3] offset INVALID
Nov 15, 2021 @ 05:13:15.225	[thrd:main]: NEW my-topic [3] 0x7f9b81c1e020 refcnt 0x7f9b81c1e0b0 (at rd_kafka_topic_partition_cnt_update:798)
Nov 15, 2021 @ 05:13:15.225	[thrd:main]: my-topic [3]: delegate to broker sasl_ssl://my-broker:9093/0 (rktp 0x7f9b81c1e020, term 0, ref 2)
Nov 15, 2021 @ 05:13:15.225	[thrd:main]: my-topic [3]: delegating to broker sasl_ssl://my-broker:9093/0 for partition with 0 messages (0 bytes) queued
Nov 15, 2021 @ 05:13:15.225	[thrd:main]: Migrating topic my-topic [3] 0x7f9b81c1e020 from (none) to sasl_ssl://my-broker:9093/0 (sending PARTITION_JOIN to sasl_ssl://my-broker:9093/0)
Nov 15, 2021 @ 05:13:15.225	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Topic my-topic [3]: joining broker (rktp 0x7f9b81c1e020, 0 message(s) queued)
Nov 15, 2021 @ 05:13:15.225	[thrd:main]: my-topic [3]: marking as DESIRED
Nov 15, 2021 @ 05:13:15.228	[thrd:main]: NEW my-topic [3] 0x7f394bfae010 refcnt 0x7f394bfae0a0 (at rd_kafka_topic_partition_cnt_update:798)
Nov 15, 2021 @ 05:13:15.228	[thrd:main]: my-topic [3]: delegate to broker sasl_ssl://my-broker:9093/0 (rktp 0x7f394bfae010, term 0, ref 2)
Nov 15, 2021 @ 05:13:15.228	[thrd:main]: my-topic [3]: delegating to broker sasl_ssl://my-broker:9093/0 for partition with 0 messages (0 bytes) queued
Nov 15, 2021 @ 05:13:15.228	[thrd:main]: Migrating topic my-topic [3] 0x7f394bfae010 from (none) to sasl_ssl://my-broker:9093/0 (sending PARTITION_JOIN to sasl_ssl://my-broker:9093/0)
Nov 15, 2021 @ 05:13:15.228	[thrd:sasl_ssl://my-brokern]: sasl_ssl://my-broker:9093/0: Topic my-topic [3]: joining broker (rktp 0x7f394bfae010, 0 message(s) queued)
Nov 15, 2021 @ 05:13:15.231	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_pause_resume:2434: new version barrier v2
Nov 15, 2021 @ 05:13:15.231	[thrd:main]: Resume my-topic [3] (v2)
Nov 15, 2021 @ 05:13:15.231	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:15.231	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:15.231	[thrd:main]: Not resuming my-topic [3]: partition is not paused by library
Nov 15, 2021 @ 05:13:15.231	[thrd:main]: Querying committed offset for pending assigned partition my-topic [3]
Nov 15, 2021 @ 05:13:15.231	[thrd:main]: my-topic [3] received op PAUSE (v2) in fetch-state none (opv1)
Nov 15, 2021 @ 05:13:15.237	[thrd:main]: GroupCoordinator/0: OffsetFetchResponse: my-topic [3] offset 435680920, metadata 0 byte(s): NO_ERROR
Nov 15, 2021 @ 05:13:15.237	[thrd:main]: Adding my-topic [3] back to pending list with offset 435680920
Nov 15, 2021 @ 05:13:15.237	[thrd:main]:  my-topic [3] offset STORED
Nov 15, 2021 @ 05:13:15.237	[thrd:main]:  my-topic [3] offset 435680920
Nov 15, 2021 @ 05:13:15.237	[thrd:main]: Starting pending assigned partition my-topic [3] at offset 435680920
Nov 15, 2021 @ 05:13:15.237	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_pause_resume:2434: new version barrier v3
Nov 15, 2021 @ 05:13:15.237	[thrd:main]: Resume my-topic [3] (v3)
Nov 15, 2021 @ 05:13:15.237	[thrd:main]: my-topic [3]: rd_kafka_toppar_op_fetch_start:2348: new version barrier v4
Nov 15, 2021 @ 05:13:15.237	[thrd:main]: Start consuming my-topic [3] at offset 435680920 (v4)
Nov 15, 2021 @ 05:13:15.238	[thrd:main]: my-topic [3] received op PAUSE (v3) in fetch-state none (opv2)
Nov 15, 2021 @ 05:13:15.238	[thrd:main]: Not resuming my-topic [3]: partition is not paused by library
Nov 15, 2021 @ 05:13:15.238	[thrd:main]: my-topic [3] received op FETCH_START (v4) in fetch-state none (opv3)
Nov 15, 2021 @ 05:13:15.238	[thrd:main]: Start fetch for my-topic [3] in state none at offset 435680920 (v4)
Nov 15, 2021 @ 05:13:15.238	[thrd:main]: Partition my-topic [3] changed fetch state none -> active
Nov 15, 2021 @ 05:13:15.238	[thrd:main]: Partition my-topic [3] start fetching at offset 435680920
Nov 15, 2021 @ 05:13:15.238	[thrd:main]: Group "test" received op PARTITION_JOIN in state up (join-state steady) for my-topic [3]
Nov 15, 2021 @ 05:13:15.238	[thrd:main]: Group "test": add my-topic [3]

The issue is still there in 1.8.2.

@ErenArslan Can you confirm this?

That’s right, the bug here only revolves around the stored offsets (offsets_store() and enable.auto.offset.store=true)
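
For readers following along, the two offset mechanisms being distinguished here can be sketched as follows (a hypothetical fragment; per the comment above, only the stored-offset path is affected):

```csharp
using Confluent.Kafka;

static class OffsetPaths
{
    static void Handle(IConsumer<string, string> consumer,
                       ConsumeResult<string, string> cr)
    {
        // Stored offset: librdkafka's offsets_store(). In-memory only;
        // it reaches the broker later via the auto-commit timer when
        // enable.auto.commit=true. This is the path the bug affects.
        consumer.StoreOffset(cr);

        // Committed offset: a synchronous commit of cr.Offset + 1 sent
        // directly to the group coordinator, independent of the store.
        consumer.Commit(cr);
    }
}
```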

Version 1.8.1 didn’t fix the issue 😦. Any suggestions?

There were significant changes to the consumer rebalance code between 1.5.3 and 1.7.0 (to implement incremental rebalancing), so this is a bit concerning.

Thanks for the detail @zhenxuzj - that is useful; there are quite different code paths when the handler is set and when it isn’t.

librdkafka debug logs from the consumer would be very helpful in understanding what is going on (set the Debug config property to “all”, or “consumer” may be enough if that is too verbose). These give high visibility into when offsets were committed, the result of each commit, and its relationship to the rebalance workflow.
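
Concretely, turning these logs on is a config property plus a log handler; a sketch (the handler body is illustrative):

```csharp
using System;
using Confluent.Kafka;

static class DebugLogging
{
    static IConsumer<string, string> Build(string servers, string groupId)
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = servers,
            GroupId = groupId,
            // "all" is very verbose; "consumer" (or e.g. "consumer,cgrp,fetch")
            // focuses on the commit/rebalance/fetch activity relevant here.
            Debug = "consumer",
        };

        return new ConsumerBuilder<string, string>(config)
            // librdkafka debug output arrives through the log handler.
            .SetLogHandler((_, log) =>
                Console.WriteLine($"{log.Level} {log.Facility}: {log.Message}"))
            .Build();
    }
}
```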

cc @edenhill - are there any related fixes in v1.8.0?

Auto commit occurs at poll in the Java client, but in librdkafka the auto-commit timer runs independently of polling.
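
One mitigation that follows from this timing difference (my own suggestion, not something prescribed earlier in the thread) is to commit synchronously from a partitions-revoked handler, so the latest stored offsets reach the broker before the partitions are reassigned:

```csharp
using System;
using Confluent.Kafka;

static class RevokeCommit
{
    static IConsumer<string, string> Build(ConsumerConfig config)
    {
        return new ConsumerBuilder<string, string>(config)
            .SetPartitionsRevokedHandler((consumer, revoked) =>
            {
                try
                {
                    // Flush stored-but-uncommitted offsets before the
                    // revoked partitions move to another consumer.
                    consumer.Commit();
                }
                catch (KafkaException e)
                {
                    Console.WriteLine($"Commit on revoke failed: {e.Error.Reason}");
                }
            })
            .Build();
    }
}
```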