confluent-kafka-dotnet: Slow manual commits
Description
We need to expose manual sync commits to our clients, but we get really poor performance out of it when compared to async commits + callbacks.
Attached are client logs with “debug: all” for the sync and async versions: sync_commit.txt, async_commit.txt
How to reproduce
The commit request itself appears to be quick, but it can take a while (~1 sec) between the request being enqueued and it being sent to the broker. I can also see quite a few FetchRequests in between commits, even though the flow of our consumer is something like:
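    private Message<string, byte[]> ConsumeMessageSync()
    {
        // Poll for up to 100 ms. The bool returned by Consume (whether a
        // message arrived) is ignored here, so the result may be unset
        // when the poll times out.
        Message<string, byte[]> kafkaMessage;
        _consumer.Consume(out kafkaMessage, 100);
        return kafkaMessage;
    }

    var msg = ConsumeMessageSync();
    var clientReadyMsg = process(msg);
    emitMessageToClient(clientReadyMsg);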
The client then subscribes and commits after each emission.
What I don’t understand is why fetch requests are issued to the broker after the commit request is enqueued and while we wait for the commit result to come back. I played around with fetch.wait.max.ms, but that just changes the number of fetch requests that get sent in between.
Additionally there are some weird PROTOERR level messages like this:
7|2018-02-06 12:01:01.765|rdkafka#consumer-1|PROTOERR| [thrd:lonrs08346.my-domain.net:2182/bootstrap]: lonrs08346.my-domain.net:2182/2: Protocol parse failure at 1048332/1048648 (rd_kafka_msgset_reader_msg_v0_1:464) (incorrect broker.version.fallback?)
Probably not related but worth pointing out. Is there something I am missing? Thanks in advance!
Checklist
Please provide the following information:
- Confluent.Kafka nuget version: 0.11.3
- Apache Kafka version: 0.10.0.1
- Client configuration:
{"enable.auto.commit", "false"}, {"auto.offset.reset", "earliest"}
- Operating system: Win7x64
- Provide logs (with “debug” : “…” as necessary in configuration)
- Provide broker log excerpts
- Critical issue
About this issue
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 40 (20 by maintainers)
Thank you all for your patience.
I’ve now identified the issue: https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_broker.c#L3187
When committing to a broker that we’re not fetching messages from there is a high probability that queued ops (such as a Commit) will be delayed up to 1000ms before being sent, regardless of socket.blocking.max.ms.
I have a fix in place which I’ll test and then commit to master.
There is no workaround.
librdkafka issue: https://github.com/edenhill/librdkafka/issues/1787
@mhowlett I would suggest making it just a synchronous call, because that’s what it is. I would leave it up to the consumers of this library whether or not they want to wrap it in a Task.Run() or offload it onto another thread. In order to avoid making this a breaking change, you could implement a CommitAsync(this Consumer consumer, ...) extension method that wraps the synchronous Consumer.Commit() in a Task.Run(). Of course, since it’s a major release, it’s OK to make a breaking change.
@GarrettDavis - If you have a scenario in which awaiting the CommitAsync call is useful, I’d be interested in knowing more details. I started to regret not just making this a synchronous method, since I believe this reflects how most people would want to use it, and behind the scenes it just wraps a synchronous call in a thread pool thread. We’re considering switching this in the 1.0 release.
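For readers following the Task.Run() suggestion in this thread, here is a minimal sketch of what such an extension method might look like. It does not use the real Confluent.Kafka types: ISyncCommitter and FakeCommitter are hypothetical stand-ins for any object with a blocking Commit(), introduced only so the example compiles on its own.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a consumer exposing a blocking Commit().
// This is NOT the Confluent.Kafka API; it only illustrates the pattern.
public interface ISyncCommitter
{
    void Commit();
}

public static class CommitterExtensions
{
    // Offloads the blocking Commit() call to a thread pool thread so that
    // existing callers can keep awaiting a Task.
    public static Task CommitAsync(this ISyncCommitter committer)
    {
        if (committer == null) throw new ArgumentNullException(nameof(committer));
        return Task.Run(() => committer.Commit());
    }
}

// Fake committer used only to make the sketch runnable.
public sealed class FakeCommitter : ISyncCommitter
{
    public int CommitCount { get; private set; }
    public void Commit() => CommitCount++;
}

public static class Program
{
    public static async Task Main()
    {
        var committer = new FakeCommitter();
        committer.Commit();            // synchronous path: blocks the caller
        await committer.CommitAsync(); // wrapped path: awaits a thread pool task
        Console.WriteLine(committer.CommitCount);
    }
}
```

The wrapper does not make the commit itself any faster; it only moves the blocking wait off the calling thread, which is why leaving that choice to callers was proposed.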
sorry we haven’t got to debugging this yet. I think magnus was speculating that some other changes he wants to do may fix this as a side effect.
I have not yet looked tbh. I mentioned this bug to @edenhill again yesterday and he commented that he thought it’s likely related to and would get automatically fixed by some upcoming work happening in librdkafka. he may like to comment further here.
currently a bit swamped, haven’t forgotten about this, somewhere near the top of the priority list …
I also just checked to see if a more recent build of librdkafka (referenced by Confluent.Kafka 0.11.3-ci-280) resolves the issue, no luck. I expect to get a chance to delve into debugging librdkafka later this week or early next week.
yep, resolved by: https://github.com/edenhill/librdkafka/issues/1930 also, it was a windows only issue.
please open a new issue and paste code - thanks!
@edenhill you can use the test program posted by @mhowlett up there, the key to reproduce it was running it on windows. When is the next release out?
@mhowlett Thanks for your efforts thus far, please update us when a possible fix is available 😃
Does that mean you found the cause of the issue?
sorry for the delay. if we can’t get this into the 0.11.4 release, i’ll make sure we do a CI build with it in soon after.
@mhowlett looks good
Thanks for that bit of information, I’ll try to reproduce it and see if we can find a proper fix.