sarama: AsyncProducer `Close` blocks indefinitely
Versions
Please specify real version numbers or git SHAs, not just “Latest” since that changes fairly regularly.
| Sarama | Kafka | Go |
|---|---|---|
| 1.27.2 | 3.0.0 | 1.7.5 |
Configuration
What configuration values are you using for Sarama and Kafka?
config.Metadata.Retry.Max = 3
config.Metadata.Retry.Backoff = 250 * time.Millisecond
config.Metadata.Timeout = 1 * time.Minute
// Admin.Retry takes effect on `ClusterAdmin`-related operations;
// only `CreateTopic` is used by cdc for now. Just use the default values.
config.Admin.Retry.Max = 5
config.Admin.Retry.Backoff = 100 * time.Millisecond
config.Admin.Timeout = 3 * time.Second
config.Producer.Retry.Max = 3
config.Producer.Retry.Backoff = 100 * time.Millisecond
config.Producer.Partitioner = sarama.NewManualPartitioner
config.Producer.MaxMessageBytes = c.MaxMessageBytes
config.Producer.Return.Successes = true
config.Producer.Return.Errors = true
config.Producer.RequiredAcks = sarama.WaitForAll
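For context, a minimal sketch (not from the original report; the broker address is a placeholder and only part of the config above is repeated) of how such a config is typically passed to an `AsyncProducer`:

```go
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama" // import path used by Sarama 1.x
)

func main() {
	config := sarama.NewConfig()
	config.Producer.Retry.Max = 3
	config.Producer.Retry.Backoff = 100 * time.Millisecond
	config.Producer.Return.Successes = true
	config.Producer.Return.Errors = true
	config.Producer.RequiredAcks = sarama.WaitForAll
	// ... remaining settings from the block above ...

	// "127.0.0.1:9092" is a placeholder broker address.
	producer, err := sarama.NewAsyncProducer([]string{"127.0.0.1:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()
	// Feed producer.Input() and drain producer.Successes()/producer.Errors() elsewhere.
}
```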
Logs
We are testing sarama in an extremely rare scenario:
- The producer can send requests to the single-broker Kafka cluster but cannot get a response: `kill -s STOP` the Kafka broker process, which keeps the TCP connection alive but prevents the broker from responding.
- Close the producer by calling `asyncProducer.Close()`.
We do not have logs from Sarama, but we grabbed some goroutine stacks, like the one below; it appears to be blocked trying to receive a response from the broker.

Our goal is that closing the producer should not block for a long time; it should return as soon as possible.

But in reality, as shown in the picture above, 33 messages failed to deliver after 38 minutes, and that was only after the broker process was resumed with `kill -s CONT`.
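For illustration (not part of the original report), a minimal sketch of bounding the wait on `Close` from the caller's side, assuming the `asyncProducer` from above and the standard `log`/`time` packages; `closeTimeout` is an arbitrary value, and the blocked goroutine still leaks until `Close` eventually returns:

```go
// Bound how long we are willing to wait for Close to come back.
closeTimeout := 10 * time.Second
closed := make(chan error, 1)
go func() {
	closed <- asyncProducer.Close()
}()
select {
case err := <-closed:
	log.Printf("producer closed: %v", err)
case <-time.After(closeTimeout):
	log.Printf("producer Close() still blocked after %s, giving up", closeTimeout)
}
```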
Problem Description
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 7
- Comments: 15 (2 by maintainers)
I believe there are two issues discussed here with similar outcomes.
Default configuration of the AsyncProducer can lead to long delays
By default the `AsyncProducer` has retries and backoff enabled, which can lead to long delays before failing. Such long delays are often a combination of multiple records being buffered and the current in-flight request being "stuck" reading from a broker. The default timeout when reading from a TCP socket is 30 seconds (`config.Net.ReadTimeout`), so it might take up to 30 seconds to notice that the connection to a given broker is broken (assuming the connection was not properly closed on both ends). Then you might see a 100ms backoff per record (if you have 600 pending records going to a given partition, you might be waiting up to 1 minute) before you can replay all those records. Also, if the target broker is no longer reachable, you might hit another 30-second delay (`config.Net.DialTimeout`) before triggering another retry (up to 3 by default).
So depending on how the `AsyncProducer` is configured and the type of network error, it might look like the producer is stuck, but it is actually mostly idle because of the retry logic. See #1359 for yet another example of how it can take up to 4 minutes to fail trying to connect to a 2-broker cluster.
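As an illustration of the knobs mentioned above (the values are arbitrary examples, not recommendations), the worst-case delay can be shrunk by tightening the network timeouts and the retry settings:

```go
config := sarama.NewConfig()
// Notice broken connections faster than the 30s defaults.
config.Net.DialTimeout = 5 * time.Second
config.Net.ReadTimeout = 5 * time.Second
config.Net.WriteTimeout = 5 * time.Second
// Fewer retries and a shorter backoff reduce the time spent replaying records.
config.Producer.Retry.Max = 1
config.Producer.Retry.Backoff = 50 * time.Millisecond
```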
Unfortunately, because of how the pipeline logic works, I don't think the `shutdown` message (created when closing the `AsyncProducer`) is handled until all the queued records are processed. That is, existing retries need to finish (which is often what takes the time), but "new" retries will be cancelled. To fail faster, you could disable retries and handle them yourself:
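The snippet that originally followed this sentence is not included above; as a hedged sketch of what "disable retries and handle them yourself" could look like (`isRetriable` is a hypothetical helper, and re-publishing via `Input()` is just one possible policy):

```go
config.Producer.Retry.Max = 0 // fail records immediately instead of retrying internally
config.Producer.Return.Errors = true

go func() {
	for perr := range producer.Errors() {
		log.Printf("delivery failed: %v", perr.Err)
		// Decide ourselves whether and how to retry the record.
		if isRetriable(perr.Err) { // hypothetical classification helper
			producer.Input() <- perr.Msg
		}
	}
}()
```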
Deadlock on retries (specific to Sarama 1.31.0)
The other issue is indeed a deadlock when a `brokerProducer` tries to call `Close` on its `broker` inside the callback of `broker.AsyncProduce`; this happens when a failed `Produce` response is received while another `Produce` request is being sent concurrently. Such a callback is called from the `responseReceiver` goroutine, but the `Close` receiver blocks:
- on a `Produce` request that reaches the maximum number of in-flight requests (`config.Net.MaxOpenRequests`), while trying to acquire the `b.lock`;
- until the `responseReceiver` goroutine is done (by reading from `b.done`).
This is a regression from #2094, and I should have a fix with a simple unit test for it soon. I believe #2129 describes that regression as well, and I don't think it is specific to the `SyncProducer`.
We've seen this deadlock repeatedly on several Kafka clusters as soon as we picked up v1.31.x, and we haven't seen it since we rolled our vendored deps back to v1.30.1, so there is definitely a bug in the code; it's not a problem caused by a dead broker (our brokers are fine).
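Not Sarama's actual code, but a stripped-down sketch of the deadlock shape described in the maintainer's comment above: the callback runs on the response-receiver goroutine, calls `Close`, and `Close` waits for that same goroutine to exit:

```go
// Reduced illustration only; the real types are Sarama internals.
type fakeBroker struct {
	responses chan func() // callbacks delivered by the receiver goroutine
	done      chan struct{}
}

func (b *fakeBroker) responseReceiver() {
	for cb := range b.responses {
		cb() // if cb ends up calling b.Close(), we deadlock below
	}
	close(b.done)
}

func (b *fakeBroker) Close() {
	close(b.responses)
	<-b.done // never closed while responseReceiver is stuck inside cb()
}
```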
I think I came across the same issue with Sarama 1.31.0, Kafka 1.1.0, and Go 1.17.5. Below is my extracted goroutine stack:
How about the fix below?