kubernetes-kafka: Regular broker crashes since 5.0
I’m having some serious issues with my Kafka cluster since upgrading to 5.0 / Kafka 2.1.0. The cluster ran completely stable for the last 6 months, but since the upgrade it crashes every few days in regular operation, and within a few minutes if I put heavy load on it. Unfortunately I can’t downgrade, as the Kafka upgrade guide states there were some incompatible changes to the internal consumer offsets schema. I also have a hard time tracing the issue, but it always has the same symptoms:
The consumers/producers in all my client services suddenly produce a lot of errors like
Got error produce response with correlation id 3827806 on topic-partition ..., retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
and
Failed to commit stream task 1_15 due to the following error: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {...=OffsetAndMetadata{offset=28547576, leaderEpoch=null, metadata=''}}
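For context, those numbers seem to line up with the stock client settings in this Kafka version as far as I can tell: the producer’s retries default is Integer.MAX_VALUE (hence the huge "attempts left" counter), and 60000 ms is the consumer’s default.api.timeout.ms, which bounds synchronous offset commits. A minimal sketch of the relevant keys (the config constants are standard Kafka client configs; the values shown are just the documented defaults, not necessarily what my services use):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ClientTimeoutDefaults {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        // Since Kafka 2.1 the producer retries "forever" (Integer.MAX_VALUE) and is
        // bounded by delivery.timeout.ms instead, which is why the log above shows
        // "2147483646 attempts left" after the first failed attempt.
        producerProps.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        producerProps.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        producerProps.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);

        Properties consumerProps = new Properties();
        // 60000 ms is the default for default.api.timeout.ms, the bound on synchronous
        // offset commits, matching the TimeoutException from the stream task above.
        consumerProps.put(ConsumerConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 60_000);

        System.out.println("producer settings: " + producerProps);
        System.out.println("consumer settings: " + consumerProps);
    }
}
```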
While those client errors happen, the broker logs show that the connection to one node fails (it’s always a different node):
WARN [Controller id=0, targetBrokerId=0] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.24.6.64:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
WARN [ReplicaFetcher replicaId=0, leaderId=4, fetcherId=3] Error in response for fetch request...
The affected node itself (in this case node 0) logs the following:
java.lang.IllegalStateException: No entry found for connection X
    at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:330)
    at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:134)
    at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:885)
    at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:276)
    at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:64)
    at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:92)
    at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:241)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
    at scala.Option.foreach(Option.scala:257)
    at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
The only solution I have found so far is to stop all my services, scale the brokers down to zero and back up again, effectively restarting everything.
When starting back up, the affected node also restores a ton of corrupted index files (...Found a corrupted index file corresponding to log file...). The other nodes start up normally.
The cluster consists of 5 brokers and runs on GCE. I also tried up- and downgrading the Kubernetes version, but that didn’t help, so my guess is it’s either an issue with Kafka 2.1.0 itself or something to do with the upgrade to Java 11. Any ideas?
Thanks.
About this issue
- State: open
- Created 5 years ago
- Comments: 19 (9 by maintainers)
It’s both, but more consumer-heavy than producer-heavy. A typical scenario is about 60 MB/s read and 10 MB/s write per node, so 300 MB/s read and 60 MB/s write in total.
So far the Java 8 based nodes are running without any hiccups, both in regular operation and in a few high-load scenarios.
Yes, that’s the change.
I don’t see how it’s a memory issue, because the reported memory usage of the container never goes above 1.5 GB of RAM; that’s with Java 8, both without resource limits and with a memory limit of 8 GB.
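If it helps, a quick way to double-check what the JVM itself sees inside the container (heap ceiling and CPU count, which is where a Java 8 vs. Java 11 difference could show up) is a throwaway class like the one below. It only uses the standard Runtime API, nothing Kafka-specific, and is just a diagnostic sketch:

```java
public class JvmContainerCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Maximum heap the JVM will grow to (driven by -Xmx, or by the container
        // memory limit on JVMs with container support enabled).
        System.out.printf("max heap: %d MiB%n", rt.maxMemory() / (1024 * 1024));
        // CPU count the JVM believes it has, relevant for GC and broker thread pools.
        System.out.printf("available processors: %d%n", rt.availableProcessors());
    }
}
```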