strimzi-kafka-operator: Topic Operator failing to start with io.vertx.core.VertxException: Thread blocked
Describe the bug
When deploying a very simple cluster with the topicOperator enabled, the topicOperator container fails to start. The logs for the container report a blocked thread. The k8s liveness check eventually kills the container.
2021-12-16 00:16:50,79115 WARN [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 2542 ms, time limit is 2000 ms
2021-12-16 00:16:51,79090 WARN [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 3542 ms, time limit is 2000 ms
2021-12-16 00:16:52,79034 WARN [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 4541 ms, time limit is 2000 ms
2021-12-16 00:16:53,79105 WARN [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 5542 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
at jdk.internal.misc.Unsafe.park(Native Method) ~[?:?]
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) ~[?:?]
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796) ~[?:?]
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128) ~[?:?]
at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998) ~[?:?]
at io.apicurio.registry.utils.ConcurrentUtil.get(ConcurrentUtil.java:35) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
at io.apicurio.registry.utils.ConcurrentUtil.get(ConcurrentUtil.java:27) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
at io.apicurio.registry.utils.ConcurrentUtil.result(ConcurrentUtil.java:54) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
at io.strimzi.operator.topic.Session.lambda$start$9(Session.java:198) ~[io.strimzi.topic-operator-0.26.0.jar:0.26.0]
at io.strimzi.operator.topic.Session$$Lambda$278/0x0000000840319840.handle(Unknown Source) ~[?:?]
at io.vertx.core.impl.future.FutureImpl$3.onSuccess(FutureImpl.java:141) ~[io.vertx.vertx-core-4.1.5.jar:4.1.5]
at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:54) ~[io.vertx.vertx-core-4.1.5.jar:4.1.5]
at io.vertx.core.impl.future.FutureBase$$Lambda$293/0x000000084031e040.run(Unknown Source) ~[?:?]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[io.netty.netty-transport-4.1.68.Final.jar:4.1.68.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
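The restarts triggered by the failing liveness probe can be confirmed by watching the entity-operator pod; the label selector below is an assumption based on the standard strimzi.io/cluster label rather than something taken from this report:
# Watch the entity-operator pod's restart count climb while the probe keeps killing the container
kubectl get pods -l strimzi.io/cluster=kafka-basic -w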
To Reproduce
Steps to reproduce the behavior:
- Install Strimzi Operator using the 0.26.0 helm chart (see the example command after this list)
- Create a Cluster manifest:
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-basic
spec:
  kafka:
    version: 3.0.0
    replicas: 1
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: ephemeral
  zookeeper:
    replicas: 1
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
- Apply the manifest with kubectl apply -f kafka-basic.yaml
- Watch the topic operator logs with kubectl logs deploy/kafka-basic-entity-operator -c topic-operator
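The helm chart install step from the list above might look like the following; the chart repository URL and release name are assumptions based on the standard Strimzi chart, not details from this report:
# Add the Strimzi chart repository and install operator version 0.26.0
helm repo add strimzi https://strimzi.io/charts/
helm repo update
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator --version 0.26.0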
Expected behavior
The topic operator starts correctly.
Environment:
- Strimzi version: 0.26.0
- Installation method: Helm chart
- Kubernetes cluster: Kubernetes 1.20.7
- Infrastructure: Amazon EKS
YAML files and logs
Thanks for the handy script! report-16-12-2021_11-26-59.zip
Additional context
Similar errors show up in these issues:
- https://github.com/strimzi/strimzi-kafka-operator/issues/383
- https://github.com/strimzi/strimzi-kafka-operator/issues/1050
- https://github.com/strimzi/strimzi-kafka-operator/issues/4964
Increasing the resource claims for the topic operator didn’t change the behaviour.
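For reference, such a resource increase is set on the topic operator in the Kafka CR; the field path matches the Strimzi API, but the values below are only illustrative:
spec:
  entityOperator:
    topicOperator:
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi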
Zookeeper doesn’t show any errors or timeouts.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 25 (9 by maintainers)
Also running into this.
For the time being, I am defaulting back to the ZooKeeper topic store instead of the Kafka Streams topic store by doing the following:
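A sketch of what that override might look like: the STRIMZI_USE_ZOOKEEPER_TOPIC_STORE variable is the one named later in this thread, while injecting it through the entity-operator container template is an assumption about how it was done here:
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
    template:
      topicOperatorContainer:
        env:
          - name: STRIMZI_USE_ZOOKEEPER_TOPIC_STORE
            value: "true"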
FWIW this still seems to be an issue in my case, and I’ve been grateful for the hack above. Currently deploying 0.36.0 using the quickstart.
@LiamClarkeNZ I did not keep my logs unfortunately, but looks like you reproduced it.
Separately, did anyone revert the STRIMZI_USE_ZOOKEEPER_TOPIC_STORE=true setting successfully?
Using ZK for now is fine, but as you note ZK will eventually disappear. So I guess overriding is fine in the short term.
@danlenar’s solution worked for me when I was migrating an existing cluster to a new namespace and ran into an issue where the strimzi-store-topic would not come ready due to InvalidStateStoreException. Posting here in case anyone else embarks on the unenviable task of moving a cluster to a new namespace…
@Cave-Johnson in the Kafka custom resource spec.
Example:
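(The snippet itself is not preserved in this capture; it presumably matches the STRIMZI_USE_ZOOKEEPER_TOPIC_STORE override under spec.entityOperator.template.topicOperatorContainer.env sketched earlier in this thread.) Once the entity-operator Deployment rolls, the variable can be confirmed on the topic-operator container with a hypothetical check reusing the deployment name from this issue:
kubectl get deployment kafka-basic-entity-operator -o yaml | grep -A 1 STRIMZI_USE_ZOOKEEPER_TOPIC_STORE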
I’m also having this issue, and with the template override it’s working fine.
I am also having this issue.
The temporary fix from @danlenar is what has helped me at the moment.