strimzi-kafka-operator: Topic Operator failing to start with io.vertx.core.VertxException: Thread blocked

Describe the bug
When deploying a very simple cluster with the topicOperator enabled, the topicOperator container fails to start. The logs for the container report a blocked thread. The k8s liveness check eventually kills the container.

2021-12-16 00:16:50,79115 WARN  [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 2542 ms, time limit is 2000 ms
2021-12-16 00:16:51,79090 WARN  [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 3542 ms, time limit is 2000 ms
2021-12-16 00:16:52,79034 WARN  [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 4541 ms, time limit is 2000 ms
2021-12-16 00:16:53,79105 WARN  [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 5542 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
	at jdk.internal.misc.Unsafe.park(Native Method) ~[?:?]
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) ~[?:?]
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796) ~[?:?]
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128) ~[?:?]
	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998) ~[?:?]
	at io.apicurio.registry.utils.ConcurrentUtil.get(ConcurrentUtil.java:35) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
	at io.apicurio.registry.utils.ConcurrentUtil.get(ConcurrentUtil.java:27) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
	at io.apicurio.registry.utils.ConcurrentUtil.result(ConcurrentUtil.java:54) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
	at io.strimzi.operator.topic.Session.lambda$start$9(Session.java:198) ~[io.strimzi.topic-operator-0.26.0.jar:0.26.0]
	at io.strimzi.operator.topic.Session$$Lambda$278/0x0000000840319840.handle(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureImpl$3.onSuccess(FutureImpl.java:141) ~[io.vertx.vertx-core-4.1.5.jar:4.1.5]
	at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:54) ~[io.vertx.vertx-core-4.1.5.jar:4.1.5]
	at io.vertx.core.impl.future.FutureBase$$Lambda$293/0x000000084031e040.run(Unknown Source) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[io.netty.netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]

To Reproduce
Steps to reproduce the behavior:

  1. Install Strimzi Operator using the 0.26.0 helm chart
  2. Create a Cluster manifest:
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-basic
spec:
  kafka:
    version: 3.0.0
    replicas: 1
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: ephemeral
  zookeeper:   
    replicas: 1
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
  3. Apply the manifest with kubectl apply -f kafka-basic.yaml
  4. Watch the topic operator logs with kubectl logs deploy/kafka-basic-entity-operator -c topic-operator
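
To confirm that the container restarts are driven by the liveness probe, commands along these lines help (the label and pod name follow Strimzi's naming conventions; the pod suffix is illustrative):

  # Check the restart count of the entity-operator pod
  kubectl get pods -l strimzi.io/name=kafka-basic-entity-operator

  # Look for liveness probe failures in the pod events
  kubectl describe pod kafka-basic-entity-operator-<hash> | grep -i -A2 liveness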

Expected behavior
The topic operator starts correctly.

Environment:

  • Strimzi version: 0.26.0
  • Installation method: Helm chart
  • Kubernetes cluster: Kubernetes 1.20.7
  • Infrastructure: Amazon EKS

YAML files and logs
Thanks for the handy script! report-16-12-2021_11-26-59.zip

Additional context
Similar errors show up in these issues:
https://github.com/strimzi/strimzi-kafka-operator/issues/383
https://github.com/strimzi/strimzi-kafka-operator/issues/1050
https://github.com/strimzi/strimzi-kafka-operator/issues/4964

Increasing the resource claims for the topic operator didn’t change the behaviour.
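
For reference, the resource bump was expressed roughly like this in the Kafka custom resource (the values are illustrative, not a recommendation):

entityOperator:
  topicOperator:
    # Illustrative requests/limits; raising these did not stop the blocked-thread crash
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 1Gi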

Zookeeper doesn’t show any errors or timeouts.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (9 by maintainers)

Most upvoted comments

Also running into this.

For the time being, I am defaulting back to the ZooKeeper store instead of the Kafka Streams store by adding the following:

  entityOperator:
    template:
      topicOperatorContainer:
        env:
        - name: STRIMZI_USE_ZOOKEEPER_TOPIC_STORE
          value: "true"

FWIW this still seems to be an issue in my case, and I’ve been grateful for the hack above. Currently deploying 0.36.0 using the quickstart.

@LiamClarkeNZ I did not keep my logs unfortunately, but looks like you reproduced it.

Separately, did anyone revert the STRIMZI_USE_ZOOKEEPER_TOPIC_STORE=true setting successfully?

Using ZK for now works, but as you note, ZK will eventually disappear, so I guess overriding is only fine in the short term.

@danlenar’s solution worked for me when I was migrating an existing cluster to a new namespace and ran into an issue where the strimzi-store-topic would not come ready due to InvalidStateStoreException. Posting here in case anyone else embarks on the unenviable task of moving a cluster to a new namespace…
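
If you need to inspect those internal store topics yourself, something along these lines works from a broker pod (the pod name matches the repro cluster above; topic names and paths follow Strimzi's conventions, so adjust for your cluster):

# Describe the Topic Operator's internal store topic and its Kafka Streams changelog
kubectl exec -it kafka-basic-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic __strimzi_store_topic

kubectl exec -it kafka-basic-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic __strimzi-topic-operator-kstreams-topic-store-changelog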

@Cave-Johnson in the Kafka custom resource spec.

Example:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  entityOperator:
    template:
      topicOperatorContainer:
        env:
        - name: STRIMZI_USE_ZOOKEEPER_TOPIC_STORE
          value: "true"
  # ...
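
One gotcha with the snippet above: the env value must be the quoted string "true". Kubernetes container env values are strings, so an unquoted YAML boolean will be rejected when the manifest is applied.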

I’m also having this issue, and with the template workaround it’s working fine.

I am also having this issue.

The temporary fix from @danlenar is what’s working for me at the moment.