strimzi-kafka-operator: New Pods are not created because of stuck informers

It looks like in some situations the informers get stuck, and as a result the StrimziPodSetController stops operating the Pods. This is a problem because it no longer recreates them after they are deleted, for example. It is not clear what the cause is, or whether it is related to Strimzi, Fabric8, or the user environment. This issue should track the reports and try to identify common patterns.
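
In case it helps others confirm whether they are hitting the same thing, one rough check (just a sketch; the deployment and namespace names are the usual defaults and may differ in your installation) is to grep the operator log for the readiness timeout errors shown in the comments below:

# Look for repeated readiness timeouts in the operator log, the symptom
# reported in the comments below. Deployment and namespace names are the
# usual defaults and may differ in your installation.
kubectl --namespace "${KUBE_NAMESPACE}" logs deployment/strimzi-cluster-operator | grep "Exceeded timeout"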

I also asked in the Fabric8 discussions for advice on how to debug this: https://github.com/fabric8io/kubernetes-client/discussions/5152

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 6
  • Comments: 15 (7 by maintainers)

Most upvoted comments

We are planning to release 0.35.1 today. That should give everyone a chance to upgrade without using an RC release. But yes, it might take more time to see whether it still happens or not.

Also observed on Google Kubernetes Engine.

Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.8-gke.500", GitCommit:"f117e29cb87cfb7e1de32ab4e163fb01ac5d0af9", GitTreeState:"clean", BuildDate:"2023-03-23T10:22:38Z", GoVersion:"go1.19.7 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Observed behavior: After deleting a zookeeper pod, the operator isn’t re-creating it (a minimal way to reproduce this is sketched after the CR snippet below). Operator logs:

2023-05-23 17:28:35 ERROR Util:166 - Reconciliation #9548(timer) Kafka(kafka/test1): Exceeded timeout of 300000ms while waiting for Pods resource test1-zookeeper-0 in namespace kafka to be ready
2023-05-23 17:28:35 ERROR AbstractOperator:260 - Reconciliation #9548(timer) Kafka(kafka/test1): createOrUpdate failed
io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource test1-zookeeper-0 in namespace kafka to be ready
...followed by a stack trace

Operator version: quay.io/strimzi/operator:0.35.0

The zookeeper stanza of the Kafka CR:

  zookeeper:
    replicas: 5
    storage:
      deleteClaim: false
      size: 1Gi
      type: persistent-claim
    template:
      pod:
        metadata:
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
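
For reference, the behavior described above can be reproduced with something like the following (a sketch; the cluster name test1 and namespace kafka are taken from the logs above, and strimzi.io/cluster is the standard label Strimzi puts on the pods it manages). On an affected operator the deleted pod is never recreated; on a healthy one it comes back within a few seconds:

# Delete one ZooKeeper pod and watch whether the operator recreates it.
# Cluster and namespace names follow the logs above; adjust for your setup.
kubectl --namespace kafka delete pod test1-zookeeper-0
kubectl --namespace kafka get pods -l strimzi.io/cluster=test1 --watch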

Ok, 0.35.1 is out … so please upgrade and let’s see if it helps … hopefully it will.

It seems like 0.35.1 has fixed the issue. I haven’t been able to reproduce the behavior in any environment.

Hi 👋 we are seeing the same thing. Looks like it’s solved. Great job, thank you @scholzj ! 🎉

Hi, I can confirm that we (at BlaBlaCar) are also affected by this bug on GKE (with Strimzi 0.34.0), across all our environments. We first spotted it when pods are shut down without the Strimzi operator being explicitly informed of it (e.g. during GKE node pool upgrades). The StrimziPodSet (SPS) resources then show stale data, reporting that all ZooKeeper/Kafka pods are up when they are not. Eventually, the log message mentioned above appears.
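
A quick way to see the mismatch described above is to compare what the StrimziPodSet resources report with the Pods that actually exist (a sketch; the cluster name my-cluster is illustrative, and strimzi.io/cluster is the standard Strimzi label):

# When the informers are stuck, the StrimziPodSet status can keep reporting
# pods as ready even though they are gone; compare it with the real Pods.
kubectl --namespace "${KUBE_NAMESPACE}" get strimzipodsets
kubectl --namespace "${KUBE_NAMESPACE}" get pods -l strimzi.io/cluster=my-cluster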

In the meantime, fortunately, the workaround is easy when the bug happens: we just restart the Strimzi operator with a rollout restart:

kubectl --context "${KUBE_CONTEXT}" --namespace "${KUBE_NAMESPACE}" rollout restart deployment strimzi-cluster-operator
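
After the restart, the operator rebuilds its informers and should notice the missing pods again; whether they actually come back can be confirmed with a watch on the Kafka cluster namespace (a sketch; "${KAFKA_CLUSTER_NAMESPACE}" is just a placeholder and may differ from the operator namespace):

# Confirm the restarted operator picks up and recreates the missing pods.
kubectl --context "${KUBE_CONTEXT}" --namespace "${KAFKA_CLUSTER_NAMESPACE}" get pods --watch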

Thanks @scholzj for filing the issue and for the attempted fix. I will try to upgrade to 0.35.1-rc1 on non-production environments and keep you posted on Wednesday.

We prepared 0.35.1-rc1 with an updated Kubernetes client, which should hopefully help with this. If you are affected by this issue, please give it a try and let us know whether it helped: https://github.com/strimzi/strimzi-kafka-operator/releases/tag/0.35.1-rc1 … I will keep this issue open until I hear more.
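
If you want to try the RC on an existing installation, one option (a sketch; it assumes the RC image is published to quay.io under the release tag like regular releases, and that you use the default deployment and container names from the Strimzi installation files) is to point the operator Deployment at the RC image:

# Switch an existing operator Deployment to the release candidate image.
# Image tag and container name are assumptions; adjust to your setup.
kubectl --namespace "${KUBE_NAMESPACE}" set image deployment/strimzi-cluster-operator \
  strimzi-cluster-operator=quay.io/strimzi/operator:0.35.1-rc1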