strimzi-kafka-operator: New Pods are not created because of stuck informers
It looks like in some situations the informers get stuck and, as a result, the StrimziPodSetController no longer manages the Pods. This is a problem because, for example, it does not recreate them after they are deleted. It is not clear what the cause is or whether it is related to Strimzi, Fabric8, or the user environment. This issue should track the reports and try to find common patterns.
I also asked in the Fabric8 discussions for any advice to debug this: https://github.com/fabric8io/kubernetes-client/discussions/5152
About this issue
- State: closed
- Created a year ago
- Reactions: 6
- Comments: 15 (7 by maintainers)
We are planning to release 0.35.1 today. That should give everyone a chance to upgrade without using an RC release. But yes, it might take more time to see whether it still happens.
Also observed on Google Kubernetes Engine.
Observed behavior: After deleting a zookeeper pod, the operator isn’t re-creating it. Operator logs:
Operator version: quay.io/strimzi/operator:0.35.0
The zookeeper stanza of the Kafka CR:
Ok, 0.35.1 is out … so please upgrade and let’s see if it helps … hopefully it will.
Hi 👋 we are seeing the same thing. Looks like it’s solved. Great job, thank you @scholzj ! 🎉
Hi, I confirm we (at BlaBlaCar) are also affected by this bug on GKE (with Strimzi 0.34.0), on all our environments. It was first spotted when pods were shut down without the Strimzi operator being specifically informed of it (e.g. by GKE node group upgrades). The SPS resources then show inaccurate data, saying all Zookeeper/Kafka pods are up when they are not. Eventually, the log mentioned above appears.
In the meantime, fortunately, the workaround is easy if the bug happens: we just restart the Strimzi controller with a rollout restart:
Thanks @scholzj for publishing the bug and for the resolution attempt. I will try to upgrade to 0.35.1-rc1 on non-production environments and keep you posted on Wednesday.
We prepared 0.35.1-rc1 with an updated Kubernetes Client, which should hopefully help with this. If you are affected by this issue, please give it a try and let us know if it helped: https://github.com/strimzi/strimzi-kafka-operator/releases/tag/0.35.1-rc1 … I will keep this issue open until hearing more.