strimzi-kafka-operator: [Bug]: KafkaUser does not reach Ready state
Bug Description
When trying to create a KafkaUser, I can see the ACLs being applied on Kafka (verified with the kafka-acls.sh script) and the user-operator’s reconciliation loop ends successfully, but the user secret is not created.
And as a consequence the KafkaUser itself also does not reach Ready status.
This happens on reconciliation after the fabric8 client watches (on KafkaUsers and Secrets) have died.
After the user-operator has been running for a while, the watches die and it no longer reacts immediately when a KafkaUser is created or deleted.
Although this is already an issue in itself, I would expect the situation to correct itself on timed reconciliation, but it doesn’t.
I was able to track the issue down to createOrReplaceSecret() in KafkaUserOperator.java:
client.secrets().inNamespace(namespace).resource(secret).create();
It looks like this call returns as if the action completed successfully, but doesn’t actually do anything.
When restarting the container (or the entity-operator Pod), all watches work correctly again, and the KafkaUser (and its secret) gets created without a problem. But after some time the watches die again and we’re back to square one.
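The symptom described above (a create call that returns normally while the secret never appears) could in principle be detected by reading the resource back after creating it. The following is only a minimal sketch of that create-then-verify idea, not the Strimzi or fabric8 implementation: `SecretStore`, `WorkingStore`, `SilentNoOpStore` and `createAndVerify` are hypothetical names invented for illustration, and the in-memory stores merely stand in for the real Kubernetes client.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SecretCreateCheck {

    /** Hypothetical stand-in for the client calls; NOT a fabric8 or Strimzi type. */
    interface SecretStore {
        void create(String namespace, String name, Map<String, String> data);
        Optional<Map<String, String>> get(String namespace, String name);
    }

    /** Store that behaves as expected: create() persists the secret. */
    static class WorkingStore implements SecretStore {
        private final Map<String, Map<String, String>> secrets = new HashMap<>();

        public void create(String namespace, String name, Map<String, String> data) {
            secrets.put(namespace + "/" + name, data);
        }

        public Optional<Map<String, String>> get(String namespace, String name) {
            return Optional.ofNullable(secrets.get(namespace + "/" + name));
        }
    }

    /** Store that mimics the reported symptom: create() returns normally but stores nothing. */
    static class SilentNoOpStore implements SecretStore {
        public void create(String namespace, String name, Map<String, String> data) {
            // Intentionally does nothing and throws nothing.
        }

        public Optional<Map<String, String>> get(String namespace, String name) {
            return Optional.empty();
        }
    }

    /** Creates the secret, then reads it back and fails loudly if it is missing. */
    static void createAndVerify(SecretStore store, String namespace, String name,
                                Map<String, String> data) {
        store.create(namespace, name, data);
        if (!store.get(namespace, name).isPresent()) {
            throw new IllegalStateException(
                    "Secret " + namespace + "/" + name + " not found after create()");
        }
    }

    public static void main(String[] args) {
        // Placeholder payload; real user secrets hold certificates or passwords.
        Map<String, String> data = Collections.singletonMap("key", "placeholder");

        createAndVerify(new WorkingStore(), "infra-kafka", "kafka-sec-testuser", data);
        System.out.println("working store: secret verified");

        try {
            createAndVerify(new SilentNoOpStore(), "infra-kafka", "kafka-sec-testuser", data);
        } catch (IllegalStateException e) {
            System.out.println("silent store: " + e.getMessage());
        }
    }
}
```

With a check like this, a reconciliation that silently fails to create the secret would surface as an error instead of being logged as "reconciled". Whether such a read-back is appropriate in the real operator is a separate design question.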
Steps to reproduce
Reproducing is not that easy, as you first have to wait until the watches (on KafkaUser resources and Secret resources) die, but once they do the issue can be reproduced with the following steps:
- create KafkaUser
- wait for the user-operator logs to say "<kafka-user>: reconciled"
- check with kafka-acls.sh if the ACLs got applied
- check the user secret: it is not there
Expected behavior
Timed reconciliation of KafkaUsers should correctly create the user secret together with the ACLs on Kafka.
Strimzi version
0.35.0
Kubernetes version
1.25.5
Installation method
Helm
Infrastructure
AKS
Configuration files and logs
LOGS of the user-operator:
2023-05-30 11:38:55 DEBUG QuotasOperator:71 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): No expected quotas and no existing quotas -> NoOp
2023-05-30 11:38:55 DEBUG QuotasOperator:71 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): No expected quotas and no existing quotas -> NoOp
2023-05-30 11:38:55 DEBUG SimpleAclOperator:89 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): 4 expected Acl rules and 4 existing Acl rules -> Reconciling rules
2023-05-30 11:38:55 DEBUG SimpleAclOperator:188 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): Requesting update of ACLs for user CN=testuser
2023-05-30 11:38:55 DEBUG SimpleAclOperator:78 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): No expected Acl rules and no existing Acl rules -> NoOp
2023-05-30 11:38:55 DEBUG KafkaUserOperator:442 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): Secret infra-kafka/kafka-sec-testuser does not exist, creating it
2023-05-30 11:38:55 INFO UserControllerLoop:120 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): reconciled
2023-05-30 11:38:55 DEBUG StatusDiff:42 - Ignoring Status diff {"op":"replace","path":"/conditions/0/lastTransitionTime","value":"2023-05-30T11:38:55.504712480Z"}
2023-05-30 11:38:55 DEBUG ReconciliationLockManager:59 - Trying to release lock KafkaUser::infra-kafka::testuser
2023-05-30 11:38:55 DEBUG ReconciliationLockManager:62 - Lock KafkaUser::infra-kafka::testuser is not in use anymore and will be removed
2023-05-30 11:38:55 DEBUG AbstractControllerLoop:181 - KafkaUser-ControllerLoop-19: Waiting for next event from work queue
Additional context
No response
About this issue
- State: closed
- Created a year ago
- Comments: 26 (13 by maintainers)
Thanks for the update on this -> I take that as careful optimism 😄. I think we might release the 0.35.1 GA tomorrow and fingers crossed it will solve the main issue.
Maybe, but when I restarted the user-operator back when I encountered this issue, everything went back to normal. Anyway, I’ve seen it only once, and I’m basically unable to trigger it again … but I’ll keep an eye out for it and will try to capture as much as possible in case it happens again…