strimzi-kafka-operator: [Bug]: KafkaUser does not reach Ready state

Bug Description

When I create a KafkaUser, the ACLs are applied on Kafka (verified with the kafka-acls.sh script) and the user-operator’s reconciliation loop ends successfully, but the user secret is not created. As a consequence, the KafkaUser itself never reaches the Ready status.

This happens on reconciliation after the (fabric8 client) watches on KafkaUsers and Secrets have died. When the user-operator is left running for a while, the watches die and it no longer reacts immediately when a KafkaUser is created or deleted.
Although that is already an issue in itself, I would expect the situation to correct itself on the timed reconciliation, but it doesn’t.
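
To illustrate the expectation: a timed reconciliation would not depend on watch events at all. It would compare the users that should have a secret against the secrets that actually exist. The sketch below is only an illustration of that idea, not Strimzi's implementation. `PeriodicSecretCheck`, `missingSecrets` and the `secretPrefix` parameter are hypothetical names, and the `kafka-sec-` prefix in the usage note is taken from the secret name in the logs of this report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a Strimzi class): one periodic pass that compares
// the desired users with the secrets that actually exist.
class PeriodicSecretCheck {

    // Returns the names of secrets that should exist but were not found.
    static List<String> missingSecrets(List<String> userNames,
                                       Set<String> existingSecretNames,
                                       String secretPrefix) {
        List<String> missing = new ArrayList<>();
        for (String user : userNames) {
            String secretName = secretPrefix + user;
            if (!existingSecretNames.contains(secretName)) {
                // No event is needed to notice this gap; the comparison alone finds it.
                missing.add(secretName);
            }
        }
        return missing;
    }
}
```

Called with the users `testuser` and `otheruser`, an existing secret set containing only `kafka-sec-otheruser`, and the prefix `kafka-sec-`, the method returns a list with the single entry `kafka-sec-testuser`. The set of existing secrets would have to come from a fresh list call rather than from a watch-backed cache for this to help in the situation described here.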

I was able to track the issue down to createOrReplaceSecret() in KafkaUserOperator.java:

client.secrets().inNamespace(namespace).resource(secret).create();

It looks like this call returns as if the action completed successfully, but doesn’t actually do anything.
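
One way to make a silent no-op like this visible would be to read the Secret back after the create call and fail the reconciliation if it is still missing. The sketch below is not Strimzi's code and does not use the fabric8 API. `SecretStore`, `InMemorySecretStore`, `SilentlyFailingStore` and `VerifiedSecretCreator` are hypothetical names, and the store interface merely stands in for the Kubernetes client so that the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction standing in for the Kubernetes client
// (not a fabric8 or Strimzi type).
interface SecretStore {
    void create(String namespace, String name, Map<String, String> data);
    Optional<Map<String, String>> get(String namespace, String name);
}

// Well-behaved store: create() really persists the secret.
class InMemorySecretStore implements SecretStore {
    private final Map<String, Map<String, String>> secrets = new ConcurrentHashMap<>();

    @Override
    public void create(String namespace, String name, Map<String, String> data) {
        secrets.put(namespace + "/" + name, new HashMap<>(data));
    }

    @Override
    public Optional<Map<String, String>> get(String namespace, String name) {
        return Optional.ofNullable(secrets.get(namespace + "/" + name));
    }
}

// Misbehaving store that mimics the reported symptom:
// create() returns normally but stores nothing.
class SilentlyFailingStore implements SecretStore {
    @Override
    public void create(String namespace, String name, Map<String, String> data) {
        // Intentionally does nothing.
    }

    @Override
    public Optional<Map<String, String>> get(String namespace, String name) {
        return Optional.empty();
    }
}

class VerifiedSecretCreator {
    private final SecretStore store;

    VerifiedSecretCreator(SecretStore store) {
        this.store = store;
    }

    // Creates the secret, then reads it back and throws if it is absent,
    // so the caller cannot report success for a secret that does not exist.
    void createAndVerify(String namespace, String name, Map<String, String> data) {
        store.create(namespace, name, data);
        if (!store.get(namespace, name).isPresent()) {
            throw new IllegalStateException(
                "Secret " + namespace + "/" + name + " was not found after create()");
        }
    }
}
```

With `InMemorySecretStore` the call completes normally. With `SilentlyFailingStore` it throws an `IllegalStateException`, which would turn the misleading "reconciled" log line into a visible failure. In a real operator the read-back would need to query the API server directly rather than a possibly stale informer cache.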


When restarting the container (or the entity-operator Pod), all watches work correctly again, and the KafkaUser (and its secret) gets created without a problem. But after some time the watches die again and we’re back to square one.

Steps to reproduce

Reproducing is not that easy, because you first have to wait until the watches (on KafkaUser and Secret resources) die. Once they do, the issue can be reproduced with the following steps:

  1. create KafkaUser
  2. wait for the user-operator logs to say <kafka-user>: reconciled
  3. check with kafka-acls.sh if the ACLs got applied
  4. check the user secret: it is not there

Expected behavior

Timed reconciliation of KafkaUsers should correctly create the user secret together with the ACLs on Kafka.

Strimzi version

0.35.0

Kubernetes version

1.25.5

Installation method

Helm

Infrastructure

AKS

Configuration files and logs

LOGS of the user-operator:

2023-05-30 11:38:55 DEBUG QuotasOperator:71 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): No expected quotas and no existing quotas -> NoOp
2023-05-30 11:38:55 DEBUG QuotasOperator:71 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): No expected quotas and no existing quotas -> NoOp
2023-05-30 11:38:55 DEBUG SimpleAclOperator:89 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): 4 expected Acl rules and 4 existing Acl rules -> Reconciling rules
2023-05-30 11:38:55 DEBUG SimpleAclOperator:188 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): Requesting update of ACLs for user CN=testuser
2023-05-30 11:38:55 DEBUG SimpleAclOperator:78 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): No expected Acl rules and no existing Acl rules -> NoOp
2023-05-30 11:38:55 DEBUG KafkaUserOperator:442 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): Secret infra-kafka/kafka-sec-testuser does not exist, creating it
2023-05-30 11:38:55 INFO  UserControllerLoop:120 - Reconciliation #821(timer) KafkaUser(infra-kafka/testuser): reconciled
2023-05-30 11:38:55 DEBUG StatusDiff:42 - Ignoring Status diff {"op":"replace","path":"/conditions/0/lastTransitionTime","value":"2023-05-30T11:38:55.504712480Z"}
2023-05-30 11:38:55 DEBUG ReconciliationLockManager:59 - Trying to release lock KafkaUser::infra-kafka::testuser
2023-05-30 11:38:55 DEBUG ReconciliationLockManager:62 - Lock KafkaUser::infra-kafka::testuser is not in use anymore and will be removed
2023-05-30 11:38:55 DEBUG AbstractControllerLoop:181 - KafkaUser-ControllerLoop-19: Waiting for next event from work queue

Additional context

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 26 (13 by maintainers)

Most upvoted comments

I have had the user-operator running for 24 hours now and have not seen the Informers die again, so it looks like kubernetes-client v6.7.0 fixes the issue around the Informers/Watches.

Thanks for the update on this -> I take that as careful optimism 😄. I think we might release the 0.35.1 GA tomorrow, and fingers crossed it will solve the main issue.

I’m afraid I have no idea what would cause that. Creating the KafkaUser resource does trigger multiple reconciliations as there are multiple events happening => the initial KafkaUser ADDED, Secret ADDED, KafkaUser MODIFIED with new status, etc. That is why the lock is there. But if you see Secret ADDED and DELETED events fire in rapid succession, that seems weird. It could happen, for example, if something else is relabeling the secret.
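
The role of such a lock can be sketched as a registry of named locks, one per resource, so that reconciliations triggered by several events for the same KafkaUser run one after another. This is a minimal illustration only, not Strimzi's ReconciliationLockManager. `NamedLockRegistry` and its methods are hypothetical names, and only the `Kind::namespace::name` key format is taken from the log lines in this report.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical per-resource lock registry; illustrative only.
class NamedLockRegistry {

    // A lock plus the number of threads currently holding or waiting for it.
    private static final class Entry {
        final ReentrantLock lock = new ReentrantLock();
        int users = 0;
    }

    private final ConcurrentHashMap<String, Entry> locks = new ConcurrentHashMap<>();

    // Builds a key such as "KafkaUser::infra-kafka::testuser".
    static String key(String kind, String namespace, String name) {
        return kind + "::" + namespace + "::" + name;
    }

    void lock(String key) {
        // compute() runs atomically per key, so the user count cannot race
        // with the removal done in unlock().
        Entry entry = locks.compute(key, (k, existing) -> {
            Entry e = existing != null ? existing : new Entry();
            e.users++;
            return e;
        });
        // Block outside compute() so other keys are not held up while waiting.
        entry.lock.lock();
    }

    void unlock(String key) {
        locks.compute(key, (k, entry) -> {
            if (entry == null) {
                throw new IllegalStateException("No lock registered for " + k);
            }
            entry.lock.unlock();
            entry.users--;
            // Remove the entry once nobody holds or waits for the lock.
            return entry.users == 0 ? null : entry;
        });
    }

    // Number of locks currently registered.
    int size() {
        return locks.size();
    }
}
```

A reconciliation would call `lock(key)` before touching the resource and `unlock(key)` in a `finally` block. Once the last user releases a key, its entry is dropped from the map, which corresponds to the "not in use anymore and will be removed" message seen in the logs.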

Maybe, but when I restarted the user-operator back when I encountered this issue, everything went back to normal. Anyway, I’ve seen it only once and I’m basically unable to trigger it again … but I’ll keep an eye out for it and will try to capture as much as possible in case it happens again…