emqx-operator: Broker cluster never successfully updates after updating operator from 2.1.4 to 2.2.0

Describe the bug

We have multiple EMQX clusters deployed using the emqx-operator v2.1.4.
On our DEV cluster I updated the operator to v2.2.0 and watched the Pods. The new STS and Pod for the core node were created (on DEV it’s a small broker cluster with only 1 core and 1 replicant). The new core node Pod shows as ready in kubectl get pod, but the STS never shows it as ready because the apps.emqx.io/on-serving condition is never set at all.
When I check the EMQX dashboard I can, however, actually see the new core node and everything looks green there for it. Also, there are no errors in the Pod logs of the new core node.

AFAICT the code which is responsible for creating and then updating that condition (methods updatePodConditions#reconcile and updatePodConditions#checkInCluster) checks the .status.coreNodes field of the EMQX resource. When I look at that resource in the cluster, it does not have that field at all.
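To make the suspected failure mode concrete, here is a minimal Go sketch. It is not the operator's actual code. It assumes the condition is only written for a Pod whose node name appears in .status.coreNodes, so an absent field means no condition is ever produced. The types NodeStatus and PodCondition and the function onServingCondition are hypothetical stand-ins invented for illustration.

```go
package main

import "fmt"

// NodeStatus is a hypothetical, simplified stand-in for one entry
// of .status.coreNodes on the EMQX resource.
type NodeStatus struct {
	Node       string // node name, e.g. "emqx@core-0"
	NodeStatus string // e.g. "running"
}

// PodCondition is a simplified stand-in for a Pod status condition.
type PodCondition struct {
	Type   string
	Status string // "True" or "False"
}

const onServing = "apps.emqx.io/on-serving"

// onServingCondition returns the condition to set for the Pod that runs
// podNode. The second return value is false when podNode is not listed in
// coreNodes, in which case no condition is produced at all (assumed behavior).
func onServingCondition(podNode string, coreNodes []NodeStatus) (PodCondition, bool) {
	for _, n := range coreNodes {
		if n.Node == podNode {
			status := "False"
			if n.NodeStatus == "running" {
				status = "True"
			}
			return PodCondition{Type: onServing, Status: status}, true
		}
	}
	return PodCondition{}, false
}

func main() {
	// status.coreNodes missing: nothing to match, so no condition is set.
	_, ok := onServingCondition("emqx@core-0", nil)
	fmt.Println(ok) // false

	// status.coreNodes populated: the condition is produced.
	cond, ok := onServingCondition("emqx@core-0",
		[]NodeStatus{{Node: "emqx@core-0", NodeStatus: "running"}})
	fmt.Println(ok, cond.Type, cond.Status) // true apps.emqx.io/on-serving True
}
```

Under that assumption, a Pod can be healthy and visible in the dashboard while the STS still waits for a condition that nothing ever writes.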

The original EMQX resource was created in version apps.emqx.io/v2alpha1 but, of course, was updated in-cluster to apps.emqx.io/v2beta1 during the upgrade of the operator. Not sure if that plays a role.

I haven’t looked any deeper into why status.coreNodes is missing in the EMQX resource yet.

To Reproduce

Provision emqx-operator in version 2.1.4. Provision a small EMQX broker cluster using API version apps.emqx.io/v2alpha1, setting both core and replicants count to 1 and using EMQX image version emqx:5.0.23.

Then upgrade the operator to version 2.2.0 and watch the STS and Pods.

Expected behavior

Core nodes are updated successfully.

Anything else we need to know?:

Environment details:

  • Kubernetes version: v1.24.10
  • Cloud-provider/provisioner: MS AKS
  • emqx-operator version: 2.1.4 / 2.2.0
  • Install method: Helm

About this issue

  • State: closed
  • Created a year ago
  • Comments: 52 (22 by maintainers)

Most upvoted comments

I would prefer the 2nd option, as I agree that most people might still operate with a version below 2.2.0 of the operator. From my perspective, most people would probably prefer a clean 2.2.x version.

I think so, let’s do it!

@Rory-Z thanks again for the quick support here!

Hi guys, it looks like 2.2.1 works, so I will close this issue. If you have any questions, feel free to reopen it at any time.

I can also confirm healthy cluster and no issues. Great work.

Ok, thanks for the responses @Rory-Z, I think I will delete the replicantTemplate spec entirely if this leads to emqx having a good status, no problem.

@Rory-Z works great, haven’t found anything wrong yet, I will test it more today.

@Rory-Z Great, 2.2.1 works for me flawlessly now. Thank you!

Hi guys @tollercode @mariusstaicu @martinzima @clive-jevons, the EMQX Operator 2.2.1 is released, could you please try it?

@mariusstaicu you need to add the labels that are present on the ‘kind: EMQX’ resource to the template. It seems the check tries to verify that the labels on the Pod, STS and EMQX resource match.

In my case the ArgoCD application automatically added app.kubernetes.io/instance: emqx-cluster-dev as a label. I added this to the coreTemplate like this:

coreTemplate:
  metadata:
    labels:
      app.kubernetes.io/instance: emqx-cluster-dev
      emqx-cluster: emqx-cluster

That fixed the issue.
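A quick way to spot which labels a template is missing is a small Go sketch like the following. It assumes the operator expects every label on the EMQX resource to also appear in the template with the same value, which this thread suggests but does not confirm. The function missingLabels is a hypothetical helper, not part of the operator.

```go
package main

import (
	"fmt"
	"sort"
)

// missingLabels returns, in sorted order, the keys of want that are
// absent from got or carry a different value there.
func missingLabels(want, got map[string]string) []string {
	var missing []string
	for k, v := range want {
		if gv, ok := got[k]; !ok || gv != v {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Labels on the EMQX resource, including the one ArgoCD added.
	resource := map[string]string{
		"app.kubernetes.io/instance": "emqx-cluster-dev",
		"emqx-cluster":               "emqx-cluster",
	}
	// Labels in the coreTemplate before the fix.
	template := map[string]string{"emqx-cluster": "emqx-cluster"}

	fmt.Println(missingLabels(resource, template)) // [app.kubernetes.io/instance]
}
```

An empty result would mean the template already carries every label found on the resource.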

No joy, unfortunately. After changing the broker version the operator started crashing. So I uninstalled both operator and broker and re-installed using new versions (2.2.0 of operator, 5.1.1 of broker), but then I got stuck with the core nodes updating problem again:

failed to get node statues by API: failed to get API http://<cluster IP redacted>:18083/api/v5/nodes: failed to request API: Get "http://<cluster IP redacted>:18083/api/v5/nodes": dial tcp <cluster IP redacted>:18083: connect: connection refused

Have rolled back to operator 2.1.4 and broker 5.0.23 for now. Will try again with the next releases 😬