harvester: [BUG] Harvester single node upgrade will get `another operation (install/upgrade/rollback) is in progress` error

Describe the bug

A single-node Harvester upgrade hits the `another operation (install/upgrade/rollback) is in progress` error after the node reboots. This blocks the next Harvester upgrade or managedChart update. Potentially related to https://github.com/helm/helm/issues/8987#issuecomment-786149813

To Reproduce

Steps to reproduce the behavior:

  1. Install a Harvester cluster with an old version, e.g., v1.1.1.
  2. Upgrade the Harvester cluster to a newer version, e.g., v1.1.2-head.iso.
  3. After the upgrade completes and the upgrade status shows success, check the harvester managedChart status. It contains the following error:
conditions:
  - lastUpdateTime: "2023-03-08T05:42:32Z"
    message: 'ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback)
      is in progress]; daemonset.apps harvester-system/kube-vip [progressing] Available:
      0/1; kubevirt.kubevirt.io harvester-system/kubevirt [progressing] Deployin
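
The condition messages can also be printed directly instead of reading the full YAML. This is a minimal sketch, assuming the ManagedChart is named harvester and lives in the fleet-local namespace, and that each condition carries type and message fields. The helper name is hypothetical.

```shell
# Hypothetical helper: print "<type>: <message>" for each condition of a
# ManagedChart. The defaults (harvester in fleet-local) are assumptions.
managedchart_conditions() {
  name=${1:-harvester}
  namespace=${2:-fleet-local}
  kubectl get managedcharts.management.cattle.io "$name" -n "$namespace" \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}

# Usage:
#   managedchart_conditions                 # harvester in fleet-local
#   managedchart_conditions harvester-crd   # another chart, same namespace
```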

Expected behavior

A single-node upgrade should not produce the above error.

Support bundle

Environment

  • Harvester ISO version: v1.1.1 upgrade
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context

For a workaround, see https://github.com/helm/helm/issues/8987#issuecomment-786149813; when rolling back, make sure the target revision is correct.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 25 (14 by maintainers)

Most upvoted comments

The workaround is to roll back the problematic chart.

First, we need the helm release name and namespace of the failing bundle. List the bundles to find it:

$ kubectl get bundles -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1
fleet-local   local-managed-system-agent                    1/1
fleet-local   mcc-harvester                                 1/1
fleet-local   mcc-harvester-crd                             0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
fleet-local   mcc-local-managed-system-upgrade-controller   1/1
fleet-local   mcc-rancher-logging                           1/1
fleet-local   mcc-rancher-logging-crd                       1/1
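
On a cluster with many bundles, the failing ones can be filtered out rather than spotted by eye. This is a minimal sketch, assuming the column layout shown above (namespace, name, then an N/M ready count). The helper name is hypothetical.

```shell
# Hypothetical helper: read the table printed by `kubectl get bundles -A`
# on stdin and print "<namespace> <name>" for every bundle whose
# BUNDLEDEPLOYMENTS-READY count is not complete (e.g. 0/1).
not_ready_bundles() {
  awk 'NR > 1 && $3 ~ /^[0-9]+\/[0-9]+$/ {
    split($3, ready, "/")
    if (ready[1] != ready[2]) print $1, $2
  }'
}

# Usage against a live cluster:
#   kubectl get bundles -A | not_ready_bundles
```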

The problematic bundle is mcc-harvester-crd. We can then get its helm release namespace and release name with:

$ kubectl get bundle -n fleet-local mcc-harvester-crd -o yaml | yq '.spec.defaultNamespace + " " + .spec.helm.releaseName'
harvester-system harvester-crd
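
If yq is not installed on the node, kubectl's built-in JSONPath output can read the same two fields. This is a minimal sketch under that assumption, with a hypothetical helper name:

```shell
# Hypothetical helper: print "<namespace> <release>" for a Fleet bundle
# using kubectl's JSONPath output, so yq is not required.
# $1 = bundle namespace, $2 = bundle name
bundle_release() {
  kubectl get bundle -n "$1" "$2" \
    -o jsonpath='{.spec.defaultNamespace} {.spec.helm.releaseName}{"\n"}'
}

# Usage:
#   bundle_release fleet-local mcc-harvester-crd
```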

Then, check if the previous revision is sane:

$ helm history harvester-crd -n harvester-system
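
Helm reports the "another operation ... is in progress" error when the latest revision is left in a pending-install, pending-upgrade, or pending-rollback state, so the history is worth scanning for those statuses. This is a rough sketch that parses the default table output of helm history. The helper names are hypothetical, and the result should still be checked by eye.

```shell
# Hypothetical helpers; both read the table printed by
# `helm history <release> -n <namespace>` on stdin.
# Column 1 of each data row is the revision number.

# Print revisions stuck in a pending-* state.
pending_revisions() {
  awk 'NR > 1 && /pending-(install|upgrade|rollback)/ { print $1 }'
}

# Print the highest revision marked deployed or superseded, i.e. a
# candidate rollback target. Inspect it before relying on it.
last_completed_revision() {
  awk 'NR > 1 && /[ \t](deployed|superseded)[ \t]/ { rev = $1 }
       END { if (rev != "") print rev }'
}

# Usage:
#   helm history harvester-crd -n harvester-system | pending_revisions
#   helm history harvester-crd -n harvester-system | last_completed_revision
```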

Then roll back the chart:

$ helm rollback harvester-crd -n harvester-system
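
Without a revision argument, helm rollback targets the previous revision. If that is not the revision you want, helm rollback also accepts an explicit revision number after the release name. This is a minimal sketch with a hypothetical wrapper that rejects a non-numeric revision before calling helm:

```shell
# Hypothetical wrapper: roll a release back to an explicit revision.
# $1 = release name, $2 = namespace, $3 = revision number
rollback_release() {
  release=$1
  namespace=$2
  revision=$3
  case "$revision" in
    '' | *[!0-9]*)
      echo "revision must be numeric" >&2
      return 1
      ;;
  esac
  helm rollback "$release" "$revision" -n "$namespace"
}

# Usage (the revision number here is only an example):
#   rollback_release harvester-crd harvester-system 2
```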

And check if the bundle becomes Ready again:

$ kubectl get bundles -A

Note, you can download helm here: https://github.com/helm/helm/releases/tag/v3.11.3

@irishgordo https://github.com/harvester/harvester/pull/3643 reduces the chance of seeing the issue for Harvester-managed charts. The fleet-agent-local chart is still out of our control. We should have a document advising users to roll back the chart if this happens. cc @w13915984028

@irishgordo Yes, please help test with rc4. This should be quite easy to reproduce with rc3 (single node)…

Attempt to fix by scaling the fleet agent replicas to 0 before upgrading RKE2 and scaling them back to 1 afterwards: https://github.com/harvester/harvester/pull/3641
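
Done by hand, the scale-down and scale-up could look roughly like the sketch below. This is not the code from the PR. The deployment name (fleet-agent) and namespace (cattle-fleet-local-system) are assumptions about where the local fleet agent runs, and the helper name is hypothetical. Verify both on your cluster first.

```shell
# Hypothetical helper: set the replica count of the local fleet agent.
# The default deployment name and namespace are assumptions; override
# them with FLEET_AGENT_NAME / FLEET_AGENT_NAMESPACE if yours differ.
scale_fleet_agent() {
  replicas=$1
  kubectl scale deployment "${FLEET_AGENT_NAME:-fleet-agent}" \
    -n "${FLEET_AGENT_NAMESPACE:-cattle-fleet-local-system}" \
    --replicas="$replicas"
}

# Usage:
#   scale_fleet_agent 0   # before the RKE2 upgrade
#   scale_fleet_agent 1   # after the node is back up
```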