harvester: [BUG] Harvester single node upgrade will get `another operation (install/upgrade/rollback) is in progress` error

Describe the bug

A Harvester single-node upgrade hits the error `another operation (install/upgrade/rollback) is in progress` after the node reboots. This blocks the next Harvester upgrade or managedChart update. Potentially related to https://github.com/helm/helm/issues/8987#issuecomment-786149813

To Reproduce

Steps to reproduce the behavior:

1. Install a Harvester cluster with an old version, e.g., v1.1.1.
2. Upgrade the Harvester cluster to a newer version, e.g., v1.1.2-head.iso.
3. After the upgrade completes and the upgrade status shows success, check the harvester managedChart status. It contains the following error:

conditions:
  - lastUpdateTime: "2023-03-08T05:42:32Z"
    message: 'ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback)
      is in progress]; daemonset.apps harvester-system/kube-vip [progressing] Available:
      0/1; kubevirt.kubevirt.io harvester-system/kubevirt [progressing] Deployin

Expected behavior

A single-node upgrade should complete without the above error.

Support bundle

Environment

Harvester ISO version: v1.1.1 upgrade
Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context

For a workaround, see https://github.com/helm/helm/issues/8987#issuecomment-786149813; however, make sure you roll back to the correct revision.

About this issue

State: closed
Created a year ago
Comments: 25 (14 by maintainers)

Most upvoted comments

The workaround is to roll back the problematic chart.

First, list the bundles to find the failing one, so we can look up its helm release name and namespace:

$ kubectl get bundles -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1
fleet-local   local-managed-system-agent                    1/1
fleet-local   mcc-harvester                                 1/1
fleet-local   mcc-harvester-crd                             0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
fleet-local   mcc-local-managed-system-upgrade-controller   1/1
fleet-local   mcc-rancher-logging                           1/1
fleet-local   mcc-rancher-logging-crd                       1/1
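
On a cluster with many bundles, picking out the failing ones by eye is error-prone. A minimal sketch, assuming the error text appears in the STATUS column exactly as shown above; the function name `find_stuck_bundles` is made up for illustration:

```shell
# Print "<namespace> <bundle>" for every Fleet bundle whose status line
# contains the stuck-Helm-operation error.
# find_stuck_bundles is a hypothetical helper name, not part of any tool.
find_stuck_bundles() {
  kubectl get bundles -A --no-headers |
    awk 'index($0, "another operation (install/upgrade/rollback) is in progress") { print $1, $2 }'
}
```

Calling `find_stuck_bundles` against the listing above would be expected to print only the `mcc-harvester-crd` line.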

The problematic bundle is mcc-harvester-crd. We can then get the namespace and helm release name of that bundle with:

$ kubectl get bundle -n fleet-local mcc-harvester-crd -o yaml | yq '.spec.defaultNamespace + " " + .spec.helm.releaseName'
harvester-system harvester-crd
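
If `yq` is not installed on the node, the same two fields can probably be read with kubectl's built-in jsonpath output instead. A sketch under that assumption; `bundle_release` is a hypothetical helper name, and the field paths are the ones used in the `yq` expression above:

```shell
# Print "<namespace> <release>" for a bundle in the fleet-local namespace,
# using kubectl's jsonpath output instead of yq.
# bundle_release is a hypothetical helper name.
bundle_release() {
  kubectl get bundle -n fleet-local "$1" \
    -o jsonpath='{.spec.defaultNamespace} {.spec.helm.releaseName}'
}
```

For example, `bundle_release mcc-harvester-crd` should yield the same namespace and release name pair as the `yq` command.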

Then, check if the previous revision is sane:

$ helm history harvester-crd -n harvester-system
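
Helm reports this error when the latest revision of a release is left in a pending state, so it can help to confirm that before rolling back. A minimal sketch using `helm list --pending`; the helper name `release_is_pending` is an assumption for illustration:

```shell
# Exit 0 if the named release is currently in a pending-* state
# (pending-install, pending-upgrade or pending-rollback), non-zero otherwise.
# release_is_pending is a hypothetical helper name.
release_is_pending() {
  ns=$1
  release=$2
  helm list -n "$ns" --pending --short | grep -qxF -- "$release"
}
```

Usage: `release_is_pending harvester-system harvester-crd && echo "release is stuck"`.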

Then roll back the chart (with no revision argument, helm rolls back to the previous revision):

$ helm rollback harvester-crd -n harvester-system

And check if the bundle becomes Ready again:

$ kubectl get bundles -A
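
Fleet may need a little time to reconcile after the rollback, so a single check can still show the old status. A rough polling sketch, assuming the BUNDLEDEPLOYMENTS-READY column is the second column when a single namespace is queried; `wait_bundle_ready` and its retry and delay parameters are hypothetical:

```shell
# Poll a bundle in fleet-local until its BUNDLEDEPLOYMENTS-READY column
# shows N/N with N > 0. Arguments: bundle name, number of tries (default 30),
# seconds between tries (default 10). Returns 1 if it never becomes ready.
# wait_bundle_ready is a hypothetical helper name.
wait_bundle_ready() {
  name=$1
  tries=${2:-30}
  delay=${3:-10}
  while [ "$tries" -gt 0 ]; do
    ready=$(kubectl get bundle -n fleet-local "$name" --no-headers | awk '{ print $2 }')
    case $ready in
      */*) [ "${ready%/*}" = "${ready#*/}" ] && [ "${ready%/*}" != "0" ] && return 0 ;;
    esac
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```

Usage: `wait_bundle_ready mcc-harvester-crd || echo "bundle still not ready"`.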

Note: you can download helm here: https://github.com/helm/helm/releases/tag/v3.11.3

bk201 on May 3, 2023

We’ll use https://github.com/harvester/harvester/issues/3675 in 1.2.0 to track the upstream issue https://github.com/rancher/fleet/issues/637

bk201 on Mar 30, 2023

@irishgordo https://github.com/harvester/harvester/pull/3643 reduces the chance of seeing the issue for Harvester-managed charts. The fleet-agent-local chart is still out of our control. We should have a document that advises rolling back the chart if this happens. cc @w13915984028

bk201 on Mar 29, 2023

@irishgordo Yes, please help test with rc4. This should be quite easy to reproduce with rc3 (single node)…

bk201 on Mar 24, 2023

Attempt to fix by scaling the fleet agent replicas to 0 before upgrading RKE2 and scaling them back to 1 afterwards: https://github.com/harvester/harvester/pull/3641

bk201 on Mar 13, 2023
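
The scale-down and scale-up idea from the comment above can be sketched roughly as follows. This is not the code from the linked PR. The deployment name `fleet-agent` and namespace `cattle-fleet-local-system` are assumptions about where the local fleet agent runs and can be overridden through the hypothetical `FLEET_AGENT_DEPLOY` and `FLEET_AGENT_NS` variables:

```shell
# Scale the local fleet agent deployment to the given replica count.
# ASSUMPTION: deployment "fleet-agent" in namespace "cattle-fleet-local-system";
# override with FLEET_AGENT_DEPLOY / FLEET_AGENT_NS if your cluster differs.
# scale_fleet_agent is a hypothetical helper name.
scale_fleet_agent() {
  kubectl scale deployment "${FLEET_AGENT_DEPLOY:-fleet-agent}" \
    -n "${FLEET_AGENT_NS:-cattle-fleet-local-system}" \
    --replicas="$1"
}
```

Usage: run `scale_fleet_agent 0` before the RKE2 upgrade and `scale_fleet_agent 1` once the node is back.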