harvester: [BUG] Upgrade from 1.1.2 to 1.2.0 stuck 50% through the "Upgrading System Service"

Describe the bug

Harvester upgrade from 1.1.2 to 1.2.0 is stuck 50% of the way through the “Upgrading System Service” phase, after downloading everything and preloading the images onto the three nodes.

Following the advice on the upgrade notes page, I checked the hvst-upgrade apply-manifests job, which is emitting this message every 5 seconds:

$ kubectl --context harvester003 -n harvester-system logs hvst-upgrade-6hp8q-apply-manifests-9j9m6 --tail=10
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...

And it's true: there are two instance-manager-r pods on that node, one 11 hours old running longhorn-instance-manager:v1.4.3 and the other 12 days old running longhorn-instance-manager:v1_20221003.
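For reference, this is roughly how they can be listed (a sketch; the longhorn.io/component label name is an assumption, and the node-name field selector does the real filtering):

$ # List Longhorn instance-manager pods scheduled on harvester001
$ kubectl --context harvester003 -n longhorn-system get pods \
    --field-selector spec.nodeName=harvester001 \
    -l longhorn.io/component=instance-manager -o wide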

The issue was reported on Slack and @Vicente-Cheng helpfully inspected the support bundle and said:

I checked the old im-r. The replica instances of this im-r are all deleted, as the following check shows:

$ kubectl get instancemanager instance-manager-r-1503169c -n longhorn-system -o yaml |yq -e ".status.instances" |grep name: > replica-list.txt
$ cat replica-list.txt |awk '{print $2}' |xargs -I {} kubectl get replicas {} -n longhorn-system

And I agree - every one of those lookups returned:

Error from server (NotFound): replicas.longhorn.io "pvc-<BLAH>" not found

@Vicente-Cheng helpfully continued:

But somehow, they all still exist on the instancemanager. That’s why this im-r could not be deleted. I checked all of the attached volumes, and it looks like they are all healthy. So you could directly remove this im-r to make the upgrade continue.
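I read that as deleting the stale InstanceManager custom resource itself (a sketch, assuming that is what was meant; the resource name is the one from my cluster):

$ # Remove the stale replica instance manager CR so the apply-manifests job can proceed
$ kubectl --context harvester003 -n longhorn-system delete instancemanager instance-manager-r-1503169c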

I will do that and comment further…

Support bundle https://rancher-users.slack.com/files/U96EY1QHZ/F05S0Q14CKW/supportbundle_5bb44244-434e-4530-ad35-35c4ef1ff661_2023-09-12t09-10-09z.zip

Environment

  • Harvester ISO version: upgrade from 1.1.2 to 1.2.0
  • Underlying Infrastructure: Bare-metal Asus PN50

Additional context


Most upvoted comments

Support bundle uploaded to Slack here.

Hi @w13915984028,

Making the change to add the CA into apiServerCA of the fleet-agent-bootstrap secret, and changing the value in settings.management.cattle.io to https://<server-hostname>, did make something happen, and the fleet-agent carried on and did some work.
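For anyone hitting the same thing, this is roughly the shape of that change (a sketch: I am assuming the setting being edited is the Rancher server-url setting, that ca.pem is a file holding the cluster CA in PEM form, and that the secret stores the CA base64-encoded like its other keys):

$ # Add/overwrite the apiServerCA key in the fleet-agent bootstrap secret
$ kubectl -n cattle-fleet-local-system patch secret fleet-agent-bootstrap \
    --type merge -p "{\"data\":{\"apiServerCA\":\"$(base64 -w0 < ca.pem)\"}}"
$ # Point the management setting at the hostname instead of the bare IP
$ kubectl patch settings.management.cattle.io server-url \
    --type merge -p '{"value":"https://<server-hostname>"}'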

But looking at the upgrade status in Harvester, it says fail and the error is listed as Job has reached the specified backoff limit. After all this time it looks like I may need a way to kick off the process again.

I tried to start the upgrade again by running the following and hitting the “upgrade” button, but that didn’t work.

$ curl https://releases.rancher.com/harvester/v1.2.0/version.yaml | kubectl apply --context=harvester003 -f -

Instead of starting the upgrade, I got an error saying either

admission webhook "validator.harvesterhci.io" denied the request: managed chart rancher-monitoring is not ready, please wait for it to be ready

or

admission webhook "validator.harvesterhci.io" denied the request: managed chart rancher-logging is not ready, please wait for it to be ready

so I have deleted that version again.
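Deleting it again can be done by feeding the same manifest back to kubectl delete (a sketch of one way to do it, not necessarily how it was done here):

$ # Delete the Version object created by the apply above
$ curl https://releases.rancher.com/harvester/v1.2.0/version.yaml | kubectl delete --context=harvester003 -f -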

Maybe if I make the hvst-upgrade-6hp8q-apply-manifests job run again… No, that doesn’t work either: lots of errors about You are using a client virtctl version that is different from the KubeVirt version running in the cluster.

I think at this stage I’m doomed to having to remove each node from the cluster one at a time, install from the 1.2.0 ISO image, and re-add it to the cluster.

I reproduced this issue in my local cluster, and will update the validated workaround soon.

Thanks for the update and your perseverance on this issue. I’m away at a business conference until Monday evening so I will only be able to test this on Tuesday morning UK time.

@himslm01 it seems to be similar; please further check the log of the fleet-agent-* pod to see if it has errors like the one below:

… level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.0.34/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.0.34 because it doesn't contain any IP SANs"
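A quick way to pull those logs (a sketch; the app=fleet-agent label selector is an assumption based on the standard fleet-agent deployment):

$ # Tail the local fleet-agent logs and look for registration failures
$ kubectl -n cattle-fleet-local-system logs -l app=fleet-agent --tail=100 | grep -i 'failed to register'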

related workaround: https://github.com/harvester/harvester/issues/4519#issuecomment-1715692409