harvester: [BUG] Upgrade from 1.1.2 to 1.2.0 stuck 50% through the "Upgrading System Service"
Describe the bug
The Harvester upgrade from 1.1.2 to 1.2.0 has become stuck at 50% through the “Upgrading System Service” phase, after downloading everything and preloading the images onto all three nodes.
Following the advice on the upgrade notes page, I checked the hvst-upgrade apply-manifests job, which is emitting this message every 5 seconds:
$ kubectl --context harvester003 -n harvester-system logs hvst-upgrade-6hp8q-apply-manifests-9j9m6 --tail=10
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...
And it’s true - there are two instance-manager-r pods on that node - one 11 hours old running longhorn-instance-manager:v1.4.3 and the other 12 days old running longhorn-instance-manager:v1_20221003.
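To see the duplicate directly, one way (a sketch; the field selector is standard kubectl, and the grep simply narrows the output) is to list the Longhorn pods scheduled on that node:
$ kubectl --context harvester003 -n longhorn-system get pods -o wide --field-selector spec.nodeName=harvester001 | grep instance-manager-r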
The issue was reported on Slack and @Vicente-Cheng helpfully inspected the support bundle and said:
I checked the old im-r. The replica instances of this im-r have all been deleted, as the following check shows:
$ kubectl get instancemanager instance-manager-r-1503169c -n longhorn-system -o yaml |yq -e ".status.instances" |grep name: > replica-list.txt
$ cat replica-list.txt |awk '{print $2}' |xargs -I {} kubectl get replicas {} -n longhorn-system
And I agree - for every replica in that list, the lookup returned:
Error from server (NotFound): replicas.longhorn.io "pvc-<BLAH>" not found
@Vicente-Cheng helpfully continued:
But somehow, they all still exist on the instancemanager. That’s why this im-r could not be deleted. I checked all of the attached volumes, and it looks like they are all healthy. So you could directly remove this im-r to make the upgrade continue.
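In practice that removal is a single delete of the instancemanager CR identified in the check above; a sketch of what it might look like (deleting the CR should also remove the stale pod, which is owned by it):
$ kubectl --context harvester003 -n longhorn-system delete instancemanager instance-manager-r-1503169c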
I will do that and comment further…
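For the record, the point about the attached volumes being healthy can also be spot-checked straight from the cluster rather than from the support bundle; a minimal sketch, assuming the usual state and robustness fields on the Longhorn Volume CR:
$ kubectl --context harvester003 -n longhorn-system get volumes.longhorn.io -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness
Any volume not reporting healthy robustness would be worth investigating before deleting the instance manager.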
Support bundle https://rancher-users.slack.com/files/U96EY1QHZ/F05S0Q14CKW/supportbundle_5bb44244-434e-4530-ad35-35c4ef1ff661_2023-09-12t09-10-09z.zip
Environment
- Harvester ISO version: upgrade from 1.1.2 to 1.2.0
- Underlying Infrastructure: Bare-metal Asus PN50
Additional context
Support bundle uploaded to Slack here.
Hi @w13915984028,
Making the change to add the CA into apiServerCA of the secret fleet-agent-bootstrap, and changing the value in settings.management.cattle.io to https://<server-hostname>, did make something happen, and the fleet-agent carried on to do something. But looking at the upgrade status in Harvester it says fail, and the error is listed as “Job has reached the specified backoff limit”. After all this time it looks like I may need a way to kick off the process again.
I tried to start the upgrade again by running the following and hitting the “upgrade” button - but that didn’t work.
$ curl https://releases.rancher.com/harvester/v1.2.0/version.yaml | kubectl apply --context=harvester003 -f -
Instead of starting the upgrade I got an error saying either
admission webhook "validator.harvesterhci.io" denied the request: managed chart rancher-monitoring is not ready, please wait for it to be ready
or
admission webhook "validator.harvesterhci.io" denied the request: managed chart rancher-logging is not ready, please wait for it to be ready
so I have deleted that version again.
Maybe if I make the hvst-upgrade-6hp8q-apply-manifests job run again… No - that doesn’t work - lots of errors about
You are using a client virtctl version that is different from the KubeVirt version running in the cluster.
I think at this stage I’m doomed to have to remove each node from the cluster, one at a time, install from the ISO image of 1.2.0, and re-add each node back into the cluster.
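If re-running the existing apply-manifests job is the goal, one option (a sketch, not an official procedure) is to delete its current pod, selected via the standard job-name label, and let the job controller schedule a fresh one; this only helps while the job itself has not already been marked failed:
$ kubectl --context harvester003 -n harvester-system delete pod -l job-name=hvst-upgrade-6hp8q-apply-manifests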
I reproduced this issue in my local cluster, and will update the validated workaround soon.
Thanks for the update and your perseverance on this issue. I’m away at a business conference until Monday evening so I will only be able to test this on Tuesday morning UK time.
@himslm01 it seems to be similar, please further check the log of such a pod fleet-agent-*, to see if it has the below errors.
Related workaround: https://github.com/harvester/harvester/issues/4519#issuecomment-1715692409
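Pulling that pod’s recent logs would look something like this (the namespace is an assumption; on a Harvester cluster the local fleet agent normally runs in cattle-fleet-local-system):
$ kubectl --context harvester003 -n cattle-fleet-local-system logs deploy/fleet-agent --tail=100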