harvester: [BUG] Upgrade stuck with first node in Post-draining state
Describe the bug
Upgrading from v1.1.2 to v1.2.0, the first node to be upgraded gets stuck in the Post-draining state.
To Reproduce
Steps to reproduce the behavior:
- Create a version object to enable the v1.2.0 upgrade, run the pre-check script (all checks pass), then upgrade.
- Had to edit the Secret fleet-agent-bootstrap to contain the FQDN URL instead of the VIP URL, since we use Let's Encrypt and the SAN doesn't contain the IP, per this comment: https://github.com/harvester/harvester/issues/4519#issuecomment-1715692409 (an illustrative CLI sketch follows this list).
- The upgrade from v1.1.1 to v1.1.2 left rancher-logging-crd in a Pending-Upgrade state; it had to be deleted so the upgrade could recreate it, which seems to have worked fine. Its status was:
ErrApplied(1) [Cluster fleet-local/local: cannot re-use a name that is still in use]
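For reference, a hedged CLI sketch of the prep steps above; only the Secret name fleet-agent-bootstrap and the release name rancher-logging-crd come from this report, while the namespaces and placeholders are assumptions.

# Locate the fleet-agent-bootstrap Secret (namespace not stated here), then edit it
# so the server URL uses the FQDN instead of the VIP; note the value is base64-encoded.
kubectl get secret -A | grep fleet-agent-bootstrap
kubectl -n <namespace-from-above> edit secret fleet-agent-bootstrap
# List Helm releases in all states to spot the one stuck in Pending-Upgrade.
helm list -A -a | grep rancher-logging-crd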
Expected behavior
Upgrade to finish successfully.
Support bundle
E-mailed the support bundle to harvester-support-bundle [ at ] suse.com
Environment
- Harvester ISO version: v1.2.0
- Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Baremetal using Supermicro’s Microcloud
Additional context
From kubectl --context pre-harvester01 -n harvester-system logs -f jobs/hvst-upgrade-f79bg-post-drain-pre-harvester01-01:
++ date +%Y%m%d%H%M%S
+ elemental_upgrade_log=/usr/local/upgrade_tmp/elemental-upgrade-20230913120242.log
+ local ret=0
+ mount --bind /usr/local/bin/elemental /host/usr/bin/elemental
+ chroot /host elemental upgrade --logfile /usr/local/upgrade_tmp/elemental-upgrade-20230913120242.log --directory /tmp/tmp.n2JZm1dxDv --config-dir /tmp/tmp.N6rn4F6mKM
Flag --directory has been deprecated, 'directory' is deprecated please use 'system' instead
INFO[2023-09-13T12:02:42Z] Starting elemental version 0.3.1
INFO[2023-09-13T12:02:42Z] reading configuration form '/tmp/tmp.N6rn4F6mKM'
ERRO[2023-09-13T12:02:42Z] Invalid upgrade command setup undefined state partition
elemental upgrade failed with return code: 33
+ ret=33
+ '[' 33 '!=' 0 ']'
+ echo 'elemental upgrade failed with return code: 33'
+ cat /host/usr/local/upgrade_tmp/elemental-upgrade-20230913120242.log
INFO[2023-09-13T12:02:42Z] Starting elemental version 0.3.1
INFO[2023-09-13T12:02:42Z] reading configuration form '/tmp/tmp.N6rn4F6mKM'
ERRO[2023-09-13T12:02:42Z] Invalid upgrade command setup undefined state partition
+ exit 33
+ clean_up_tmp_files
+ '[' -n /host/tmp/tmp.n2JZm1dxDv ']'
+ echo 'Try to unmount /host/tmp/tmp.n2JZm1dxDv...'
+ umount /host/tmp/tmp.n2JZm1dxDv
Try to unmount /host/tmp/tmp.n2JZm1dxDv...
+ '[' -n /host/usr/bin/elemental ']'
+ echo 'Try to unmount /host/usr/bin/elemental...'
+ umount /host/usr/bin/elemental
Try to unmount /host/usr/bin/elemental...
Clean up tmp files...
+ echo 'Clean up tmp files...'
+ '[' -n '' ']'
+ '[' -n /host/usr/local/upgrade_tmp/tmp.OoKOTjupTE ']'
+ echo 'Try to remove /host/usr/local/upgrade_tmp/tmp.OoKOTjupTE...'
+ rm -vf /host/usr/local/upgrade_tmp/tmp.OoKOTjupTE
Try to remove /host/usr/local/upgrade_tmp/tmp.OoKOTjupTE...
removed '/host/usr/local/upgrade_tmp/tmp.OoKOTjupTE'
[Wed Sep 13 12:02:42 UTC 2023] Running "upgrade_node.sh post-drain" errors, will retry after 10 minutes (6 retries)...
Seems like there’s a problem when running the elemental command.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 44 (24 by maintainers)
Detailed analysis can be found in this comment.
TL;DR: the incomplete state.yaml and the older state partition name p.state cause this problem. An incomplete state.yaml prevents elemental from finding the state partition by the correct filesystem label COS_STATE; the old partition name p.state prevents elemental from finding the state partition as well. This only affects Harvester v1.1.1 and earlier versions. (This means that if you upgrade from v1.1.1 or an earlier version, you may encounter this issue.)
Quick checks you can do before the upgrade:

Check the state partition name. If the partition name is state, you will not encounter this issue; otherwise you need to check your state.yaml.
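As an illustration of that check (not taken from the original comment), the following run on the node shows both the partition name and the filesystem label; the device name is an assumption.

# Show partition names (PARTLABEL) and filesystem labels (LABEL) for all block devices.
# The state partition should be named "state" (older installs may show "p.state")
# and carry the filesystem label COS_STATE.
lsblk -o NAME,PARTLABEL,LABEL,FSTYPE,MOUNTPOINT
# Or query a single partition directly; /dev/sda4 is only an example device.
blkid /dev/sda4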
Check the state.yaml. If state.yaml contains the state partition entry with the correct filesystem label COS_STATE, the elemental command can find the state partition; with an incomplete state.yaml that lacks the label, the elemental command cannot find the state partition.

The simple workaround is to delete the state.yaml. If there is no state.yaml, the elemental command uses its default setup, so it can still find the state partition by the default filesystem label COS_STATE.

Just chiming in that I hit the same issue, and was able to successfully upgrade thanks to the workarounds mentioned here. Thanks all!
To summarize the issue and workaround:
Hi @egrist, you can run sudo mount -o remount,rw /run/initramfs/cos-state to remount the state partition so it becomes writable, and then sudo mount -o remount,ro /run/initramfs/cos-state when done.
@Vicente-Cheng revalidation passes from: v1.1.1 -> v1.1.2 -> v1.2.1-rc1 😄
This looks good, I’ll go ahead and close this out 😄
Hi @irishgordo, I thought this issue was related to the incomplete state.yaml and could be verified with the Test plan from https://github.com/harvester/harvester/pull/4566. Could you elaborate more on how you verified this? Thanks!
@egrist Good to know it upgraded; the workaround works.
@Vicente-Cheng thanks for the manually regenerated file for the new upgrade job.
@davidcassany @frelon please check the elemental debug log, thanks.
@Vicente-Cheng Since I edited state.yaml on the first node and added the job, it completed successfully quite fast, so I didn't get time to edit the upgrade_node.sh script. However, when the upgrade continued to node number 2 (which is very similar to the other nodes), the same problem occurred. On node number 2 I added the --debug flag in upgrade_node.sh and then fetched the log, which I'll post below. After it failed, I edited the state.yaml and deleted the pod (not the actual job 😛) so that a new pod started and completed successfully.
pod/hvst-upgrade-f79bg-post-drain-pre-harvester01-02-hr2mdwithchroot $HOST_DIR elemental upgrade --debugenabled:Adding
label: COS_STATEto/run/initramfs/cos-state/state.yamlsolved this particular issue.EDIT: Harvester v1.1.2 upgrade to v1.2.0 completed successfully.
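For anyone following along, a hedged sketch of the "delete the pod, not the job" step described above; the pod name is the one quoted in this comment and will differ per upgrade and node.

# Find the failing post-drain pod for this node, then delete only the pod so the job
# controller starts a fresh one after state.yaml has been fixed.
kubectl -n harvester-system get pods | grep post-drain
kubectl -n harvester-system delete pod hvst-upgrade-f79bg-post-drain-pre-harvester01-02-hr2md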
Hi @egrist, I regenerated your post-drain yaml file as follows.

Please use kubectl apply to re-apply it. Then please add --debug to the elemental command (please refer to the following steps).

You can find the related pod with the following command. Then go into this pod and try to modify the bash script /usr/local/bin/upgrade_node.sh like the following:

This job retries every 10 minutes on failure, so keep waiting for the next run. Thanks!
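A rough sketch of the sequence those steps describe, under the assumption that the regenerated manifest was saved locally; the manifest filename and the exact edit inside upgrade_node.sh are placeholders.

# Re-apply the regenerated post-drain job manifest (filename is an assumption).
kubectl apply -f hvst-upgrade-post-drain.yaml
# Find the post-drain pod for the node, then open a shell inside it.
kubectl -n harvester-system get pods | grep post-drain
kubectl -n harvester-system exec -it <post-drain-pod-name> -- bash
# Inside the pod, edit the upgrade script and append --debug to the
# "elemental upgrade" invocation, then wait for the next retry (every 10 minutes).
vi /usr/local/bin/upgrade_node.sh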
cc @bk201
@davidcassany thanks. When any information is needed, @egrist will help to get it from his server.
The Harvester team is now trying a workaround to continue the stuck upgrade.
Probably I missed it somewhere in the discussion, but I am wondering about the firmware and the partition table type. Could this error be raised when using MSDOS partition tables instead of GPT? The expectation from elemental-toolkit is that MSDOS partition tables are used in a BIOS context only.
The expectation with GPT partition tables is that, even if the filesystem label data is somehow lost, partitions can still be found by the partition label, which is not customizable and hence is always an elemental hardcoded value.
@egrist I am calling in help from elemental-toolkit. They will join.

@egrist: please try to edit this yaml, add label: COS_STATE under state (same level as active), and kill the current pending pod.
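A minimal sketch of that edit, combining the remount commands and file path quoted earlier in the thread; the surrounding YAML structure of state.yaml is not reproduced here.

# Make the state partition writable, add the missing label, then restore read-only.
sudo mount -o remount,rw /run/initramfs/cos-state
sudo vi /run/initramfs/cos-state/state.yaml
# In the editor, under the "state" key (at the same level as "active"), add:
#   label: COS_STATE
sudo mount -o remount,ro /run/initramfs/cos-state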