harvester: [BUG] Upgrade stuck with first node in Post-draining state

Describe the bug When upgrading from v1.1.2 to v1.2.0, the first node to be updated gets stuck in the Post-draining state.

To Reproduce Steps to reproduce the behavior:

  1. Create a Version object to enable the v1.2.0 upgrade, run the pre-check script with all checks passing, then start the upgrade.
  2. Edit the Secret fleet-agent-bootstrap to contain the FQDN URL instead of the VIP URL, since we use Let's Encrypt and the SAN doesn’t contain the IP, per this comment: https://github.com/harvester/harvester/issues/4519#issuecomment-1715692409
  3. The upgrade from v1.1.1 to v1.1.2 left rancher-logging-crd in the Pending-Upgrade state; it had to be deleted so that it could be recreated by the upgrade, which seems to work fine. Its status was ErrApplied(1) [Cluster fleet-local/local: cannot re-use a name that is still in use]

Expected behavior The upgrade finishes successfully.

Support bundle E-mailed the support bundle to harvester-support-bundle [ at ] suse.com

Environment

  • Harvester ISO version: v1.2.0
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Baremetal using Supermicro’s Microcloud

Additional context From kubectl --context pre-harvester01 -n harvester-system logs -f jobs/hvst-upgrade-f79bg-post-drain-pre-harvester01-01:

++ date +%Y%m%d%H%M%S
+ elemental_upgrade_log=/usr/local/upgrade_tmp/elemental-upgrade-20230913120242.log
+ local ret=0
+ mount --bind /usr/local/bin/elemental /host/usr/bin/elemental
+ chroot /host elemental upgrade --logfile /usr/local/upgrade_tmp/elemental-upgrade-20230913120242.log --directory /tmp/tmp.n2JZm1dxDv --config-dir /tmp/tmp.N6rn4F6mKM
Flag --directory has been deprecated, 'directory' is deprecated please use 'system' instead
INFO[2023-09-13T12:02:42Z] Starting elemental version 0.3.1             
INFO[2023-09-13T12:02:42Z] reading configuration form '/tmp/tmp.N6rn4F6mKM' 
ERRO[2023-09-13T12:02:42Z] Invalid upgrade command setup undefined state partition 
elemental upgrade failed with return code: 33
+ ret=33
+ '[' 33 '!=' 0 ']'
+ echo 'elemental upgrade failed with return code: 33'
+ cat /host/usr/local/upgrade_tmp/elemental-upgrade-20230913120242.log
INFO[2023-09-13T12:02:42Z] Starting elemental version 0.3.1             
INFO[2023-09-13T12:02:42Z] reading configuration form '/tmp/tmp.N6rn4F6mKM' 
ERRO[2023-09-13T12:02:42Z] Invalid upgrade command setup undefined state partition 
+ exit 33
+ clean_up_tmp_files
+ '[' -n /host/tmp/tmp.n2JZm1dxDv ']'
+ echo 'Try to unmount /host/tmp/tmp.n2JZm1dxDv...'
+ umount /host/tmp/tmp.n2JZm1dxDv
Try to unmount /host/tmp/tmp.n2JZm1dxDv...
+ '[' -n /host/usr/bin/elemental ']'
+ echo 'Try to unmount /host/usr/bin/elemental...'
+ umount /host/usr/bin/elemental
Try to unmount /host/usr/bin/elemental...
Clean up tmp files...
+ echo 'Clean up tmp files...'
+ '[' -n '' ']'
+ '[' -n /host/usr/local/upgrade_tmp/tmp.OoKOTjupTE ']'
+ echo 'Try to remove /host/usr/local/upgrade_tmp/tmp.OoKOTjupTE...'
+ rm -vf /host/usr/local/upgrade_tmp/tmp.OoKOTjupTE
Try to remove /host/usr/local/upgrade_tmp/tmp.OoKOTjupTE...
removed '/host/usr/local/upgrade_tmp/tmp.OoKOTjupTE'
[Wed Sep 13 12:02:42 UTC 2023] Running "upgrade_node.sh post-drain" errors, will retry after 10 minutes (6 retries)...

Seems like there’s a problem when running the elemental command.


Most upvoted comments

Detailed analysis can be found in this comment

TL;DR: an incomplete state.yaml or the older state partition name p.state causes this problem.

  1. An incomplete state.yaml prevents elemental from finding the state partition by its filesystem label COS_STATE.
  2. The old partition name p.state also prevents elemental from finding the state partition.

This only affects Harvester v1.1.1 and earlier versions. (That means if you upgrade from v1.1.1 or an earlier version, you may encounter this issue.)

Quick checks you can do before the upgrade:

  • Check the state partition name

    # Finding the state partition
    harvester-node-0:~ # lsblk  -f |grep COS_STATE
    ├─vda4 ext4   1.0   COS_STATE       de068d8f-76be-443a-94c1-1bb830e00961    9.6G    29% /run/initramfs/cos-state
    harvester-node-0:~ # udevadm info /dev/vda4 |grep PART
    ...
    E: PARTNAME=state
    ...
    E: ID_PART_ENTRY_NAME=state
    ...
    

    If the partition name is state, this cause does not apply; you still need to check your state.yaml as described next.

  • Check the state.yaml

    # Autogenerated file by elemental client, do not edit
    
    date: "2023-09-23T17:59:24Z"
    ...
    state:
        label: COS_STATE <-- this is used for filesystem label
        active:
            source: dir:///run/rootfsbase
            label: COS_ACTIVE
            fs: ext2
        passive:
            source: dir:///run/rootfsbase
            label: COS_PASSIVE
            fs: ext2
    ...
    

    If the content of state.yaml looks like the above, the elemental command can find the state partition by the correct filesystem label COS_STATE.

    The elemental command cannot find the state partition with the following state.yaml:

    # Autogenerated file by elemental client, do not edit
    
    date: "2023-09-13T08:31:42Z"
    state:
        # we are missing `label` here.
        active:
            source: dir:///tmp/tmp.01deNrXNEC
            label: COS_ACTIVE
            fs: ext2
        passive: null
    

A simple workaround is to delete the state.yaml. Without any state.yaml, the elemental command falls back to the default setup, so it can still find the state partition by the default filesystem label COS_STATE.
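
For reference, that workaround might look roughly like the commands below (a sketch only, assuming the state partition is mounted read-only at /run/initramfs/cos-state as shown elsewhere in this thread):

# remount the state partition read-write (it is mounted read-only by default)
sudo mount -o remount,rw /run/initramfs/cos-state
# remove the incomplete state.yaml so elemental falls back to its default labels
sudo rm /run/initramfs/cos-state/state.yaml
# remount read-only again when done
sudo mount -o remount,ro /run/initramfs/cos-state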

Just chiming in that I hit the same issue, and was able to successfully upgrade thanks to the workarounds mentioned here. Thanks all!

To summarize the issue and workaround:

(1) The upgrade hangs on `post-drain` of a certain node; the pod log contains "elemental upgrade failed with return code: 33"

The error log looks like:

kubectl --context pre-harvester01 -n harvester-system logs -f jobs/hvst-upgrade-f79bg-post-drain-pre-harvester01-01:

....

ERRO[2023-09-13T12:02:42Z] Invalid upgrade command setup undefined state partition 
elemental upgrade failed with return code: 33
+ ret=33
...

(2) Confirm the issue via

pre-harvester01-01:~ # cat /run/initramfs/cos-state/state.yaml
# Autogenerated file by elemental client, do not edit

date: "2023-09-13T08:31:42Z"
state:
    label: COS_STATE /// if there is no such line, then the bug will happen
    active:
        source: dir:///tmp/tmp.01deNrXNEC
        label: COS_ACTIVE
        fs: ext2
    passive: null
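
A quick way to confirm this (a hedged one-liner; the top-level state label is the only place the string COS_STATE appears in this file):

# if this prints nothing, the top-level label is missing and the bug applies
grep 'label: COS_STATE' /run/initramfs/cos-state/state.yaml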

(3) Workaround:

a. remount the initramfs read-write so the state partition can be edited:
sudo mount -o remount,rw /run/initramfs/cos-state


b. add `label: COS_STATE` to /run/initramfs/cos-state/state.yaml
...
state:
    label: COS_STATE
    active:
        source: dir:///tmp/tmp.01deNrXNEC
...

c. remount the initramfs read-only again when done:
sudo mount -o remount,ro /run/initramfs/cos-state


d. delete the currently stuck pod (NOT the job); the replacement pod will continue and succeed, e.g. (see below for how to find the exact pod name):
kubectl delete pod -n harvester-system hvst-upgrade-f79bg-post-drain-pre-harvester-*...
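
If you are unsure of the exact pod name, the stuck post-drain pod can be listed by label first (the same selector used later in this thread):

kubectl get pod --selector=harvesterhci.io/upgradeJobType=post-drain -n harvester-system
# then delete the stuck pod by name; the job will create a replacement
kubectl delete pod -n harvester-system <stuck post-drain pod name>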

@w13915984028 That file seems to be stored on a read-only mount on the first host where this file exists; what’s the correct procedure to edit it?

pre-harvester01-01:~ # mount -l | grep /run/initramfs/cos-state
/dev/sdd3 on /run/initramfs/cos-state type ext4 (ro,relatime) [COS_STATE]

Hi @egrist, you can run sudo mount -o remount,rw /run/initramfs/cos-state to remount the state partition read-write, and then sudo mount -o remount,ro /run/initramfs/cos-state when done.

@Vicente-Cheng revalidation passes: v1.1.1 -> v1.1.2 -> v1.2.1-rc1 😄 This looks good, I’ll go ahead and close this out 😄

@Vicente-Cheng for testing this, I validated that the workaround worked in: v1.1.2 -> v1.2-b5acb14e-head

Setting up a storage-network (VLAN 900) using a combination of openvswitch and a custom network to funnel ipxe-vm traffic with tagged VLANs back out to the router (all on port 15, over a dedicated NIC on a laptop to support iPXE VMs), I was able to see the bundle issue, implemented the workaround when I encountered it, then continued the upgrade.

Do you feel we should go ahead and close this out as the workaround does indeed work when customers encounter the bundle issue with a storage-network?

Hi @irishgordo, I thought this issue was related to the incomplete state.yaml and could be verified with the Test plan from https://github.com/harvester/harvester/pull/4566.

Could you elaborate more about how you verify this? Thanks!

@egrist Good to know it upgraded; the workaround works.

@Vicente-Cheng thanks for the manually provided file for the new upgrade job.

@davidcassany @frelon please check the elemental debug log, thanks.

@Vicente-Cheng Since I had edited state.yaml on the first node and added the job, it completed successfully quite fast, so I didn’t get the time to edit the upgrade_node.sh script. However, it continued to node number 2, which is very similar to the other nodes, and the same problem occurred. On node number 2 I added the --debug flag in upgrade_node.sh and then fetched the log, which I’ll post below. After it failed, I edited the state.yaml and deleted the pod (not the actual job 😛) so that a new pod started and completed successfully.

Log from pod/hvst-upgrade-f79bg-post-drain-pre-harvester01-02-hr2md with chroot $HOST_DIR elemental upgrade --debug enabled:

+++ dirname /usr/local/bin/upgrade_node.sh
++ cd /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/lib.sh
++ UPGRADE_NAMESPACE=harvester-system
++ UPGRADE_REPO_URL=http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso
++ UPGRADE_REPO_VM_NAME=upgrade-repo-hvst-upgrade-f79bg
++ UPGRADE_REPO_RELEASE_FILE=http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/harvester-release.yaml
++ UPGRADE_REPO_SQUASHFS_IMAGE=http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/rootfs.squashfs
++ UPGRADE_REPO_BUNDLE_ROOT=http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/bundle
++ UPGRADE_REPO_BUNDLE_METADATA=http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/bundle/metadata.yaml
++ CACHED_BUNDLE_METADATA=
++ HOST_DIR=/host
+ UPGRADE_TMP_DIR=/host/usr/local/upgrade_tmp
+ mkdir -p /host/usr/local/upgrade_tmp
+ case $1 in
+ command_post_drain
+ wait_repo
++ get_repo_vm_status
++ kubectl get virtualmachines.kubevirt.io upgrade-repo-hvst-upgrade-f79bg -n harvester-system '-o=jsonpath={.status.printableStatus}'
+ [[ Running == \R\u\n\n\i\n\g ]]
+ curl -sfL http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/harvester-release.yaml
harvester: v1.2.0
harvesterChart: 1.2.0
os: Harvester v1.2.0
kubernetes: v1.25.9+rke2r1
rancher: v2.7.5
monitoringChart: 102.0.0+up40.1.2
loggingChart: 102.0.0+up3.17.10
kubevirt: 0.54.0-150400.3.19.1
minUpgradableVersion: 'v1.1.2'
rancherDependencies:
  fleet:
    chart: 102.1.0+up0.7.0
    app: 0.7.0
  fleet-crd:
    chart: 102.1.0+up0.7.0
    app: 0.7.0
  rancher-webhook:
    chart: 2.0.5+up0.3.5
    app: 0.3.5
+ detect_repo
++ mktemp --suffix=.yaml
+ release_file=/tmp/tmp.NYKakgbfbj.yaml
+ download_file http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/harvester-release.yaml /tmp/tmp.NYKakgbfbj.yaml
+ local url=http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/harvester-release.yaml
+ local output=/tmp/tmp.NYKakgbfbj.yaml
+ echo 'Downloading the file from "http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/harvester-release.yaml" to "/tmp/tmp.NYKakgbfbj.yaml"...'
+ local i=0
+ [[ 0 -lt 100 ]]
Downloading the file from "http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/harvester-release.yaml" to "/tmp/tmp.NYKakgbfbj.yaml"...
+ curl -sSfL http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/harvester-release.yaml -o /tmp/tmp.NYKakgbfbj.yaml
+ break
++ yq -e e .harvester /tmp/tmp.NYKakgbfbj.yaml
+ REPO_HARVESTER_VERSION=v1.2.0
++ yq -e e .harvesterChart /tmp/tmp.NYKakgbfbj.yaml
+ REPO_HARVESTER_CHART_VERSION=1.2.0
++ yq -e e .os /tmp/tmp.NYKakgbfbj.yaml
+ REPO_OS_PRETTY_NAME='Harvester v1.2.0'
+ REPO_OS_VERSION=v1.2.0
++ yq -e e .kubernetes /tmp/tmp.NYKakgbfbj.yaml
+ REPO_RKE2_VERSION=v1.25.9+rke2r1
++ yq -e e .rancher /tmp/tmp.NYKakgbfbj.yaml
+ REPO_RANCHER_VERSION=v2.7.5
++ yq -e e .monitoringChart /tmp/tmp.NYKakgbfbj.yaml
+ REPO_MONITORING_CHART_VERSION=102.0.0+up40.1.2
++ yq -e e .loggingChart /tmp/tmp.NYKakgbfbj.yaml
+ REPO_LOGGING_CHART_VERSION=102.0.0+up3.17.10
++ yq -e e .rancherDependencies.fleet.chart /tmp/tmp.NYKakgbfbj.yaml
+ REPO_FLEET_CHART_VERSION=102.1.0+up0.7.0
++ yq -e e .rancherDependencies.fleet.app /tmp/tmp.NYKakgbfbj.yaml
+ REPO_FLEET_APP_VERSION=0.7.0
++ yq -e e .rancherDependencies.fleet-crd.chart /tmp/tmp.NYKakgbfbj.yaml
+ REPO_FLEET_CRD_CHART_VERSION=102.1.0+up0.7.0
++ yq -e e .rancherDependencies.fleet-crd.app /tmp/tmp.NYKakgbfbj.yaml
+ REPO_FLEET_CRD_APP_VERSION=0.7.0
++ yq -e e .rancherDependencies.rancher-webhook.chart /tmp/tmp.NYKakgbfbj.yaml
+ REPO_RANCHER_WEBHOOK_CHART_VERSION=2.0.5+up0.3.5
++ yq -e e .rancherDependencies.rancher-webhook.app /tmp/tmp.NYKakgbfbj.yaml
+ REPO_RANCHER_WEBHOOK_APP_VERSION=0.3.5
++ yq -e e .kubevirt /tmp/tmp.NYKakgbfbj.yaml
+ REPO_KUBEVIRT_VERSION=0.54.0-150400.3.19.1
++ yq e .minUpgradableVersion /tmp/tmp.NYKakgbfbj.yaml
+ REPO_HARVESTER_MIN_UPGRADABLE_VERSION=v1.1.2
+ '[' -z v1.2.0 ']'
+ '[' -z 1.2.0 ']'
+ '[' -z v1.2.0 ']'
+ '[' -z v1.25.9+rke2r1 ']'
+ '[' -z v2.7.5 ']'
+ '[' -z 102.0.0+up40.1.2 ']'
+ '[' -z 102.0.0+up3.17.10 ']'
+ '[' -z 102.1.0+up0.7.0 ']'
+ '[' -z 0.7.0 ']'
+ '[' -z 102.1.0+up0.7.0 ']'
+ '[' -z 0.7.0 ']'
+ '[' -z 2.0.5+up0.3.5 ']'
+ '[' -z 0.3.5 ']'
+ '[' -z 0.54.0-150400.3.19.1 ']'
++ mktemp --suffix=.yaml
+ CACHED_BUNDLE_METADATA=/tmp/tmp.2zOqKu5naF.yaml
+ download_file http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/bundle/metadata.yaml /tmp/tmp.2zOqKu5naF.yaml
+ local url=http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/bundle/metadata.yaml
+ local output=/tmp/tmp.2zOqKu5naF.yaml
+ echo 'Downloading the file from "http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/bundle/metadata.yaml" to "/tmp/tmp.2zOqKu5naF.yaml"...'
+ local i=0
Downloading the file from "http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/bundle/metadata.yaml" to "/tmp/tmp.2zOqKu5naF.yaml"...
+ [[ 0 -lt 100 ]]
+ curl -sSfL http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/bundle/metadata.yaml -o /tmp/tmp.2zOqKu5naF.yaml
+ break
+ set_max_pods
+ cat
+ wait_rke2_upgrade
++ echo -n v1.25.9+rke2r1
++ sed 's/-rc[[:digit:]]*//g'
+ REPO_RKE2_VERSION_WITHOUT_RC=v1.25.9+rke2r1
++ get_node_rke2_version
++ kubectl get node pre-harvester01-02 -o yaml
++ yq -e e .status.nodeInfo.kubeletVersion -
+ '[' v1.25.9+rke2r1 = v1.25.9+rke2r1 ']'
+ clean_rke2_archives
+ yq -e -o=json e .images.rke2 /tmp/tmp.2zOqKu5naF.yaml
+ jq -r '.[] | [.list, .archive] | @tsv'
+ IFS='	'
+ read -r list archive
++ basename /harvester/images/rke2-images.linux-amd64-v1.25.9-rke2r1.tar.zst
+ rm -f /host/var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64-v1.25.9-rke2r1.tar.zst
++ basename /harvester/images-lists/rke2-images.linux-amd64-v1.25.9-rke2r1.txt
+ rm -f /host/var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64-v1.25.9-rke2r1.txt
+ IFS='	'
+ read -r list archive
++ basename /harvester/images/rke2-images-multus.linux-amd64-v1.25.9-rke2r1.tar.zst
+ rm -f /host/var/lib/rancher/rke2/agent/images/rke2-images-multus.linux-amd64-v1.25.9-rke2r1.tar.zst
++ basename /harvester/images-lists/rke2-images-multus.linux-amd64-v1.25.9-rke2r1.txt
+ rm -f /host/var/lib/rancher/rke2/agent/images/rke2-images-multus.linux-amd64-v1.25.9-rke2r1.txt
+ IFS='	'
+ read -r list archive
+ kubectl taint node pre-harvester01-02 kubevirt.io/drain-
error: taint "kubevirt.io/drain" not found
+ true
+ convert_nodenetwork_to_vlanconfig
+ '[' -z '' ']'
+ detect_upgrade
++ kubectl get upgrades.harvesterhci.io hvst-upgrade-f79bg -n harvester-system -o yaml
+ upgrade_obj='apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  annotations:
    harvesterhci.io/replica-replenishment-wait-interval: "600"
  creationTimestamp: "2023-09-13T09:39:02Z"
  finalizers:
  - wrangler.cattle.io/harvester-upgrade-controller
  generateName: hvst-upgrade-
  generation: 28
  labels:
    harvesterhci.io/latestUpgrade: "true"
    harvesterhci.io/upgradeState: UpgradingNodes
  name: hvst-upgrade-f79bg
  namespace: harvester-system
  resourceVersion: "37124388"
  uid: fb197b26-d8c1-4b42-b7a1-38f02ff6dae8
spec:
  image: ""
  logEnabled: true
  version: v1.2.0
status:
  conditions:
  - status: Unknown
    type: Completed
  - lastUpdateTime: "2023-09-13T09:39:18Z"
    status: "True"
    type: LogReady
  - lastUpdateTime: "2023-09-13T09:40:54Z"
    status: "True"
    type: ImageReady
  - lastUpdateTime: "2023-09-13T09:42:29Z"
    status: "True"
    type: RepoReady
  - lastUpdateTime: "2023-09-13T10:17:16Z"
    status: "True"
    type: NodesPrepared
  - lastUpdateTime: "2023-09-13T10:22:00Z"
    status: "True"
    type: SystemServicesUpgraded
  - status: Unknown
    type: NodesUpgraded
  imageID: harvester-system/harvester-iso-zwxj6
  nodeStatuses:
    pre-harvester01-01:
      state: Succeeded
    pre-harvester01-02:
      state: Post-draining
    pre-harvester01-03:
      state: Images preloaded
    pre-harvester01-04:
      state: Images preloaded
  previousVersion: v1.1.2
  repoInfo: |
    release:
        harvester: v1.2.0
        harvesterChart: 1.2.0
        os: Harvester v1.2.0
        kubernetes: v1.25.9+rke2r1
        rancher: v2.7.5
        monitoringChart: 102.0.0+up40.1.2
        minUpgradableVersion: v1.1.2
  upgradeLog: hvst-upgrade-f79bg-upgradelog'
++ echo 'apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  annotations:
    harvesterhci.io/replica-replenishment-wait-interval: "600"
  creationTimestamp: "2023-09-13T09:39:02Z"
  finalizers:
  - wrangler.cattle.io/harvester-upgrade-controller
  generateName: hvst-upgrade-
  generation: 28
  labels:
    harvesterhci.io/latestUpgrade: "true"
    harvesterhci.io/upgradeState: UpgradingNodes
  name: hvst-upgrade-f79bg
  namespace: harvester-system
  resourceVersion: "37124388"
  uid: fb197b26-d8c1-4b42-b7a1-38f02ff6dae8
spec:
  image: ""
  logEnabled: true
  version: v1.2.0
status:
  conditions:
  - status: Unknown
    type: Completed
  - lastUpdateTime: "2023-09-13T09:39:18Z"
    status: "True"
    type: LogReady
  - lastUpdateTime: "2023-09-13T09:40:54Z"
    status: "True"
    type: ImageReady
  - lastUpdateTime: "2023-09-13T09:42:29Z"
    status: "True"
    type: RepoReady
  - lastUpdateTime: "2023-09-13T10:17:16Z"
    status: "True"
    type: NodesPrepared
  - lastUpdateTime: "2023-09-13T10:22:00Z"
    status: "True"
    type: SystemServicesUpgraded
  - status: Unknown
    type: NodesUpgraded
  imageID: harvester-system/harvester-iso-zwxj6
  nodeStatuses:
    pre-harvester01-01:
      state: Succeeded
    pre-harvester01-02:
      state: Post-draining
    pre-harvester01-03:
      state: Images preloaded
    pre-harvester01-04:
      state: Images preloaded
  previousVersion: v1.1.2
  repoInfo: |
    release:
        harvester: v1.2.0
        harvesterChart: 1.2.0
        os: Harvester v1.2.0
        kubernetes: v1.25.9+rke2r1
        rancher: v2.7.5
        monitoringChart: 102.0.0+up40.1.2
        minUpgradableVersion: v1.1.2
  upgradeLog: hvst-upgrade-f79bg-upgradelog'
++ yq e .status.previousVersion -
+ UPGRADE_PREVIOUS_VERSION=v1.1.2
+ detect_node_current_harvester_version
+ NODE_CURRENT_HARVESTER_VERSION=
+ local harvester_release_file=/host/etc/harvester-release.yaml
+ '[' -f /host/etc/harvester-release.yaml ']'
++ yq e .harvester /host/etc/harvester-release.yaml
NODE_CURRENT_HARVESTER_VERSION is: v1.1.2
The UPGRADE_PREVIOUS_VERSION is v1.1.2, NODE_CURRENT_HARVESTER_VERSION is v1.1.2, will check nodenetwork upgrade node option
+ NODE_CURRENT_HARVESTER_VERSION=v1.1.2
There is nothing to do in nodenetwork
+ echo 'NODE_CURRENT_HARVESTER_VERSION is: v1.1.2'
+ echo 'The UPGRADE_PREVIOUS_VERSION is v1.1.2, NODE_CURRENT_HARVESTER_VERSION is v1.1.2, will check nodenetwork upgrade node option'
+ test v1.1.2 '!=' v1.0.3
+ test v1.1.2 '!=' v1.0.3
+ echo 'There is nothing to do in nodenetwork'
+ return
+ upgrade_os
+ trap clean_up_tmp_files EXIT
++ source /host/etc/os-release
+++ NAME='SLE Micro'
+++ VERSION=5.3
+++ VERSION_ID=5.3
+++ PRETTY_NAME='Harvester v1.1.2'
+++ ID=sle-micro-rancher
+++ ID_LIKE=suse
+++ ANSI_COLOR='0;32'
+++ CPE_NAME=cpe:/o:suse:sle-micro-rancher:5.3
+++ VARIANT=Harvester
+++ VARIANT_ID=Harvester-20230420
+++ GRUB_ENTRY_NAME='Harvester v1.1.2'
++ echo Harvester v1.1.2
+ CURRENT_OS_VERSION='Harvester v1.1.2'
+ '[' 'Harvester v1.2.0' = 'Harvester v1.1.2' ']'
+ '[' -n '' ']'
++ mktemp -p /host/usr/local/upgrade_tmp
+ tmp_rootfs_squashfs=/host/usr/local/upgrade_tmp/tmp.JbyjvexfYr
+ download_file http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/rootfs.squashfs /host/usr/local/upgrade_tmp/tmp.JbyjvexfYr
+ local url=http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/rootfs.squashfs
+ local output=/host/usr/local/upgrade_tmp/tmp.JbyjvexfYr
+ echo 'Downloading the file from "http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/rootfs.squashfs" to "/host/usr/local/upgrade_tmp/tmp.JbyjvexfYr"...'
+ local i=0
+ [[ 0 -lt 100 ]]
Downloading the file from "http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/rootfs.squashfs" to "/host/usr/local/upgrade_tmp/tmp.JbyjvexfYr"...
+ curl -sSfL http://upgrade-repo-hvst-upgrade-f79bg.harvester-system/harvester-iso/rootfs.squashfs -o /host/usr/local/upgrade_tmp/tmp.JbyjvexfYr
+ break
++ mktemp -d -p /host/tmp
+ tmp_rootfs_mount=/host/tmp/tmp.tl898qOpsa
+ mount /host/usr/local/upgrade_tmp/tmp.JbyjvexfYr /host/tmp/tmp.tl898qOpsa
++ mktemp -d -p /host/tmp
+ tmp_elemental_config_dir=/host/tmp/tmp.J0dMMEmbEv
+ cat
+ new_elemental_cli=/usr/local/bin/elemental
+ target_elemental_cli=/host/usr/bin/elemental
++ date +%Y%m%d%H%M%S
+ elemental_upgrade_log=/usr/local/upgrade_tmp/elemental-upgrade-20230919113309.log
+ local ret=0
+ mount --bind /usr/local/bin/elemental /host/usr/bin/elemental
+ chroot /host elemental upgrade --debug --logfile /usr/local/upgrade_tmp/elemental-upgrade-20230919113309.log --directory /tmp/tmp.tl898qOpsa --config-dir /tmp/tmp.J0dMMEmbEv
Flag --directory has been deprecated, 'directory' is deprecated please use 'system' instead
DEBU[2023-09-19T11:33:09Z] Starting elemental version 0.3.1 on commit bab71d15 
INFO[2023-09-19T11:33:09Z] reading configuration form '/tmp/tmp.J0dMMEmbEv' 
DEBU[2023-09-19T11:33:09Z] Full config loaded: &v1.RunConfig{
  Reboot: false,
  PowerOff: false,
  EjectCD: false,
  Config: v1.Config{
    Logger: &v1.logrusWrapper{ // p0
      Logger: &logrus.Logger{
        Out: &io.multiWriter{},
        Hooks: logrus.LevelHooks{},
        Formatter: &logrus.TextFormatter{
          ForceColors: true,
          DisableColors: false,
          ForceQuote: false,
          DisableQuote: false,
          EnvironmentOverrideColors: false,
          DisableTimestamp: false,
          FullTimestamp: true,
          TimestampFormat: "",
          DisableSorting: false,
          SortingFunc: ,
          DisableLevelTruncation: false,
          PadLevelText: false,
          QuoteEmptyFields: false,
          FieldMap: logrus.FieldMap(nil),
          CallerPrettyfier: ,
        },
        ReportCaller: false,
        Level: 5,
        ExitFunc: os.Exit,
        BufferPool: nil,
      },
    },
    Fs: &vfs.osfs{}, // p1
    Mounter: &mount.Mounter{},
    Runner: &v1.RealRunner{ // p2
      Logger: p0,
    },
    Syscall: &v1.RealSyscall{},
    CloudInitRunner: &cloudinit.YipCloudInitRunner{},
    ImageExtractor: v1.OCIImageExtractor{},
    Client: &http.Client{},
    Platform: &v1.Platform{
      OS: "linux",
      Arch: "x86_64",
      GolangArch: "amd64",
    },
    Cosign: false,
    Verify: false,
    CosignPubKey: "",
    LocalImage: false,
    Arch: "",
    SquashFsCompressionConfig: []string{
      "-comp",
      "xz",
      "-Xbcj",
      "x86",
    },
    SquashFsNoCompression: false,
    CloudInitPaths: []string{
      "/system/oem",
      "/oem/",
      "/usr/local/cloud-config/",
    },
    Strict: false,
  },
} 
DEBU[2023-09-19T11:33:09Z] Loaded upgrade UpgradeSpec: &v1.UpgradeSpec{
  RecoveryUpgrade: false,
  Active: v1.Image{
    File: "",
    Label: "",
    Size: 3072,
    FS: "",
    Source: &v1.ImageSource{},
    MountPoint: "",
    LoopDevice: "",
  },
  Recovery: v1.Image{
    File: "/run/cos/recovery/cOS/transition.img",
    Label: "COS_SYSTEM",
    Size: 0,
    FS: "ext2",
    Source: &v1.ImageSource{},
    MountPoint: "/run/cos/transition",
    LoopDevice: "",
  },
  GrubDefEntry: "",
  Passive: v1.Image{
    File: "",
    Label: "",
    Size: 0,
    FS: "",
    Source: nil,
    MountPoint: "",
    LoopDevice: "",
  },
  Partitions: v1.ElementalPartitions{
    BIOS: nil,
    EFI: nil,
    OEM: &v1.Partition{
      Name: "p.oem",
      FilesystemLabel: "COS_OEM",
      Size: 64,
      FS: "ext4",
      Flags: nil,
      MountPoint: "/oem",
      Path: "/dev/sdd2",
      Disk: "/dev/sdd",
    },
    Recovery: &v1.Partition{
      Name: "p.recovery",
      FilesystemLabel: "COS_RECOVERY",
      Size: 8192,
      FS: "ext4",
      Flags: nil,
      MountPoint: "/run/cos/recovery",
      Path: "/dev/sdd4",
      Disk: "/dev/sdd",
    },
    State: nil,
    Persistent: &v1.Partition{
      Name: "p.persistent",
      FilesystemLabel: "COS_PERSISTENT",
      Size: 97213,
      FS: "ext4",
      Flags: nil,
      MountPoint: "/etc/systemd",
      Path: "/dev/sdd5",
      Disk: "/dev/sdd",
    },
  },
  State: &v1.InstallState{
    Date: "2023-09-13T08:43:10Z",
    Partitions: map[string]*v1.PartitionState{
      "state": &v1.PartitionState{
        FSLabel: "",
        Images: map[string]*v1.ImageState{
          "active": &v1.ImageState{
            Source: &v1.ImageSource{},
            SourceMetadata: nil,
            Label: "COS_ACTIVE",
            FS: "ext2",
          },
          "passive": nil,
        },
      },
    },
  },
} 
ERRO[2023-09-19T11:33:09Z] Invalid upgrade command setup undefined state partition 
+ ret=33
+ '[' 33 '!=' 0 ']'
elemental upgrade failed with return code: 33
+ echo 'elemental upgrade failed with return code: 33'
+ cat /host/usr/local/upgrade_tmp/elemental-upgrade-20230919113309.log
DEBU[2023-09-19T11:33:09Z] Starting elemental version 0.3.1 on commit bab71d15 
INFO[2023-09-19T11:33:09Z] reading configuration form '/tmp/tmp.J0dMMEmbEv' 
DEBU[2023-09-19T11:33:09Z] Full config loaded: &v1.RunConfig{
  Reboot: false,
  PowerOff: false,
  EjectCD: false,
  Config: v1.Config{
    Logger: &v1.logrusWrapper{ // p0
      Logger: &logrus.Logger{
        Out: &io.multiWriter{},
        Hooks: logrus.LevelHooks{},
        Formatter: &logrus.TextFormatter{
          ForceColors: true,
          DisableColors: false,
          ForceQuote: false,
          DisableQuote: false,
          EnvironmentOverrideColors: false,
          DisableTimestamp: false,
          FullTimestamp: true,
+ exit 33
+ clean_up_tmp_files
          TimestampFormat: "",
          DisableSorting: false,
          SortingFunc: ,
          DisableLevelTruncation: false,
          PadLevelText: false,
          QuoteEmptyFields: false,
          FieldMap: logrus.FieldMap(nil),
          CallerPrettyfier: ,
        },
        ReportCaller: false,
        Level: 5,
        ExitFunc: os.Exit,
        BufferPool: nil,
+ '[' -n /host/tmp/tmp.tl898qOpsa ']'
      },
+ echo 'Try to unmount /host/tmp/tmp.tl898qOpsa...'
    },
+ umount /host/tmp/tmp.tl898qOpsa
    Fs: &vfs.osfs{}, // p1
    Mounter: &mount.Mounter{},
    Runner: &v1.RealRunner{ // p2
      Logger: p0,
    },
    Syscall: &v1.RealSyscall{},
    CloudInitRunner: &cloudinit.YipCloudInitRunner{},
    ImageExtractor: v1.OCIImageExtractor{},
    Client: &http.Client{},
    Platform: &v1.Platform{
      OS: "linux",
      Arch: "x86_64",
      GolangArch: "amd64",
    },
    Cosign: false,
    Verify: false,
    CosignPubKey: "",
    LocalImage: false,
    Arch: "",
    SquashFsCompressionConfig: []string{
      "-comp",
      "xz",
      "-Xbcj",
      "x86",
    },
    SquashFsNoCompression: false,
    CloudInitPaths: []string{
      "/system/oem",
      "/oem/",
      "/usr/local/cloud-config/",
    },
    Strict: false,
  },
} 
DEBU[2023-09-19T11:33:09Z] Loaded upgrade UpgradeSpec: &v1.UpgradeSpec{
  RecoveryUpgrade: false,
  Active: v1.Image{
    File: "",
    Label: "",
    Size: 3072,
    FS: "",
    Source: &v1.ImageSource{},
    MountPoint: "",
    LoopDevice: "",
  },
  Recovery: v1.Image{
    File: "/run/cos/recovery/cOS/transition.img",
    Label: "COS_SYSTEM",
    Size: 0,
    FS: "ext2",
    Source: &v1.ImageSource{},
    MountPoint: "/run/cos/transition",
    LoopDevice: "",
  },
  GrubDefEntry: "",
  Passive: v1.Image{
    File: "",
    Label: "",
    Size: 0,
    FS: "",
    Source: nil,
    MountPoint: "",
    LoopDevice: "",
  },
  Partitions: v1.ElementalPartitions{
    BIOS: nil,
    EFI: nil,
    OEM: &v1.Partition{
      Name: "p.oem",
      FilesystemLabel: "COS_OEM",
      Size: 64,
      FS: "ext4",
      Flags: nil,
      MountPoint: "/oem",
      Path: "/dev/sdd2",
      Disk: "/dev/sdd",
    },
    Recovery: &v1.Partition{
      Name: "p.recovery",
      FilesystemLabel: "COS_RECOVERY",
      Size: 8192,
      FS: "ext4",
      Flags: nil,
      MountPoint: "/run/cos/recovery",
      Path: "/dev/sdd4",
      Disk: "/dev/sdd",
    },
    State: nil,
    Persistent: &v1.Partition{
      Name: "p.persistent",
      FilesystemLabel: "COS_PERSISTENT",
      Size: 97213,
      FS: "ext4",
      Flags: nil,
      MountPoint: "/etc/systemd",
      Path: "/dev/sdd5",
      Disk: "/dev/sdd",
    },
  },
  State: &v1.InstallState{
    Date: "2023-09-13T08:43:10Z",
    Partitions: map[string]*v1.PartitionState{
      "state": &v1.PartitionState{
        FSLabel: "",
        Images: map[string]*v1.ImageState{
          "active": &v1.ImageState{
            Source: &v1.ImageSource{},
            SourceMetadata: nil,
            Label: "COS_ACTIVE",
            FS: "ext2",
          },
          "passive": nil,
        },
      },
    },
  },
} 
ERRO[2023-09-19T11:33:09Z] Invalid upgrade command setup undefined state partition 
Try to unmount /host/tmp/tmp.tl898qOpsa...
+ '[' -n /host/usr/bin/elemental ']'
+ echo 'Try to unmount /host/usr/bin/elemental...'
Try to unmount /host/usr/bin/elemental...
+ umount /host/usr/bin/elemental
Clean up tmp files...
Try to remove /host/usr/local/upgrade_tmp/tmp.JbyjvexfYr...
+ echo 'Clean up tmp files...'
+ '[' -n '' ']'
+ '[' -n /host/usr/local/upgrade_tmp/tmp.JbyjvexfYr ']'
+ echo 'Try to remove /host/usr/local/upgrade_tmp/tmp.JbyjvexfYr...'
+ rm -vf /host/usr/local/upgrade_tmp/tmp.JbyjvexfYr
removed '/host/usr/local/upgrade_tmp/tmp.JbyjvexfYr'
[Tue Sep 19 11:33:09 UTC 2023] Running "upgrade_node.sh post-drain" errors, will retry after 10 minutes (2 retries)...

Adding label: COS_STATE to /run/initramfs/cos-state/state.yaml solved this particular issue.

EDIT: Harvester v1.1.2 upgrade to v1.2.0 completed successfully.

Hi @egrist, I regenerated your post-drain YAML file as follows:

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    harvesterhci.io/node: pre-harvester01-01
    harvesterhci.io/upgrade: hvst-upgrade-f79bg
    harvesterhci.io/upgradeComponent: node
    harvesterhci.io/upgradeJobType: post-drain
  name: hvst-upgrade-f79bg-post-drain-pre-harvester01-01
  namespace: harvester-system
  ownerReferences:
  - apiVersion: harvesterhci.io/v1beta1
    kind: Upgrade
    name: hvst-upgrade-f79bg
    uid: fb197b26-d8c1-4b42-b7a1-38f02ff6dae8
spec:
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  suspend: false
  template:
    metadata:
      creationTimestamp: "null"
      labels:
        harvesterhci.io/upgrade: hvst-upgrade-f79bg
        harvesterhci.io/upgradeComponent: node
        harvesterhci.io/upgradeJobType: post-drain
        job-name: hvst-upgrade-f79bg-post-drain-pre-harvester01-01
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - pre-harvester01-01
      containers:
      - args:
        - post-drain
        command:
        - do_upgrade_node.sh
        env:
        - name: HARVESTER_UPGRADE_NAME
          value: hvst-upgrade-f79bg
        - name: HARVESTER_UPGRADE_NODE_NAME
          value: pre-harvester01-01
        - name: HARVESTER_UPGRADE_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: rancher/harvester-upgrade:v1.2.0
        imagePullPolicy: IfNotPresent
        name: apply
        resources: {}
        securityContext:
          capabilities:
            add:
            - CAP_SYS_BOOT
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host
          name: host-root
      dnsPolicy: ClusterFirstWithHostNet
      hostIPC: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: harvester
      serviceAccountName: harvester
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node.kubernetes.io/unschedulable
        operator: Exists
      - effect: NoExecute
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoSchedule
        key: kubevirt.io/drain
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
      - effect: NoSchedule
        key: kubernetes.io/arch
        operator: Equal
        value: amd64
      - effect: NoSchedule
        key: kubernetes.io/arch
        operator: Equal
        value: arm64
      - effect: NoSchedule
        key: kubernetes.io/arch
        operator: Equal
        value: arm
      volumes:
      - hostPath:
          path: /
          type: Directory
        name: host-root
  ttlSecondsAfterFinished: 604800

Please use kubectl apply to re-apply it.
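
For example (the file name is just illustrative; save the manifest above to any local file first):

kubectl apply -f hvst-upgrade-f79bg-post-drain-pre-harvester01-01.yaml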

Then please help to add --debug to the elemental command (please refer to the following steps).

You can find the related pod with the following command:

$ kubectl get pod --selector=harvesterhci.io/upgradeJobType=post-drain -n harvester-system

# you will see the pod of your post-drain job; it should be named `hvst-upgrade-f79bg-post-drain-pre-harvester01-01` plus a random suffix

Then exec into this pod:

$ kubectl exec -it pods/<the above name> -n harvester-system -- /bin/bash

Then modify the bash script /usr/local/bin/upgrade_node.sh as follows:

chroot $HOST_DIR elemental upgrade \   <-- replace this line with the following

chroot $HOST_DIR elemental upgrade --debug \

This job is retried every 10 minutes on failure, so just wait for the next run. Thanks!
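
To capture the debug output once the next retry starts, you can follow the pod log (the pod name is a placeholder here; it will include a random suffix as shown earlier in this thread):

kubectl logs -f -n harvester-system <post-drain pod name>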

cc @bk201

@davidcassany thanks. When any information is needed, @egrist will help to get it from his server.

The Harvester team is now trying a workaround to continue the stuck upgrade.

I probably missed it somewhere in the discussion, but I am wondering about the firmware and the partition table type. Could this error be raised when using MS-DOS partition tables instead of GPT? The expectation from elemental-toolkit is that MS-DOS partition tables are used in a BIOS context only.

The expectation with GPT partition tables is that even if the filesystem label data is somehow lost, partitions can still be found by the partition label, which is not customizable and is therefore always an elemental-hardcoded value.

@egrist I am calling in help from the elemental-toolkit team. They will join.

@egrist: please try to edit this yaml, add label: COS_STATE under state (same level as active), and kill the currently pending pod:

pre-harvester01-01:~ # cat /run/initramfs/cos-state/state.yaml
# Autogenerated file by elemental client, do not edit

date: "2023-09-13T08:31:42Z"
state:
    label: COS_STATE /// ====> add to here
    active:
        source: dir:///tmp/tmp.01deNrXNEC
        label: COS_ACTIVE
        fs: ext2
    passive: null