harvester: [BUG] Upgrade from v1.0.3 to v1.1.0 stuck at "Upgrade system services"

Describe the bug

While upgrading from v1.0.3 to v1.1.0, the upgrade got stuck at “Upgrade system services” for a long time (about 3 hours).

Here is my upgrade status:

apiVersion: v1
items:
- apiVersion: harvesterhci.io/v1beta1
  kind: Upgrade
  metadata:
    creationTimestamp: "2022-10-26T07:05:45Z"
    finalizers:
    - wrangler.cattle.io/harvester-upgrade-controller
    generateName: hvst-upgrade-
    generation: 14
    labels:
      harvesterhci.io/latestUpgrade: "true"
      harvesterhci.io/upgradeState: UpgradingSystemServices
    name: hvst-upgrade-fnlkm
    namespace: harvester-system
    resourceVersion: "32750759"
    uid: b96a9405-07b0-4705-b3fc-004155958201
  spec:
    image: ""
    version: v1.1.0
  status:
    conditions:
    - status: Unknown
      type: Completed
    - lastUpdateTime: "2022-10-26T07:11:43Z"
      status: "True"
      type: ImageReady
    - lastUpdateTime: "2022-10-26T07:15:30Z"
      status: "True"
      type: RepoReady
    - lastUpdateTime: "2022-10-26T07:15:50Z"
      status: "True"
      type: NodesPrepared
    - status: Unknown
      type: SystemServicesUpgraded
    imageID: harvester-system/harvester-iso-ncqn7
    nodeStatuses:
      harvester0:
        state: Images preloaded
      harvester1:
        state: Images preloaded
      harvester2:
        state: Images preloaded
    previousVersion: v1.0.3
    repoInfo: |
      release:
          harvester: v1.1.0
          harvesterChart: 1.1.0
          os: Harvester v1.1.0
          kubernetes: v1.24.7+rke2r1
          rancher: v2.6.9
          monitoringChart: 100.1.0+up19.0.3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
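
To see at a glance which phase is blocking, the conditions in a status like the one above can be filtered for anything that is not "True". Below is a minimal Python sketch, assuming the Upgrade object has been exported as JSON (for example with `kubectl get upgrades.harvesterhci.io -n harvester-system -o json`, which should return the same List in JSON form). The function name `pending_conditions` is hypothetical, and the sample data is a trimmed copy of the status above.

```python
import json

# Trimmed copy of the status above, in the shape that `-o json` would produce.
SAMPLE = """
{
  "kind": "List",
  "items": [
    {
      "metadata": {"name": "hvst-upgrade-fnlkm"},
      "status": {
        "conditions": [
          {"status": "Unknown", "type": "Completed"},
          {"status": "True", "type": "ImageReady"},
          {"status": "True", "type": "RepoReady"},
          {"status": "True", "type": "NodesPrepared"},
          {"status": "Unknown", "type": "SystemServicesUpgraded"}
        ]
      }
    }
  ]
}
"""


def pending_conditions(upgrade_list):
    """Map each Upgrade name to the condition types whose status is not "True"."""
    result = {}
    for item in upgrade_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        result[item["metadata"]["name"]] = [
            c["type"] for c in conditions if c.get("status") != "True"
        ]
    return result


print(pending_conditions(json.loads(SAMPLE)))
```

For the status above this reports `Completed` and `SystemServicesUpgraded` as still pending, which matches the `UpgradingSystemServices` label on the object.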

Here is my log of jobs/hvst-upgrade-fnlkm-apply-manifests:

harvester: v1.1.0
harvesterChart: 1.1.0
os: Harvester v1.1.0
kubernetes: v1.24.7+rke2r1
rancher: v2.6.9
monitoringChart: 100.1.0+up19.0.3
loggingChart: 100.1.3+up3.17.7
kubevirt: 0.54.0-150400.3.3.2
minUpgradableVersion: 'v1.0.3'
rancherDependencies:
  fleet:
    chart: 100.1.0+up0.4.0
    app: 0.4.0
  fleet-crd:
    chart: 100.1.0+up0.4.0
    app: 0.4.0
  rancher-webhook:
    chart: 1.0.6+up0.2.7
    app: 0.2.7
Current version: 1.0.3
Minimum upgradable version: 1.0.3
Current version is supported.
Executing v1.0.3 pre-hook...
Remove rke.cattle.io/init-node-machine-id label in fleet-local/local cluster
label "rke.cattle.io/init-node-machine-id" not found.
cluster.provisioning.cattle.io/local not labeled
managedchart.management.cattle.io/harvester patched (no change)
managedchart.management.cattle.io/harvester-crd patched (no change)
managedchart.management.cattle.io/rancher-monitoring patched (no change)
managedchart.management.cattle.io/rancher-monitoring-crd patched (no change)
Upgrading Rancher
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   117  100   117    0     0  16714      0 --:--:-- --:--:-- --:--:-- 16714
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 57.9M  100 57.9M    0     0  29.1M      0  0:00:01  0:00:01 --:--:-- 29.1M
time="2022-10-26T07:15:56Z" level=info msg="Extract mapping / => /tmp/upgrade/rancher"
time="2022-10-26T07:15:56Z" level=info msg="Checking local image archives in /tmp/upgrade/images for index.docker.io/rancher/system-agent-installer-rancher:v2.6.9"
time="2022-10-26T07:15:57Z" level=info msg="Extracting file run.sh to /tmp/upgrade/rancher/run.sh"
time="2022-10-26T07:15:58Z" level=info msg="Extracting file rancher-2.6.9.tgz to /tmp/upgrade/rancher/rancher-2.6.9.tgz"
time="2022-10-26T07:15:58Z" level=info msg="Extracting file helm to /tmp/upgrade/rancher/helm"
Rancher values:
antiAffinity: required
bootstrapPassword: admin
features: multi-cluster-management=false,multi-cluster-management-agent=false
hostPort: 0
ingress:
  enabled: false
noDefaultAdmin: false
rancherImage: rancher/rancher
rancherImagePullPolicy: IfNotPresent
rancherImageTag: v2.6.4-harvester3
replicas: -3
systemDefaultRegistry: ""
tls: external
useBundledSystemChart: true

I also tried some of the commands from https://docs.harvesterhci.io/v1.1/upgrade/previous-releases/v1-0-0-to-v1-0-1#stuck-in-upgrading-system-service, but they did not work for me.

To Reproduce

Steps to reproduce the behavior:

  1. Click Upgrade in the Harvester UI

Support bundle

supportbundle_6200f644-e4a1-40c1-bf7c-070ccbcb556d_2022-10-26T08-00-43Z.zip

Environment

  • Harvester ISO version: v1.0.3
  • Underlying Infrastructure: Baremetal with PowerEdge R740

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 26 (9 by maintainers)

Most upvoted comments

@Martin-Weiss Your cluster is in a similar situation, except that Rancher is waiting for a node to be uncordoned:

2022-10-26T09:36:09.399636200Z 2022/10/26 09:36:09 [INFO] [planner] rkecluster fleet-local/local: waiting: uncordoning bootstrap node(s) custom-18f4ec0eb29e: waiting for uncordon to finish
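
When scanning a long Rancher log for lines like the one above, it can help to pull out only the planner's "waiting" reason. Below is a rough Python sketch using a regular expression written against the single line quoted here. The pattern and the name `planner_waits` are assumptions made for illustration, not a documented log format.

```python
import re

# Pattern inferred from the one log line quoted above; not a documented format.
PLANNER_WAIT = re.compile(
    r"\[planner\] rkecluster (?P<cluster>\S+): waiting: (?P<reason>.+)$"
)


def planner_waits(lines):
    """Yield (cluster, reason) for each planner line that reports waiting."""
    for line in lines:
        match = PLANNER_WAIT.search(line.rstrip("\n"))
        if match:
            yield match.group("cluster"), match.group("reason")


LINE = (
    "2022-10-26T09:36:09.399636200Z 2022/10/26 09:36:09 [INFO] [planner] "
    "rkecluster fleet-local/local: waiting: uncordoning bootstrap node(s) "
    "custom-18f4ec0eb29e: waiting for uncordon to finish"
)

for cluster, reason in planner_waits([LINE]):
    print(cluster, "->", reason)
```

Lines that do not match the pattern are skipped, so the whole log can be passed in line by line.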

I noticed that the node harvester2 became “cordoned”, but I cannot say why.

NAME         STATUS                     ROLES                       AGE    VERSION
harvester1   Ready                      control-plane,etcd,master   118d   v1.22.12+rke2r1
harvester2   Ready,SchedulingDisabled   control-plane,etcd,master   118d   v1.22.12+rke2r1
harvester3   Ready                      control-plane,etcd,master   118d   v1.22.12+rke2r1
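
Because the node was reported as being cordoned again later, it may be useful to check the node table repeatedly rather than by eye. Below is a small Python sketch, assuming the plain-text output of `kubectl get nodes` is available as a string. The helper name `cordoned_nodes` is hypothetical, and the sample is the table above.

```python
def cordoned_nodes(table_text):
    """Return node names whose STATUS column contains SchedulingDisabled."""
    names = []
    for line in table_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and "SchedulingDisabled" in fields[1].split(","):
            names.append(fields[0])
    return names


TABLE = """\
NAME         STATUS                     ROLES                       AGE    VERSION
harvester1   Ready                      control-plane,etcd,master   118d   v1.22.12+rke2r1
harvester2   Ready,SchedulingDisabled   control-plane,etcd,master   118d   v1.22.12+rke2r1
harvester3   Ready                      control-plane,etcd,master   118d   v1.22.12+rke2r1
"""

print(cordoned_nodes(TABLE))
```

For the table above this prints `['harvester2']`.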

I uncordoned it manually, but after some time it was cordoned again.

The cluster was upgraded from v1.0.0 -> v1.0.1 -> v1.0.2 -> v1.0.3 and then to v1.1.0. The Upgrade resources look good, but did you encounter any failures before?

Not that I can remember.

Please dump the machine plan secrets and DM them to me if possible (secrets are not captured in the support bundles):

kubectl get secret -n fleet-local --field-selector type=rke.cattle.io/machine-plan -o yaml
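
Kubernetes stores Secret values base64-encoded under `data`, so the dump from the command above is hard to read directly. Below is a minimal Python sketch for inspecting it locally, assuming the same command is run with `-o json` instead of `-o yaml`. It decodes every key generically rather than assuming particular key names. The helper name `decode_secret_data` and the sample secret (name, key, and content) are invented for illustration.

```python
import base64
import json


def decode_secret_data(secret_list):
    """Return {secret name: {key: decoded text}} for a `kubectl get secret -o json` List."""
    decoded = {}
    for item in secret_list.get("items", []):
        data = item.get("data") or {}
        decoded[item["metadata"]["name"]] = {
            # errors="replace" keeps the sketch from failing on non-UTF-8 values
            key: base64.b64decode(value).decode("utf-8", errors="replace")
            for key, value in data.items()
        }
    return decoded


# Hypothetical sample: name, key, and content are invented for illustration.
SAMPLE = {
    "kind": "List",
    "items": [
        {
            "metadata": {"name": "example-machine-plan", "namespace": "fleet-local"},
            "type": "rke.cattle.io/machine-plan",
            "data": {
                "example-key": base64.b64encode(b'{"hello": "world"}').decode("ascii")
            },
        }
    ],
}

print(json.dumps(decode_secret_data(SAMPLE), indent=2))
```

The decoded content may include sensitive values, which is presumably why it was requested by DM rather than posted in the thread.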

DM sent…