harvester: [BUG] Upgrade stuck in upgrading first node: Job was active longer than specified deadline

Describe the bug

When upgrading from v1.0.3 GA to v1.1.0-rc2, the upgrade is stuck at the stage of upgrading the first node (management role) with the error: Job was active longer than specified deadline


To Reproduce

Steps to reproduce the behavior:

  1. Prepare a 4-node Harvester cluster running v1.0.3 GA
  2. Prepare the offline download image for the upgrade (see the sketch after these steps)
  3. Upgrade from v1.0.3 GA to v1.1.0-rc2
  4. The upgrade is stuck while upgrading the first node and cannot proceed
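For context, an offline upgrade is normally triggered by creating a Version resource that points at a locally hosted ISO. The sketch below is illustrative only; the URL and release date are placeholders, not values taken from this report:

$ cat > /tmp/version.yaml <<EOF
apiVersion: harvesterhci.io/v1beta1
kind: Version
metadata:
  name: v1.1.0-rc2
  namespace: harvester-system
spec:
  # Placeholder: point isoURL at the ISO hosted inside your own network.
  isoURL: http://<local-http-server>/harvester-v1.1.0-rc2-amd64.iso
  releaseDate: "20221007"
EOF

$ kubectl apply -f /tmp/version.yaml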

Expected behavior

Should be able to upgrade all nodes and complete all upgrade stages to v1.1.0-rc2

Support bundle

supportbundle_1e1d02cc-9096-4ccd-9b5a-97bbc818a692_2022-10-07T08-25-38Z.zip

Environment

  • Harvester ISO version: v1.1.0-rc2
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 4-node Harvester cluster on bare-metal machines

Additional context

  • Environment setup before the upgrade:

  • 3 VMs (screenshots)

  • Node status (screenshot)

  • 3 Backup status (screenshot)

  • 3 Volume status (screenshot)

  • 1 Network VLAN (1216) status (screenshot)

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

The cause is that the preload job takes more than 15 minutes to load images (we use a SUC plan to spawn jobs, and its default timeout is 15 minutes). In this release we have more images to preload. I’ve checked David’s two environments: one has a SAS HDD and the other one has an SMR HDD. Neither can survive the image preload stage.
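One way to confirm the deadline hit on a slow node is to look at the jobs SUC spawns for the plan. This is only a sketch; it assumes the plan jobs live in cattle-system and carry the upgrade.cattle.io/plan label:

$ kubectl -n cattle-system get jobs -l upgrade.cattle.io/plan \
    -o custom-columns=NAME:.metadata.name,DEADLINE:.spec.activeDeadlineSeconds,SUCCEEDED:.status.succeeded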

containerd.log

time="2022-10-11T03:19:23.951938936Z" level=info msg="StartContainer for \"d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182\" returns successfully"

<...
time="2022-10-11T03:32:12.900717689Z" level=info msg="ImageCreate event &ImageCreate{Name:sha256:57070fe87937e599433c74eeea6b8be7f804dc121d2f42960e39a9923cf6ec15,Labels:map[string]string{io
.cri-containerd.image: managed,},XXX_unrecognized:[],}"
time="2022-10-11T03:32:12.901859283Z" level=info msg="ImageUpdate event &ImageUpdate{Name:docker.io/rancher/harvester-cluster-repo:v1.1.0-rc2,Labels:map[string]string{io.cri-containerd.imag
e: managed,},XXX_unrecognized:[],}"
time="2022-10-11T03:34:20.046354983Z" level=info msg="StopContainer for \"d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182\" with timeout 30 (s)"
time="2022-10-11T03:34:20.047112338Z" level=info msg="Stop container \"d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182\" with signal terminated"
time="2022-10-11T03:34:20.621575157Z" level=info msg="StopContainer for \"19c87bd3460f813216db8f2e6e7bdd1c4a776ddf6ff62d3cb990dd2f36f10222\" with timeout 60 (s)"
time="2022-10-11T03:34:20.622383185Z" level=info msg="Stop container \"19c87bd3460f813216db8f2e6e7bdd1c4a776ddf6ff62d3cb990dd2f36f10222\" with signal terminated"
time="2022-10-11T03:34:23.513366732Z" level=error msg="(*service).Write failed" error="rpc error: code = Canceled desc = context canceled" ref=tar-e9606b43b68a94da4bb940d9b519a2691e8fd2569e
823811fc3555da41c972b0/layer.tar total=94569472
time="2022-10-11T03:34:28.033964681Z" level=error msg="collecting metrics for d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182" error="cgroups: cgroup deleted: unknown"
time="2022-10-11T03:34:30.107030186Z" level=error msg="failed to handle container TaskExit event &TaskExit{ContainerID:d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182,ID:d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182,Pid:16704,ExitStatus:143,ExitedAt:2022-10-11 03:34:20.105466463 +0000 UTC,XXX_unrecognized:[],}" error="failed to stop container: context deadline exceeded: unknown"
time="2022-10-11T03:34:31.632701731Z" level=info msg="TaskExit event &TaskExit{ContainerID:d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182,ID:d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182,Pid:16704,ExitStatus:143,ExitedAt:2022-10-11 03:34:20.105466463 +0000 UTC,XXX_unrecognized:[],}"
time="2022-10-11T03:34:33.633018154Z" level=error msg="get state for d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182" error="context deadline exceeded: unknown"

Ideally, the job should retry, but we mark the upgrade as a failure and delete the upgrade repo VM after the first failure. There are some ways to work around this on slow disks:

  • Increase the job deadline before upgrading:
    $ cat > /tmp/fix.yaml <<EOF
    spec:
      values:
        systemUpgradeJobActiveDeadlineSeconds: "3600"
    EOF
    
    $ kubectl patch managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local --patch-file=/tmp/fix.yaml --type merge
    $ kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
    
  • Delete the upgrade and start over again (this might take several attempts). Preload time will be reduced because some images are already present.
  • Manually preload images from the ISO on each node first.
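For the last workaround, a rough sketch of manually importing the bundled images from a mounted ISO is below. The archive path, compression format, and containerd socket location are assumptions that can differ between releases, so treat it as illustrative only:

$ mount -o loop harvester-v1.1.0-rc2-amd64.iso /mnt
$ for archive in /mnt/bundle/harvester/images/*.tar.zst; do
    # Decompress each bundled archive and import it into containerd's k8s.io namespace.
    zstd -d -c "$archive" | ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images import -
  done
$ umount /mnt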

Reran the upgrade with the workaround stated in https://github.com/harvester/harvester/issues/2894#issuecomment-1274069690

By preparing all Harvester cluster nodes on SSD drives and increasing the job deadline before clicking the upgrade button on the dashboard:

$ cat > /tmp/fix.yaml <<EOF
spec:
  values:
    systemUpgradeJobActiveDeadlineSeconds: "3600"
EOF

$ kubectl patch managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local --patch-file=/tmp/fix.yaml --type merge
$ kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
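To confirm the patch took effect before starting the upgrade, a quick check against the same managed chart (it should print 3600):

$ kubectl -n fleet-local get managedcharts.management.cattle.io local-managed-system-upgrade-controller \
    -o jsonpath='{.spec.values.systemUpgradeJobActiveDeadlineSeconds}'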
  • We can complete the image preload stage (screenshot)

  • Proceed to upgrade the nodes (screenshot)

  • And successfully upgrade to v1.1.0-rc2 (screenshot)

@khushboo-rancher, I tried running the upgrade to v1.1.0-rc3 on a 3-node Harvester cluster with all nodes installed on SSD disks, without setting the job-deadline workaround from https://github.com/harvester/harvester/issues/2894#issuecomment-1274069690. In this trial we could still upgrade successfully.

Thanks @connorkuehl. Upstream PR merged: https://github.com/rancher/system-upgrade-controller/pull/244. We also need to check whether the SUC change will make it into Rancher 2.7.5; if not, we probably need to wait for the next milestone.

For now, we will add a suggestion for HDD users to the upgrade doc first.

Thanks Khushboo, my issue has been fixed by changing the requested memory value for the Prometheus Node Exporter.

Error from "kubectl get bundles -A":

fleet-local   mcc-rancher-monitoring       0/1   ErrApplied(1) [Cluster fleet-local/local: cannot patch "rancher-monitoring-prometheus-node-exporter" with kind DaemonSet: DaemonSet.apps "rancher-monitoring-prometheus-node-exporter" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "10Gi": must be less than or equal to memory limit]
fleet-local   mcc-rancher-monitoring-crd   1/1

  1. We edited the Prometheus Node Exporter requested memory value from 10Gi to 20Mi and it worked well. The upgrade continued after this change.
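For illustration, one way to apply such a change is to patch the rancher-monitoring managed chart values. The chart name and values path below are assumptions based on the usual rancher-monitoring subchart layout, not taken from this report, so verify them against your cluster first:

$ cat > /tmp/node-exporter-fix.yaml <<EOF
spec:
  values:
    prometheus-node-exporter:
      resources:
        requests:
          memory: 20Mi  # keep the request at or below the configured memory limit
EOF

$ kubectl patch managedcharts.management.cattle.io rancher-monitoring --namespace fleet-local --patch-file=/tmp/node-exporter-fix.yaml --type merge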

@TachunLin @bk201 thanks for sharing the workaround. I recently ran into a similar situation in my homelab where this same thing occurred, but the workaround seems to have fixed it. Should we mention this workaround in the release notes for v1.1.0? I was running a virtualized instance of Harvester, not on bare metal, using a /vda-based mount point for the virtio drive; the underlying hardware, however, was a SATA3-based SSD on my 1U SuperMicro: https://www.supermicro.com/products/archive/motherboard/x9drd-it_ (using 1 of the 2 SATA 3 ports).

@khushboo-rancher

The failure is due to storage and network read/write speed. The upgrade tries to spread images to all nodes, which puts a big load on storage and network, and storage may be the main bottleneck. In our local test on a KVM-based cluster, it seems OK.
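As a quick gauge of whether the disk is the bottleneck before retrying, a rough sequential-write check such as the following can help. The target path is an assumption; pick a directory on the disk that backs the node's storage:

$ dd if=/dev/zero of=/var/lib/harvester/ddtest bs=1M count=2048 oflag=direct status=progress
$ rm /var/lib/harvester/ddtest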

The workaround will be to test it on newer/faster servers.

cc @bk201 @starbops