harvester: [BUG] Upgrade stuck in upgrading first node: Job was active longer than specified deadline

Describe the bug

When upgrading from v1.0.3 GA to v1.1.0-rc2, the upgrade is stuck at the stage of upgrading the first node (management role) with the error: Job was active longer than specified deadline


To Reproduce

Steps to reproduce the behavior:

  1. Prepare a 4-node Harvester cluster running v1.0.3 GA
  2. Prepare the offline download image for the upgrade (see the sketch after these steps)
  3. Upgrade from v1.0.3 GA to v1.1.0-rc2
  4. The upgrade is stuck while upgrading the first node and cannot proceed
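For context, an offline upgrade is normally triggered by creating a Version resource that points at a locally hosted ISO. The sketch below is illustrative only; the URL and release date are placeholders, not values taken from this report:

$ cat > /tmp/version.yaml <<EOF
apiVersion: harvesterhci.io/v1beta1
kind: Version
metadata:
  name: v1.1.0-rc2
  namespace: harvester-system
spec:
  # Placeholder: point isoURL at the ISO hosted inside your own network.
  isoURL: http://<local-http-server>/harvester-v1.1.0-rc2-amd64.iso
  releaseDate: "20221007"
EOF

$ kubectl apply -f /tmp/version.yaml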

Expected behavior

Should be able to upgrade all nodes and complete all upgrade stages to v1.1.0-rc2

Support bundle

supportbundle_1e1d02cc-9096-4ccd-9b5a-97bbc818a692_2022-10-07T08-25-38Z.zip

Environment

  • Harvester ISO version: v1.1.0-rc2
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 4-node Harvester cluster on bare-metal machines

Additional context

  • Environment setup before the upgrade:

  • 3 VMs (screenshots)

  • Node status (screenshot)

  • 3 Backup status (screenshot)

  • 3 Volume status (screenshot)

  • 1 Network VLAN (1216) status (screenshot)

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

The cause is that the preload job takes more than 15 minutes to load images (we use a SUC plan to spawn jobs, and its default timeout is 15 minutes). In this release we have more images to preload. I’ve checked David’s two environments: one has a SAS HDD and the other one has an SMR HDD. Neither can survive the image preload stage.
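One way to confirm the deadline hit on a slow node is to look at the jobs SUC spawns for the plan. This is only a sketch; it assumes the plan jobs live in cattle-system and carry the upgrade.cattle.io/plan label:

$ kubectl -n cattle-system get jobs -l upgrade.cattle.io/plan \
    -o custom-columns=NAME:.metadata.name,DEADLINE:.spec.activeDeadlineSeconds,SUCCEEDED:.status.succeeded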

containerd.log

time="2022-10-11T03:19:23.951938936Z" level=info msg="StartContainer for \"d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182\" returns successfully"

<...
time="2022-10-11T03:32:12.900717689Z" level=info msg="ImageCreate event &ImageCreate{Name:sha256:57070fe87937e599433c74eeea6b8be7f804dc121d2f42960e39a9923cf6ec15,Labels:map[string]string{io
.cri-containerd.image: managed,},XXX_unrecognized:[],}"
time="2022-10-11T03:32:12.901859283Z" level=info msg="ImageUpdate event &ImageUpdate{Name:docker.io/rancher/harvester-cluster-repo:v1.1.0-rc2,Labels:map[string]string{io.cri-containerd.imag
e: managed,},XXX_unrecognized:[],}"
time="2022-10-11T03:34:20.046354983Z" level=info msg="StopContainer for \"d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182\" with timeout 30 (s)"
time="2022-10-11T03:34:20.047112338Z" level=info msg="Stop container \"d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182\" with signal terminated"
time="2022-10-11T03:34:20.621575157Z" level=info msg="StopContainer for \"19c87bd3460f813216db8f2e6e7bdd1c4a776ddf6ff62d3cb990dd2f36f10222\" with timeout 60 (s)"
time="2022-10-11T03:34:20.622383185Z" level=info msg="Stop container \"19c87bd3460f813216db8f2e6e7bdd1c4a776ddf6ff62d3cb990dd2f36f10222\" with signal terminated"
time="2022-10-11T03:34:23.513366732Z" level=error msg="(*service).Write failed" error="rpc error: code = Canceled desc = context canceled" ref=tar-e9606b43b68a94da4bb940d9b519a2691e8fd2569e
823811fc3555da41c972b0/layer.tar total=94569472
time="2022-10-11T03:34:28.033964681Z" level=error msg="collecting metrics for d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182" error="cgroups: cgroup deleted: unknown"
time="2022-10-11T03:34:30.107030186Z" level=error msg="failed to handle container TaskExit event &TaskExit{ContainerID:d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182,ID:d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182,Pid:16704,ExitStatus:143,ExitedAt:2022-10-11 03:34:20.105466463 +0000 UTC,XXX_unrecognized:[],}" error="failed to stop container: context deadline exceeded: unknown"
time="2022-10-11T03:34:31.632701731Z" level=info msg="TaskExit event &TaskExit{ContainerID:d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182,ID:d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182,Pid:16704,ExitStatus:143,ExitedAt:2022-10-11 03:34:20.105466463 +0000 UTC,XXX_unrecognized:[],}"
time="2022-10-11T03:34:33.633018154Z" level=error msg="get state for d83ec0cfda8e8a81df999796b34c38ac449fb59813e12ed239ad29c0f099c182" error="context deadline exceeded: unknown"

Ideally, the job should retry, but we mark the upgrade as a failure and delete the upgrade repo VM after the first failure. There are some ways to work around this on slow disks:

  • Increase the job deadline before upgrading:
    $ cat > /tmp/fix.yaml <<EOF
    spec:
      values:
        systemUpgradeJobActiveDeadlineSeconds: "3600"
    EOF
    
    $ kubectl patch managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local --patch-file=/tmp/fix.yaml --type merge
    $ kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
    
  • Delete the upgrade and start over again (this might take several attempts). Preload time will be reduced because some images are already present.
  • Manually preload images from the ISO on each node first.
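For the last workaround, a rough sketch of manually importing the bundled images from a mounted ISO is below. The archive path, compression format, and containerd socket location are assumptions that can differ between releases, so treat it as illustrative only:

$ mount -o loop harvester-v1.1.0-rc2-amd64.iso /mnt
$ for archive in /mnt/bundle/harvester/images/*.tar.zst; do
    # Decompress each bundled archive and import it into containerd's k8s.io namespace.
    zstd -d -c "$archive" | ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images import -
  done
$ umount /mnt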

Reran the upgrade with the workaround stated in https://github.com/harvester/harvester/issues/2894#issuecomment-1274069690

By preparing all Harvester cluster nodes on SSD drives and increasing the job deadline before clicking the upgrade button on the dashboard:

$ cat > /tmp/fix.yaml <<EOF
spec:
  values:
    systemUpgradeJobActiveDeadlineSeconds: "3600"
EOF

$ kubectl patch managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local --patch-file=/tmp/fix.yaml --type merge
$ kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
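To confirm the patch took effect before starting the upgrade, a quick check against the same managed chart (it should print 3600):

$ kubectl -n fleet-local get managedcharts.management.cattle.io local-managed-system-upgrade-controller \
    -o jsonpath='{.spec.values.systemUpgradeJobActiveDeadlineSeconds}'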
  • We can complete the image preload stage (screenshot)

  • Proceed to upgrade the nodes (screenshot)

  • And successfully upgrade to v1.1.0-rc2 (screenshot)

@khushboo-rancher, I tried running the upgrade to v1.1.0-rc3 on a 3-node Harvester cluster with all nodes installed on SSD disks, without setting the job-deadline workaround from https://github.com/harvester/harvester/issues/2894#issuecomment-1274069690. In this trial we could still upgrade successfully.

Thanks @connorkuehl. Upstream PR merged: https://github.com/rancher/system-upgrade-controller/pull/244. We also need to check whether the SUC change will make it into Rancher 2.7.5; if not, we probably need to wait for the next milestone.

For now, we will add a suggestion for HDD users to the upgrade doc first.

Thanks Khushboo, my issue has been fixed by changing the requested memory value for the Prometheus Node Exporter.

Error from "kubectl get bundles -A":

fleet-local   mcc-rancher-monitoring       0/1   ErrApplied(1) [Cluster fleet-local/local: cannot patch "rancher-monitoring-prometheus-node-exporter" with kind DaemonSet: DaemonSet.apps "rancher-monitoring-prometheus-node-exporter" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "10Gi": must be less than or equal to memory limit]
fleet-local   mcc-rancher-monitoring-crd   1/1

  1. We edited the Prometheus Node Exporter requested memory value from 10Gi to 20Mi and it worked well. The upgrade continued after this change.
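For illustration, one way to apply such a change is to patch the rancher-monitoring managed chart values. The chart name and values path below are assumptions based on the usual rancher-monitoring subchart layout, not taken from this report, so verify them against your cluster first:

$ cat > /tmp/node-exporter-fix.yaml <<EOF
spec:
  values:
    prometheus-node-exporter:
      resources:
        requests:
          memory: 20Mi  # keep the request at or below the configured memory limit
EOF

$ kubectl patch managedcharts.management.cattle.io rancher-monitoring --namespace fleet-local --patch-file=/tmp/node-exporter-fix.yaml --type merge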

@TachunLin @bk201 thanks for sharing the workaround. I recently ran into a similar situation in my homelab where this same thing occurred, but the workaround seems to have fixed it. Should we mention this workaround in the release notes for v1.1.0? I was running a virtualized instance of Harvester, not on bare metal, using a /vda-based mount point for the virtio drive; the underlying hardware, however, was a SATA3-based SSD on my 1U SuperMicro: https://www.supermicro.com/products/archive/motherboard/x9drd-it_ (using 1 of the 2 SATA 3 ports).

@khushboo-rancher

The failure is due to storage and network read/write speed. The upgrade tries to spread images to all nodes, which puts a big load on storage and network, and storage may be the main bottleneck. In our local test on a KVM-based cluster, it seems OK.
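As a quick gauge of whether the disk is the bottleneck before retrying, a rough sequential-write check such as the following can help. The target path is an assumption; pick a directory on the disk that backs the node's storage:

$ dd if=/dev/zero of=/var/lib/harvester/ddtest bs=1M count=2048 oflag=direct status=progress
$ rm /var/lib/harvester/ddtest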

The workaround will be to test it on newer/faster servers.

cc @bk201 @starbops