harvester: [BUG] Multiple nodes are upgrading simultaneously in an RKE2 upgrade
Describe the bug
Upgrade to:
- RKE2: v1.24.6+rke2r1
- Rancher: v2.6.9-rc3
In a multi-node setup, the upgrade can get stuck midway. Behind the scenes, multiple nodes (>= 2) are upgrading RKE2 simultaneously, which means more than one node is unschedulable at the same time. While a node is in the pre-drain state, this can result in the following:
- The Longhorn volumes cannot go back from the "degraded" to the "healthy" state due to replica scheduling failure
- The pre-drain Job cannot live-migrate VMs off the node they run on
Most of the time, the pre-drain Job gets stuck in the first situation, because we only start to live-migrate VMs after all the volumes are healthy.
Below is the crime scene:
node-0:~ # k -n harvester-system get upgrades
NAME AGE
hvst-upgrade-t9mbq 4h29m
node-0:~ # k -n harvester-system get upgrades hvst-upgrade-t9mbq -o jsonpath='{.status.conditions}' | jq .
[
{
"status": "Unknown",
"type": "Completed"
},
{
"lastUpdateTime": "2022-10-11T09:46:52Z",
"status": "True",
"type": "ImageReady"
},
{
"lastUpdateTime": "2022-10-11T09:49:11Z",
"status": "True",
"type": "RepoReady"
},
{
"lastUpdateTime": "2022-10-11T10:36:19Z",
"status": "True",
"type": "NodesPrepared"
},
{
"lastUpdateTime": "2022-10-11T10:59:49Z",
"status": "True",
"type": "SystemServicesUpgraded"
},
{
"status": "Unknown",
"type": "NodesUpgraded"
}
]
node-0:~ # k -n harvester-system get upgrades hvst-upgrade-t9mbq -o jsonpath='{.status.nodeStatuses}' | jq .
{
"node-0": {
"state": "Succeeded"
},
"node-1": {
"state": "Pre-drained"
},
"node-2": {
"state": "Pre-draining"
},
"node-3": {
"state": "Images preloaded"
}
}
The first node, node-0, was upgraded successfully. But node-1 and node-2 are being upgraded at the same time (they did not start at exactly the same moment, but this still violates the expectation of upgrading one node at a time). Both node-1 and node-2 are marked SchedulingDisabled.
node-0:~ # k get no
NAME STATUS ROLES AGE VERSION
node-0 Ready control-plane,etcd,master 5d11h v1.24.6+rke2r1
node-1 Ready,SchedulingDisabled control-plane,etcd,master 5d10h v1.24.6+rke2r1
node-2 Ready,SchedulingDisabled control-plane,etcd,master 5d10h v1.22.12+rke2r1
node-3 Ready <none> 5d10h v1.22.12+rke2r1
Now check the upgrade-related Pods. The post-drain Job on the first upgraded node, node-0, has completed, which means node-0 was fully upgraded without issues. A pre-drain Job is still running on node-2, which implies node-2 is in the pre-draining state and its VMs should be evacuated to other nodes. What about node-1? From the output of k get no above, we know RKE2 on node-1 has already been upgraded to v1.24.6+rke2r1. The uptime of node-1 is the same as node-2 and node-3 (given that all 4 nodes were booted at the same time), so we can infer that node-1 has never entered the post-drain state. Yet somehow node-2 was triggered to upgrade before the upgrade completed on node-1.
node-0:~ # k -n harvester-system get po -l harvesterhci.io/upgradeComponent=node
NAME READY STATUS RESTARTS AGE
hvst-upgrade-t9mbq-post-drain-node-0-fbrwk 0/1 Completed 0 3h10m
hvst-upgrade-t9mbq-pre-drain-node-2-rbj2w 1/1 Running 0 160m
From the log of the pre-drain Job on node-2, we can see that it keeps waiting for the Longhorn volumes to become healthy.
node-0:~ # k -n harvester-system logs hvst-upgrade-t9mbq-pre-drain-node-2-rbj2w --since=1m
+ '[' true ']'
+ '[' 4 -gt 2 ']'
++ kubectl get volumes.longhorn.io/pvc-306f865e-5bfa-4d12-8779-fe3371425305 -n longhorn-system -o 'jsonpath={.status.robustness}'
+ robustness=degraded
+ '[' degraded = healthy ']'
+ '[' -f /tmp/skip-pvc-306f865e-5bfa-4d12-8779-fe3371425305 ']'
+ echo 'Waiting for volume pvc-306f865e-5bfa-4d12-8779-fe3371425305 to be healthy...'
+ sleep 10
Waiting for volume pvc-306f865e-5bfa-4d12-8779-fe3371425305 to be healthy...
+ '[' true ']'
+ '[' 4 -gt 2 ']'
++ kubectl get volumes.longhorn.io/pvc-306f865e-5bfa-4d12-8779-fe3371425305 -n longhorn-system -o 'jsonpath={.status.robustness}'
Waiting for volume pvc-306f865e-5bfa-4d12-8779-fe3371425305 to be healthy...
+ robustness=degraded
+ '[' degraded = healthy ']'
+ '[' -f /tmp/skip-pvc-306f865e-5bfa-4d12-8779-fe3371425305 ']'
+ echo 'Waiting for volume pvc-306f865e-5bfa-4d12-8779-fe3371425305 to be healthy...'
+ sleep 10
+ '[' true ']'
+ '[' 4 -gt 2 ']'
++ kubectl get volumes.longhorn.io/pvc-306f865e-5bfa-4d12-8779-fe3371425305 -n longhorn-system -o 'jsonpath={.status.robustness}'
Waiting for volume pvc-306f865e-5bfa-4d12-8779-fe3371425305 to be healthy...
+ robustness=degraded
+ '[' degraded = healthy ']'
+ '[' -f /tmp/skip-pvc-306f865e-5bfa-4d12-8779-fe3371425305 ']'
+ echo 'Waiting for volume pvc-306f865e-5bfa-4d12-8779-fe3371425305 to be healthy...'
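For readability, below is a hypothetical reconstruction of the wait loop that produces the trace above (a sketch, not the exact upstream script; the `4 -gt 2` test in the trace is assumed to be a node-count check):

```bash
# Sketch of the pre-drain wait loop, reconstructed from the -x trace above.
vol="pvc-306f865e-5bfa-4d12-8779-fe3371425305"
while true; do
  # Assumption: only wait for healthy volumes on clusters with more than 2 nodes.
  [ "$(kubectl get nodes --no-headers | wc -l)" -gt 2 ] || break
  robustness=$(kubectl get "volumes.longhorn.io/$vol" -n longhorn-system \
    -o 'jsonpath={.status.robustness}')
  # Exit the loop once the volume reports healthy.
  [ "$robustness" = healthy ] && break
  # An operator can create this file inside the pod to skip the wait.
  [ -f "/tmp/skip-$vol" ] && break
  echo "Waiting for volume $vol to be healthy..."
  sleep 10
done
```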
The volume can never become healthy in the current situation, because there are only 2 schedulable nodes (node-0 and node-3) for all 3 replicas to run on. We can confirm this by examining the conditions of the volume, which show a ReplicaSchedulingFailure:
node-0:~ # k -n longhorn-system get volumes pvc-306f865e-5bfa-4d12-8779-fe3371425305 -o jsonpath='{.status.conditions}' | jq .
[
{
"lastProbeTime": "",
"lastTransitionTime": "2022-10-11T09:46:57Z",
"message": "",
"reason": "",
"status": "False",
"type": "restore"
},
{
"lastProbeTime": "",
"lastTransitionTime": "2022-10-11T11:30:29Z",
"message": "",
"reason": "ReplicaSchedulingFailure",
"status": "False",
"type": "scheduled"
},
{
"lastProbeTime": "",
"lastTransitionTime": "2022-10-11T09:46:57Z",
"message": "",
"reason": "",
"status": "False",
"type": "toomanysnapshots"
}
]
And the running replicas of the volume exist only on node-0 and node-3:
node-0:~ # k -n longhorn-system get lhr -l longhornvolume=pvc-306f865e-5bfa-4d12-8779-fe3371425305
NAME STATE NODE DISK INSTANCEMANAGER IMAGE AGE
pvc-306f865e-5bfa-4d12-8779-fe3371425305-r-4477ad68 running node-0 24e133ac-90bf-47ea-bf3f-0a519adda3ec instance-manager-r-7cffeda1 longhornio/longhorn-engine:v1.3.1 5h58m
pvc-306f865e-5bfa-4d12-8779-fe3371425305-r-8f915cdf stopped 4h14m
pvc-306f865e-5bfa-4d12-8779-fe3371425305-r-f902b787 running node-3 1e8daca0-9a5b-47fc-b17a-e87da5c49edf instance-manager-r-3bf011a4 longhornio/longhorn-engine:v1.3.1 5h58m
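To confirm how many nodes are still schedulable at this point, a quick check like the following can help (a convenience command, not part of the original report):

```bash
# List each node and whether it is cordoned. Longhorn needs one schedulable
# node per replica by default, so 3 replicas cannot all be placed when only
# 2 nodes are schedulable.
kubectl get nodes -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
```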
The whole upgrade is stuck in the pre-draining state on node-2 indefinitely, because the volume it waits for can never become healthy.
Now let's look at the other side. The upgrade controller instructs Rancher to upgrade RKE2 on each node by updating the clusters.provisioning.cattle.io/local CR. Here the upgrade concurrency of both the control plane and the workers is set to 1 to make sure the Harvester upgrade proceeds one node at a time.
node-0:~ # k -n fleet-local get clusters local -o jsonpath='{.spec.rkeConfig}' | jq .
{
"chartValues": null,
"machineGlobalConfig": null,
"provisionGeneration": 1,
"upgradeStrategy": {
"controlPlaneConcurrency": "1",
"controlPlaneDrainOptions": {
"deleteEmptyDirData": true,
"enabled": true,
"force": true,
"ignoreDaemonSets": true,
"postDrainHooks": [
{
"annotation": "harvesterhci.io/post-hook"
}
],
"preDrainHooks": [
{
"annotation": "harvesterhci.io/pre-hook"
}
],
"timeout": 0
},
"workerConcurrency": "1",
"workerDrainOptions": {
"deleteEmptyDirData": true,
"enabled": true,
"force": true,
"ignoreDaemonSets": true,
"postDrainHooks": [
{
"annotation": "harvesterhci.io/post-hook"
}
],
"preDrainHooks": [
{
"annotation": "harvesterhci.io/pre-hook"
}
],
"timeout": 0
}
}
}
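For illustration only (an assumption about how one might apply this by hand; the Harvester upgrade controller patches the CR programmatically), the same concurrency settings could be applied with a merge patch:

```bash
# Set both control-plane and worker upgrade concurrency to 1 on the local cluster CR.
kubectl -n fleet-local patch clusters.provisioning.cattle.io local --type merge \
  -p '{"spec":{"rkeConfig":{"upgradeStrategy":{"controlPlaneConcurrency":"1","workerConcurrency":"1"}}}}'
```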
The pre-drain and post-drain annotations are set on the corresponding machine-plan secrets while RKE2 on the node/machine is being upgraded.
node-0:~ # k -n fleet-local get machines
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
custom-2d94d5d682dc local node-3 rke2://node-3 Running 5d13h
custom-7c1afab6e79d local node-0 rke2://node-0 Running 5d14h
custom-929d403d1670 local node-1 rke2://node-1 Running 5d14h
custom-c05d0d11190c local node-2 rke2://node-2 Running 5d13h
Check the annotations of the secret custom-929d403d1670-machine-plan (node-1):
node-0:~ # k -n fleet-local get secrets custom-929d403d1670-machine-plan -o jsonpath='{.metadata.annotations}' | jq .
{
"harvesterhci.io/pre-hook": "{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"
preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}",
"objectset.rio.cattle.io/applied": "H4sIAAAAAAAA/4xTwW7bOhD8lYc9i36iLMuUgF6K9pRDgaDope5htVzGqihSIOmkhaF/L6TEaO3Ebo8iZ4azo9kjDJxQY0JojoDO+YSp8y7On779zpQip1Xo/IowJcurzv/faWgg9CwGpH3nGLKrUP/kOIiHx/6Z8cfNo8z+u+ucfnd/9/G99ymmgONflRwODA3QISY/iLqodZmvtay2+T9R4
4g0841lTsJ6QgvZhTGLLdsIDRx3sMfwyDFx2FM33w3o8IH1DpodpHDgHUwwZUCBl9Q+dwPHhMMIjTtYm8FJ6whkD7PO6ofoVZylXg5OE528XHkRmuW9G0PuMe6hgZJVyVpLs1FbluWm1kySVIFG57piaTA3bZHjq7Gv+LkAeZeCt2K06FgEb/m3sXMkJ9I3AS/dEUuZyKDKDVY5ttsNb3K5xpwKhRuqeFOqqjV1JZlbY1BxhbxeV0pSqZTmbV7L
q+I363JOefKh53BmecrgusCp/EsW8Ix8u15L/e7ZcGBHHKH5egQcuy8cYufdG4sBGbTWU/9pJn5gy2nBzZ4yePkFlsPppO/cnOHFHt0c/bCkLg1TgYUSZNZSlKRroYoqF8xUV4hI27KG6duUQfo58iujZwFMvwIAAP//B0mT1koEAAA",
"objectset.rio.cattle.io/id": "rke-machine",
"objectset.rio.cattle.io/owner-gvk": "rke.cattle.io/v1, Kind=RKEBootstrap",
"objectset.rio.cattle.io/owner-name": "custom-929d403d1670",
"objectset.rio.cattle.io/owner-namespace": "fleet-local",
"rke.cattle.io/drain-done": "{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"
preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}",
"rke.cattle.io/drain-options": "{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}]
,\"preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}",
"rke.cattle.io/join-url": "https://192.168.122.121:9345",
"rke.cattle.io/labels": "{\"harvesterhci.io/managed\":\"true\"}",
"rke.cattle.io/post-drain": "{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"
preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}",
"rke.cattle.io/pre-drain": "{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"p
reDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}",
"rke.cattle.io/uncordon": "{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"pr
eDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}"
}
Check the annotations of the secret custom-c05d0d11190c-machine-plan (node-2):
node-0:~ # k -n fleet-local get secrets custom-c05d0d11190c-machine-plan -o jsonpath='{.metadata.annotations}' | jq .
{
"objectset.rio.cattle.io/applied": "H4sIAAAAAAAA/4xTzW7bPBB8lQ97Fv3xRz+WgF6K9pRDgaDope5hSa1iVRQpkHTSwtC7F1JitHZit0eRM8PZ0ewRRkrYYkJojoDO+YSp9y4un15/J5MipU3o/cZgSpY2vf+/b6GBMBAb0ex7R5BdhfonR4E9PA7PjD9uHkX2313v2nf3dx/fe59iCjj9VcnhSNCAOcTkR2Z40fJWCFFz80/UO
KFZ+J0lSsx6gxayC2MWNdkIDRx3sMfwSDFR2Jt+uRvR4QO1O2h2kMKBdjDDnIEJtKb2uR8pJhwnaNzB2gxOWkcw9rDobH6wYRsXqZeD00QnL1dehGZ978aQe4x7aEDUXFVVVZpC867OqdNFZbiSnSwqIkmC2iLXong19hU/FyDvUvCWTRYdseAt/TZ2jqRk2puAl+6wtUwSS8GraivzUnAjSCkjtFZc5LKotULTKSOR1FbVeVFKIlV32miqJKdc
mKviN+tyTnnyYaBwZnnO4LrAqfxrFvCMfLtea/3uqaNAzlCE5usRcOq/UIi9d28sBmSgrTfDp4X4gSylFbd4yuDlF1gKp5Ohd0uGF3t0c/TDmnpVtCi0RFbKHFle5i3TUhKrW56rnKtyW3cwf5szSD8nemX0LID5VwAAAP//98YwqkoEAAA",
"objectset.rio.cattle.io/id": "rke-machine",
"objectset.rio.cattle.io/owner-gvk": "rke.cattle.io/v1, Kind=RKEBootstrap",
"objectset.rio.cattle.io/owner-name": "custom-c05d0d11190c",
"objectset.rio.cattle.io/owner-namespace": "fleet-local",
"rke.cattle.io/drain-options": "{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}]
,\"preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}",
"rke.cattle.io/join-url": "https://192.168.122.122:9345",
"rke.cattle.io/labels": "{\"harvesterhci.io/managed\":\"true\"}",
"rke.cattle.io/pre-drain": "{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"p
reDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}"
}
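A quick way (an assumption for convenience, not taken from the original report) to see the drain-related annotations on all machine-plan secrets at once:

```bash
# Dump only the rke.cattle.io/* and harvesterhci.io/* annotations of each
# machine-plan secret, to see which machines are currently mid-drain.
for s in $(kubectl -n fleet-local get secrets -o name | grep machine-plan); do
  echo "== $s"
  kubectl -n fleet-local get "$s" -o jsonpath='{.metadata.annotations}' \
    | jq 'with_entries(select(.key | startswith("rke.cattle.io/") or startswith("harvesterhci.io/")))'
done
```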
The Rancher logs show that the planner is stuck waiting for node-1 (machine custom-929d403d1670) to be uncordoned:
node-0:~ # k -n cattle-system logs -l app=rancher
W1012 02:39:20.056363 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 02:45:04.057107 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 02:47:50.962077 33 warnings.go:80] network.harvesterhci.io/v1beta1 NodeNetwork is deprecated
W1012 02:52:42.058995 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 02:53:17.964514 33 warnings.go:80] network.harvesterhci.io/v1beta1 NodeNetwork is deprecated
W1012 03:00:00.096962 33 transport.go:288] Unable to cancel request for *client.addQuery
W1012 03:00:25.077424 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 03:00:36.965184 33 warnings.go:80] network.harvesterhci.io/v1beta1 NodeNetwork is deprecated
W1012 03:05:52.966654 33 warnings.go:80] network.harvesterhci.io/v1beta1 NodeNetwork is deprecated
W1012 03:08:43.085276 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2022/10/12 03:00:05 [INFO] [planner] rkecluster fleet-local/local: waiting: uncordoning etcd node(s) custom-929d403d1670: waiting for uncordon to finish
2022/10/12 03:03:13 [INFO] Downloading repo index from http://harvester-cluster-repo.cattle-system/charts/index.yaml
2022/10/12 03:03:14 [INFO] Downloading repo index from https://releases.rancher.com/server-charts/stable/index.yaml
2022/10/12 03:08:13 [INFO] Downloading repo index from http://harvester-cluster-repo.cattle-system/charts/index.yaml
2022/10/12 03:08:14 [INFO] Downloading repo index from https://releases.rancher.com/server-charts/stable/index.yaml
2022/10/12 03:08:23 [INFO] [planner] rkecluster fleet-local/local: waiting: uncordoning etcd node(s) custom-929d403d1670: waiting for uncordon to finish
2022/10/12 03:08:23 [INFO] [planner] rkecluster fleet-local/local: waiting: uncordoning etcd node(s) custom-929d403d1670: waiting for uncordon to finish
2022/10/12 03:09:05 [INFO] [planner] rkecluster fleet-local/local: waiting: uncordoning etcd node(s) custom-929d403d1670: waiting for uncordon to finish
2022/10/12 03:09:08 [INFO] [planner] rkecluster fleet-local/local: waiting: uncordoning etcd node(s) custom-929d403d1670: waiting for uncordon to finish
2022/10/12 03:10:06 [INFO] [planner] rkecluster fleet-local/local: waiting: uncordoning etcd node(s) custom-929d403d1670: waiting for uncordon to finish
W1012 02:34:06.943613 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 02:39:19.762602 33 warnings.go:80] network.harvesterhci.io/v1beta1 NodeNetwork is deprecated
W1012 02:40:24.941360 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 02:47:05.943248 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 02:47:15.766388 33 warnings.go:80] network.harvesterhci.io/v1beta1 NodeNetwork is deprecated
W1012 02:52:50.945579 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 02:54:48.767367 33 warnings.go:80] network.harvesterhci.io/v1beta1 NodeNetwork is deprecated
W1012 02:59:23.961176 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1012 03:04:26.770103 33 warnings.go:80] network.harvesterhci.io/v1beta1 NodeNetwork is deprecated
W1012 03:05:31.964012 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
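To cut through the noise in the Rancher logs, something like the following can be used to show only the planner messages (a convenience filter, assuming the same app=rancher label selector used above):

```bash
# Show recent planner messages from the Rancher pods only.
kubectl -n cattle-system logs -l app=rancher --tail=200 | grep '\[planner\]'
```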
To Reproduce
Steps to reproduce the behavior:
- Prepare a multi-node (>= 3) Harvester cluster running v1.0.3
- Trigger an upgrade to v1.1.0-rc2
- Wait for the upgrade to get stuck in the pre-draining state (not always reproducible), and check the nodes' states to confirm the issue has occurred
Expected behavior
The RKE2 upgrade should be kicked off one node at a time, and the Harvester upgrade should complete successfully.
Support bundle
supportbundle_ebe4b217-6355-4356-823a-f9e5f09e28b4_2022-10-11T16-46-36Z.zip
Environment
- Harvester ISO version: v1.1.0-rc2
- Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):
Additional context
Related issue: rancher/rancher#39167 Related issues in previous versions: rancher/rancher#35999, rancher/rancher#37502
The issue can be temporarily worked around by uncordoning the node manually, e.g. kubectl uncordon node-1 in the case above. The upgrade will then proceed. However, there is a high chance of hitting other RKE2 issues such as #2893, in which case restarting rke2-server on the corresponding node might be necessary.
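A sketch of the manual workaround described above (adjust the node name to your case):

```bash
# Uncordon the node that Rancher is waiting on, so the upgrade can proceed.
kubectl uncordon node-1

# If the node then hits an RKE2 issue like #2893, restarting the rke2-server
# service on that node may be necessary (run this on the node itself).
systemctl restart rke2-server.service
```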
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 22 (13 by maintainers)
Commits related to this issue
- fix(upgrade): make sure only one node pre/post-draining This is a workaround for #2907, to prevent multi-node draining simultaneously. Signed-off-by: Zespre Chang <zespre.chang@suse.com> — committed to starbops/harvester by starbops 2 years ago
- fix(upgrade): make sure only one node pre/post-draining This is a workaround for #2907, to prevent multi-node draining simultaneously. Signed-off-by: Zespre Chang <zespre.chang@suse.com> — committed to harvester/harvester by starbops 2 years ago
- fix(upgrade): make sure only one node pre/post-draining This is a workaround for #2907, to prevent multi-node draining simultaneously. Signed-off-by: Zespre Chang <zespre.chang@suse.com> — committed to bk201/harvester by starbops 2 years ago
@irishgordo It's normal to see some error logs during an upgrade since there are version bumps and nodes going offline. I left some comments here:
1. We need to check whether the cluster is ready at the end; if so, it should be OK. Run kubectl get clusters.provisioning.cattle.io local -n fleet-local -o yaml to check if it's ready at the end.
2. autoscaling/v2 is deprecated in v1.23. (We upgrade to v1.24.x.)
3, 4: This should be proper behavior during an upgrade, since we turn nodes on and off.
5, 6: This is due to the Kubernetes versions being mismatched during an upgrade: some nodes are v1.24.x and some are still v1.22. Eventually, those helm-install pods should succeed.
7. Known issue. The job will run again and succeed.
8. @FrankYang0529 Please take a look. It might be related to the hostname or something.

@irishgordo Yes, we can track the cattle-system/sync-containerd-registry issue in another ticket. Thanks!

@starbops you're welcome, I'm glad I could help! 😄 – I do think perhaps there is something funny with the I/O limit on the disk, but the strange thing is that it is SATA3, a Kingston or Samsung SSD (I don't recall the brand 😅) that's installed on my 1U server. ( https://www.supermicro.com/products/archive/motherboard/x9drd-it_ )

@bk201 I could close this out and open up another ticket for the things found with number 8 - with the RFC-1123 for cattle-system/sync-containerd-registry, if that works?

Hi @irishgordo, thank you for spending so much time on this! Just curious, what's the type of your disks? Is it HDD? A slow disk could cause that many interesting errors that eventually self-healed.
I think we can only wait for rancher/rancher#39167 to reproduce and fix it from the root. This issue is just for a record of what we can do to get around this at this critical moment. Since we have a certain degree of protection and workarounds on our side to overcome it, I think we can close this out as you suggested.
cc @bk201
Updating some new observations here.
In this kind of situation, the fix in #2923 can actually guard the upgrade procedure. It will hold up the pre-drain Job that’s going to be placed on node-2 and let the draining and post-drain Job finish on node-1 without any interference. After node-1 is fully upgraded, the pre-drain Job will be placed on node-2 and proceed with the upgrade.
But if node-2 is put into the SchedulingDisabled mode by Rancher before node-1 is pre-drained, there is a high chance that the pre-drain Job on node-1 will get stuck waiting for all Longhorn volumes to become healthy. The workaround is to uncordon node-2, or to set the degraded volumes' numberOfReplicas to 2 or 1 to make them healthy (the former is preferred).
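One way to apply the second (less preferred) workaround, assuming the volume name from the case above and that the Longhorn UI is not being used:

```bash
# Temporarily lower the replica count of the degraded volume so it can
# report healthy again with only 2 schedulable nodes.
kubectl -n longhorn-system patch volumes.longhorn.io pvc-306f865e-5bfa-4d12-8779-fe3371425305 \
  --type merge -p '{"spec":{"numberOfReplicas":2}}'
# Remember to restore numberOfReplicas to 3 after the upgrade completes.
```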
@starbops thanks for looking into this! 😄 I spent about 3.5 hours working through another run of this - setting up the environment and then running the upgrade on three QEMU/KVM VMs (8 cores, 16 GiB RAM, 200 GiB disk each) on a Supermicro 1U, testing the upgrade from v1.0.3 to Harvester v1.1.0-rc3.
I wasn't able to see it hang like it did in the past.
In order to prepare the 3-node cluster I needed to run through a few preparatory steps, as the first time around I did encounter the "Job was active longer than specified deadline" issue - https://github.com/harvester/harvester/issues/2894
Interesting Things I Saw:
All the interesting things I saw seemed to resolve themselves and did not impact the upgrade. The upgrade itself took about 2 hours on 3 nodes.
The Process I Noticed
Test Artifacts
Additional Test Artifact
I have the full length video 1.4Gi available if desired. Here is it sped up a lot.
https://user-images.githubusercontent.com/5370752/196349175-8d7fece7-b70f-4f6c-bb97-6a2969b713f3.mp4
@starbops - since it succeeded, and the things I noticed that arose ended up seeming to resolve themselves, I feel like we could close this out, given that we have workarounds for some of the old instance-manager pods and a few other things like rebooting the node if it does get stuck. The nodes seem to be behaving correctly in terms of pre-draining, draining, post-draining, rebooting, and succeeding - and only one is doing that at any given time 😄 - What are your thoughts?