rancher: RKE2/K3S Upgrades from Rancher not working if master nodes are tainted on imported rke2/k3s clusters

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (fewest steps possible):

  • Create a highly available RKE2 cluster on an older version (e.g. v1.20.4+rke2r1) and configure the control-plane/master nodes with the CriticalAddonsOnly=true:NoExecute taint, as documented at https://docs.rke2.io/install/ha/#2a-optional-consider-server-node-taints (see the config snippet after this list). This is a best practice that prevents user workloads from running on the control-plane nodes.
  • Import the RKE2 cluster into Rancher
  • Upgrade the RKE2 cluster from the Rancher UI (Edit Cluster, Change version, Save)
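
Per the linked docs, the taint is set in /etc/rancher/rke2/config.yaml on each server node:

node-taint:
  - "CriticalAddonsOnly=true:NoExecute"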

Result: Neither the system-upgrade-controller Pod nor the Pods created by the rke2 master upgrade plan can be scheduled, because they are missing a toleration for the CriticalAddonsOnly=true:NoExecute taint.
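
The failure is visible with kubectl: the affected Pods stay Pending, and the taint shows on the server nodes (<server-node> is a placeholder):

kubectl -n cattle-system get pods --field-selector=status.phase=Pending
kubectl describe node <server-node> | grep Taints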

Other details that may be helpful:

This also affects K3s clusters where such a taint has been added, as documented at https://rancher.com/docs/k3s/latest/en/installation/ha/#2-launch-server-nodes.
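
On K3s the equivalent taint can be passed to the server at startup, e.g.:

k3s server --node-taint CriticalAddonsOnly=true:NoExecute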

Environment information

  • Rancher version (rancher/rancher image tag or shown bottom left in the UI): 2.5.7
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Imported RKE2
  • Kubernetes version (use kubectl version): v1.20.4+rke2r1

SURE-3506

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 21 (14 by maintainers)

Most upvoted comments

Adding the CriticalAddonsOnly toleration to the system-upgrade-controller allowed it to deploy.
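
For anyone applying the same workaround, a minimal sketch using kubectl patch (assuming the controller runs as a Deployment named system-upgrade-controller in the cattle-system namespace, as on this cluster; note that a merge patch replaces the entire tolerations list):

# add a toleration matching the CriticalAddonsOnly=true:NoExecute taint
kubectl -n cattle-system patch deployment system-upgrade-controller \
  --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"CriticalAddonsOnly","operator":"Equal","value":"true","effect":"NoExecute"}]}}}}'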

Updating the plan for the master nodes with the appropriate toleration allowed the upgrade to continue (managedFields omitted for brevity):

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  creationTimestamp: "2021-04-22T12:26:28Z"
  generation: 2
  labels:
    rancher-managed: "true"
  name: rke2-master-plan
  namespace: cattle-system
  resourceVersion: "210404"
  selfLink: /apis/upgrade.cattle.io/v1/namespaces/cattle-system/plans/rke2-master-plan
  uid: e433cf96-05e9-486a-90a2-493cf8267951
spec:
  concurrency: 1
  cordon: true
  drain:
    force: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  # toleration added manually to match the CriticalAddonsOnly=true:NoExecute
  # taint on the server nodes
  tolerations:
  - effect: NoExecute
    key: CriticalAddonsOnly
    operator: Equal
    value: "true"
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.20.5+rke2r1
status:
  applying:
  - controlplane01
  conditions:
  - lastUpdateTime: "2021-04-22T13:40:14Z"
    reason: Version
    status: "True"
    type: LatestResolved
  latestHash: 8328420e692ecd89c06ecce046b0f6c089a2f7b6331567e81bfb6f37
  latestVersion: v1.20.5-rke2r1
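
The same change can be made by editing the plan object in place, e.g.:

kubectl -n cattle-system edit plan.upgrade.cattle.io rke2-master-plan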

The agent nodes then updated successfully after the control plane upgrade completed.

Works for me now. The upgrade runs through successfully.