rancher: Init Node cannot be removed

Rancher Server Setup

  • Rancher version: 2.7.6
  • Installation option (Docker install/Helm Chart): Helm (1.26.7+rke2r1)

Information about the Cluster

  • Kubernetes version: 1.26.7 RKE2
  • Cluster Type (Local/Downstream): Downstream, vSphere

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • Admin

Describe the bug

We were trying to cycle the master pool with an updated image template. However, provisioning got stuck at the point where the old init node should have been removed and a new one elected. The error message produced by Rancher is:

[rkebootstrap] fleet-default/XXXXXXXXXXX-bootstrap-template-tk7vg: cluster fleet-default/XXXXXXXX machine fleet-default/XXXXXXXXXX-worker-844f8b754f-7wknz was still joined to deleting etcd machine fleet-default/XXXXXXXXXXX-master-6b494f986c-kg7df

We were able to work around this by removing the label rke.cattle.io/init-node: "true" from the machine plan secret of the old init node and adding it to another master node’s machine plan secret.
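The workaround above can be sketched with kubectl. This is a hedged sketch, not an exact transcript: the secret names below are placeholders, and it assumes the machine plan secrets live in the cluster's fleet workspace namespace (typically fleet-default) and carry the `rke.cattle.io/init-node` label mentioned above.

```shell
# Find the machine plan secrets (names below are hypothetical placeholders).
kubectl -n fleet-default get secrets | grep machine-plan

# Remove the init-node label from the old init node's machine plan secret
# (the trailing "-" in kubectl label removes a label).
kubectl -n fleet-default label secret <old-master>-machine-plan \
  rke.cattle.io/init-node-

# Add the label to another healthy master node's machine plan secret so
# the planner treats it as the new init node.
kubectl -n fleet-default label secret <new-master>-machine-plan \
  rke.cattle.io/init-node=true
```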

This happened on all clusters where we were trying to cycle the master pool so far.

To Reproduce

  • create a cluster (3x master, 3x worker)
  • edit the master pool template to trigger a node rotation (image template, specs, etc.)

Result

Provisioning gets stuck when the init node is about to be removed.

Expected Result

Node pool should be rotated without issues.

Screenshots

It’s stuck in this state forever without intervention. [screenshot attachment]

Additional context

  • Issue was probably introduced with Rancher 2.7.6
  • Seems to NOT happen when scaling down a pool

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Comments: 23 (7 by maintainers)

Most upvoted comments

@thomashoell thank you so much for that.

I think I’ve identified the issue: it stems from the etcd snapshot creation logic within Rancher. I believe a workaround would be to set your spec.EtcdSnapshotCreate back to nil so that the planner doesn’t try to “find” an init node for etcd snapshot creation.
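One way to clear that field is a JSON patch on the provisioning Cluster object. This is a sketch under assumptions: it assumes the field is exposed as spec.rkeConfig.etcdSnapshotCreate on the provisioning.cattle.io Cluster resource in the fleet-default namespace, and `<cluster-name>` is a placeholder. Verify the exact path in your cluster's object before applying.

```shell
# Inspect the current value first (path assumed, verify on your object).
kubectl -n fleet-default get clusters.provisioning.cattle.io <cluster-name> \
  -o jsonpath='{.spec.rkeConfig.etcdSnapshotCreate}'

# Remove the field so the planner no longer looks for an init node
# to run a snapshot on.
kubectl -n fleet-default patch clusters.provisioning.cattle.io <cluster-name> \
  --type=json \
  -p='[{"op": "remove", "path": "/spec/rkeConfig/etcdSnapshotCreate"}]'
```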

I believe this should be reproducible on any cluster that has its init node deleted after a manual etcd snapshot creation is triggered.

The actual fix for this will be done in the codebase, but I need to think about how to fix it without causing potential regressions.