harvester: [BUG] reinstall 1st node

Describe the bug I tried to reinstall the 1st node (installed with mode: create), and after reinstallation the node did not join the existing cluster.

To Reproduce Steps to reproduce the behavior:

  1. Install a 3-node cluster.
  2. Turn off the 1st node (installed with mode: create), remove it from the cluster via GUI - Hosts - Delete host, wipe all its disks (remove the partitions with fdisk and run mkfs.ext4 /dev/sda; see the sketch after these steps), and reinstall the node (install mode: join) with a new hostname so that it joins the existing cluster of the remaining 2 nodes.
  3. The reinstalled node is not able to join the existing cluster (the Rancher bootstrap finishes successfully, but the node does not join).
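For reference, a rough sketch of the wipe in step 2 (destructive; it assumes /dev/sda is the install disk as in this report, and uses wipefs as a non-interactive stand-in for deleting the partitions with fdisk):

# remove all partition-table and filesystem signatures, then create a fresh filesystem
# (alternatively, delete the partitions interactively with: fdisk /dev/sda)
wipefs -a /dev/sda
mkfs.ext4 -F /dev/sda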

Expected behavior The reinstalled node should join the existing cluster with the remaining 2 nodes.

A support bundle can be provided if needed.

Environment

  • Harvester ISO version: v1.0.3
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): tried on 2 different environments with different HW

Additional context When I tried to reinstall the 2nd or 3rd node (installation mode: join), they joined the cluster successfully after reinstallation. The only problem is with the 1st node.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 22 (11 by maintainers)

Most upvoted comments

Test Plan 2: Reinstall management node and agent node in a new Harvester cluster

Result

Verified after upgrading from v1.0.3 to v1.1.0-rc3: we can rejoin the management node and the agent node correctly.

  1. Successfully re-joined the management node after the upgrade.

  2. Successfully re-joined the agent node after the upgrade.

Test Information

  • Test Environment: 4-node Harvester cluster on a local KVM machine
  • Harvester version: v1.1.0-rc3

Verify Steps

  1. Create a 4-node v1.0.3 cluster.
  2. Upgrade to v1.1.0-rc3.
  3. Remove the agent node and the first management node:
    • Remove the agent node (node 4).
    • Remove the management node (node 1).
  4. After the node is removed, provision a new node.
  5. Check that the node can join the cluster and can be promoted as a management node (node 1); see the sketch after these steps.
  6. After we have 3 management nodes, provision a new node and check that it can join the cluster (node 4).
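A minimal sketch of the checks in steps 5-6, assuming kubectl is run against the remaining management nodes and that RKE2's standard control-plane role label is in use:

# the re-provisioned node should reach STATUS Ready
kubectl get nodes -o wide

# after promotion, node 1 should carry the control-plane role
kubectl get nodes -l node-role.kubernetes.io/control-plane=true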

Test Plan 1: Reinstall management node and agent node in an upgraded cluster

Result

Verified after upgrading from v1.0.3 to the v1.1.0 master release: we can rejoin the management node and the agent node correctly.

  1. Successfully re-joined the management node after the upgrade.

  2. Successfully re-joined the agent node after the upgrade.

Test Information

  • Test Environment: 4-node Harvester cluster on a local KVM machine
  • Harvester version: master-0a9538a1-head (10/14)

Verify Steps

  1. Create a 4-node v1.0.3 cluster.
  2. Upgrade to the master branch:
    • Check the spec content in provisioning.cattle.io/v1/clusters -> fleet-local.
    • Check the iface content in helm.cattle.io/v1/helmchartconfigs -> rke2-canal (a kubectl sketch of both checks follows after these steps):

spec:
  valuesContent: |-
    flannel:
      iface: ""

  3. Remove the agent node and 1 management node:
    • Remove the agent node (node 4).
    • Remove the management node (node 3).
  4. After the node is removed, provision a new node.
  5. Check that the node can join the cluster and can be promoted as a management node.
  6. After we have 3 management nodes, provision a new node and check that it can join the cluster.
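A minimal sketch of the two checks in step 2, assuming the default Rancher/Harvester object names and namespaces (the provisioning cluster object named local in the fleet-local namespace, and the rke2-canal HelmChartConfig in kube-system):

# inspect the provisioning cluster spec
kubectl get clusters.provisioning.cattle.io local -n fleet-local -o yaml

# inspect the rke2-canal HelmChartConfig and confirm flannel.iface is empty
kubectl get helmchartconfigs.helm.cattle.io rke2-canal -n kube-system -o yaml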

@FrankYang0529 Great!

Please help check if #2470 is also caused by this bug, thanks.

This issue can be fixed by https://github.com/harvester/harvester-installer/pull/344 and https://github.com/rancher/rancherd/pull/25. The following are my test steps with vagrant-pxe-harvester.

Case 1: Remove the second node

  1. Set up a 3-node cluster (harvester-node-0, harvester-node-1, harvester-node-2).
  2. After the 3-node cluster is installed, run kubectl delete node harvester-node-1 to remove the node CR.
  3. After removing the node CR, run vagrant destroy harvester-node-1 to remove the node VM.
  4. Run vagrant up harvester-node-1 to start the node again. It should join the cluster and its status will become Ready (the whole case is sketched as a script below).
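Put together, Case 1 roughly looks like the following, assuming kubectl points at the remaining cluster members and the commands are run from the vagrant-pxe-harvester checkout:

# remove the node CR, destroy the VM, then bring it back up
kubectl delete node harvester-node-1
vagrant destroy -f harvester-node-1
vagrant up harvester-node-1

# watch until harvester-node-1 reports STATUS Ready again
kubectl get nodes -w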

Case 2: Remove the first node

  1. Follow Case 1.
  2. Run kubectl delete node harvester-node-0 to remove the node CR.
  3. After removing the node CR, run vagrant destroy harvester-node-0 to remove the node VM.
  4. Run vagrant ssh pxe_server and update the PXE server content as follows (a non-interactive sketch of the same edits follows after this case):
# copy config-join-1.yaml to config-join-0.yaml
cp /var/www/harvester/config-join-1.yaml /var/www/harvester/config-join-0.yaml

# update os.hostname to harvester-node-0 in config-join-0.yaml
vim /var/www/harvester/config-join-0.yaml

# change owner of config-join-0.yaml
chown www-data /var/www/harvester/config-join-0.yaml

# edit the boot config file named after the node's MAC address (02:00:00:0d:62:e2):
# in the last line, change config-create.yaml to config-join-0.yaml
vim /var/www/harvester/02:00:00:0d:62:e2

# restart nginx
systemctl restart nginx.service
  5. Run vagrant up harvester-node-0 to start the node again. It should join the cluster and its status will become Ready.
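The pxe_server edits above can also be done non-interactively. A hypothetical sketch, equivalent to the manual vim steps, assuming os.hostname is the only hostname: key in the join config and that config-create.yaml appears only in the last line of the MAC file:

# set os.hostname to harvester-node-0 in the copied join config
sed -i 's/^\(\s*hostname:\).*/\1 harvester-node-0/' /var/www/harvester/config-join-0.yaml

# point the node's boot entry at the join config instead of the create config
sed -i 's/config-create\.yaml/config-join-0.yaml/' /var/www/harvester/02:00:00:0d:62:e2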

@tgazik We are still working on it, and it seems easy to reproduce, so if you need the cluster, you can reinstall it. Thanks a lot.

The host has still not joined the cluster after 2 days. I am keeping the cluster installed.

We are working with the Rancher team to debug. If possible, could you keep that environment in case the Rancher team needs more details? Thanks a lot.

Sure, I will keep it, no problem. Thanks for the update.