harvester: [BUG] reinstall 1st node

Describe the bug I tried to reinstall the 1st node (installed with mode: create), and after reinstallation the node did not join the existing cluster.

To Reproduce Steps to reproduce the behavior:

  1. Install a 3-node cluster.
  2. Turn off the 1st node (installed with mode: create), remove it from the cluster via GUI - Hosts - Delete host, wipe all its disks (remove the partitions with fdisk and run mkfs.ext4 /dev/sda; see the sketch after these steps), and reinstall the node (install mode: join) with a new hostname so that it joins the existing cluster of the remaining 2 nodes.
  3. The reinstalled node is not able to join the existing cluster (the Rancher bootstrap finishes successfully, but the node does not join).
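For reference, a rough sketch of the wipe in step 2 (destructive; it assumes /dev/sda is the install disk as in this report, and uses wipefs as a non-interactive stand-in for deleting the partitions with fdisk):

# remove all partition-table and filesystem signatures, then create a fresh filesystem
# (alternatively, delete the partitions interactively with: fdisk /dev/sda)
wipefs -a /dev/sda
mkfs.ext4 -F /dev/sda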

Expected behavior The reinstalled node should join the existing cluster with the remaining 2 nodes.

A support bundle can be provided if needed.

Environment

  • Harvester ISO version: v1.0.3
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): tried on 2 different environments with different HW

Additional context When I tried to reinstall the 2nd or 3rd node (installation mode: join), they joined the cluster successfully after reinstallation. The only problem is with the 1st node.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 22 (11 by maintainers)

Most upvoted comments

Test Plan 2: Reinstall management node and agent node in a new Harvester cluster

Result

Verified after upgrading from v1.0.3 to v1.1.0-rc3: we can rejoin the management node and the agent node correctly.

  1. Successfully re-joined the management node after the upgrade.

  2. Successfully re-joined the agent node after the upgrade.

Test Information

  • Test Environment: 4-node Harvester cluster on a local KVM machine
  • Harvester version: v1.1.0-rc3

Verify Steps

  1. Create a 4-node v1.0.3 cluster.
  2. Upgrade to v1.1.0-rc3.
  3. Remove the agent node and the first management node:
    • Remove the agent node (node 4).
    • Remove the management node (node 1).
  4. After the node is removed, provision a new node.
  5. Check that the node can join the cluster and can be promoted as a management node (node 1); see the sketch after these steps.
  6. After we have 3 management nodes, provision a new node and check that it can join the cluster (node 4).
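A minimal sketch of the checks in steps 5-6, assuming kubectl is run against the remaining management nodes and that RKE2's standard control-plane role label is in use:

# the re-provisioned node should reach STATUS Ready
kubectl get nodes -o wide

# after promotion, node 1 should carry the control-plane role
kubectl get nodes -l node-role.kubernetes.io/control-plane=true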

Test Plan 1: Reinstall management node and agent node in an upgraded cluster

Result

Verified after upgrading from v1.0.3 to the v1.1.0 master release: we can rejoin the management node and the agent node correctly.

  1. Successfully re-joined the management node after the upgrade.

  2. Successfully re-joined the agent node after the upgrade.

Test Information

  • Test Environment: 4-node Harvester cluster on a local KVM machine
  • Harvester version: master-0a9538a1-head (10/14)

Verify Steps

  1. Create a 4-node v1.0.3 cluster.
  2. Upgrade to the master branch:
    • Check the spec content in provisioning.cattle.io/v1/clusters -> fleet-local.
    • Check the iface content in helm.cattle.io/v1/helmchartconfigs -> rke2-canal (a kubectl sketch of both checks follows after these steps):

spec:
  valuesContent: |-
    flannel:
      iface: ""

  3. Remove the agent node and 1 management node:
    • Remove the agent node (node 4).
    • Remove the management node (node 3).
  4. After the node is removed, provision a new node.
  5. Check that the node can join the cluster and can be promoted as a management node.
  6. After we have 3 management nodes, provision a new node and check that it can join the cluster.
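A minimal sketch of the two checks in step 2, assuming the default Rancher/Harvester object names and namespaces (the provisioning cluster object named local in the fleet-local namespace, and the rke2-canal HelmChartConfig in kube-system):

# inspect the provisioning cluster spec
kubectl get clusters.provisioning.cattle.io local -n fleet-local -o yaml

# inspect the rke2-canal HelmChartConfig and confirm flannel.iface is empty
kubectl get helmchartconfigs.helm.cattle.io rke2-canal -n kube-system -o yaml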

@FrankYang0529 Great!

Please help check if #2470 is also caused by this bug, thanks.

This issue can be fixed by https://github.com/harvester/harvester-installer/pull/344 and https://github.com/rancher/rancherd/pull/25. The following are my test steps with vagrant-pxe-harvester.

Case 1: Remove the second node

  1. Set up a 3-node cluster (harvester-node-0, harvester-node-1, harvester-node-2).
  2. After the 3-node cluster is installed, run kubectl delete node harvester-node-1 to remove the node CR.
  3. After removing the node CR, run vagrant destroy harvester-node-1 to remove the node VM.
  4. Run vagrant up harvester-node-1 to start the node again. It should join the cluster and its status will become Ready (the whole case is sketched as a script below).
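Put together, Case 1 roughly looks like the following, assuming kubectl points at the remaining cluster members and the commands are run from the vagrant-pxe-harvester checkout:

# remove the node CR, destroy the VM, then bring it back up
kubectl delete node harvester-node-1
vagrant destroy -f harvester-node-1
vagrant up harvester-node-1

# watch until harvester-node-1 reports STATUS Ready again
kubectl get nodes -w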

Case 2: Remove the first node

  1. Follow Case 1.
  2. Run kubectl delete node harvester-node-0 to remove the node CR.
  3. After removing the node CR, run vagrant destroy harvester-node-0 to remove the node VM.
  4. Run vagrant ssh pxe_server and update the PXE server content as follows (a non-interactive sketch of the same edits follows after this case):
# copy config-join-1.yaml to config-join-0.yaml
cp /var/www/harvester/config-join-1.yaml /var/www/harvester/config-join-0.yaml

# update os.hostname to harvester-node-0 in config-join-0.yaml
vim /var/www/harvester/config-join-0.yaml

# change owner of config-join-0.yaml
chown www-data /var/www/harvester/config-join-0.yaml

# edit the boot config file named after the node's MAC address (02:00:00:0d:62:e2):
# in the last line, change config-create.yaml to config-join-0.yaml
vim /var/www/harvester/02:00:00:0d:62:e2

# restart nginx
systemctl restart nginx.service
  5. Run vagrant up harvester-node-0 to start the node again. It should join the cluster and its status will become Ready.
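The pxe_server edits above can also be done non-interactively. A hypothetical sketch, equivalent to the manual vim steps, assuming os.hostname is the only hostname: key in the join config and that config-create.yaml appears only in the last line of the MAC file:

# set os.hostname to harvester-node-0 in the copied join config
sed -i 's/^\(\s*hostname:\).*/\1 harvester-node-0/' /var/www/harvester/config-join-0.yaml

# point the node's boot entry at the join config instead of the create config
sed -i 's/config-create\.yaml/config-join-0.yaml/' /var/www/harvester/02:00:00:0d:62:e2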

@tgazik We are still working on it, and it seems easy to reproduce, so if you need the cluster, you can reinstall it. Thanks a lot.

The host has still not joined the cluster after 2 days. I am keeping the cluster installed.

We are working with the Rancher team to debug. If possible, could you keep that environment in case the Rancher team needs more details? Thanks a lot.

Sure, I will keep it, no problem. Thanks for the update.