harvester: [BUG] HarvesterMachines resources are not cleaned up

Describe the bug

HarvesterMachine resources are not cleaned up after removing Harvester RKE2 clusters that contained machines which were not fully provisioned.

This spams the Rancher logs with errors and, because of the constant retries, increases the load on the local cluster's control plane over time.

To Reproduce

Provision a Harvester RKE2 cluster in such a way that provisioning of the VMs will not succeed, e.g. by attaching the VMs to a VLAN without DHCP.
Delete the “failed” cluster.
Check the HarvesterMachine resources in the local cluster.
Check the Rancher logs.

2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/thorenexttry-pool1-fb8e624e-zmmkt': handler machine-provision-remove: cannot delete machine thorenexttry-pool1-fb8e624e-zmmkt because create job has not finished, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-btjtx': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-5cqzx" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-hl22q': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-zv2sc" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-z6snq': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-rxbhx" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/cs2canal-pool1-a1661b5e-x9dwd': handler machine-provision-remove: cannot delete machine cs2canal-pool1-a1661b5e-x9dwd because create job has not finished, requeuing
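
To see how many machines are stuck and for which reason, log lines like the ones above can be grouped by failure message. The following is a minimal sketch, not part of Rancher: it assumes each log entry sits on a single line (terminal wrapping undone) and follows the `error syncing '<namespace>/<name>': handler <handler>: <message>` shape seen in this report. The function names and the reason labels (`create-job-unfinished`, `capi-machine-not-found`) are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Matches error lines of the shape shown in this issue (an assumption about
# the log format, based only on the excerpt above):
#   <date> <time> [ERROR] error syncing '<ns>/<name>': handler <handler>: <message>
LINE_RE = re.compile(
    r"\[ERROR\] error syncing '(?P<namespace>[^/']+)/(?P<name>[^']+)': "
    r"handler (?P<handler>[\w-]+): (?P<message>.*)$"
)


def classify(message: str) -> str:
    """Bucket a handler message into a coarse, illustrative failure reason."""
    if "create job has not finished" in message:
        return "create-job-unfinished"
    if "not found" in message:
        return "capi-machine-not-found"
    return "other"


def group_stuck_machines(log_text: str) -> Dict[str, Set[str]]:
    """Return {reason: {"namespace/name", ...}} for machine-provision-remove errors."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        # Skip lines that are not errors from the remove handler.
        if match is None or match.group("handler") != "machine-provision-remove":
            continue
        key = "{}/{}".format(match.group("namespace"), match.group("name"))
        groups[classify(match.group("message"))].add(key)
    return dict(groups)
```

Feeding a saved copy of the Rancher log to `group_stuck_machines` yields one set of machine names per reason, which makes it easier to tell machines blocked on an unfinished create job from those whose Cluster API machine is already gone.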

HarvesterMachine CR:

cs2canal-pool1-a1661b5e-x9dwd.yaml.txt

About this issue

State: closed
Created 2 years ago
Comments: 17 (15 by maintainers)

Most upvoted comments

@lanfon72 There is a PR up for this issue but it has not been merged.

joshmeranda on Aug 15, 2022

@TachunLin @lanfon72 Can we please re-validate this with Harvester v1.0.3 + Rancher v2.6.7 when it is released, since this is considered more of an upstream issue? Thanks.

guangbochen on Aug 12, 2022

It's a Rancher issue, as I have seen it with other node drivers, e.g. vSphere. It probably must be addressed upstream, although based on my experience so far the issue is more notorious with Harvester provisioning.

janeczku on Mar 11, 2022

@yasker The Rancher job and the pod on the Harvester cluster are both left without being cleaned up. Once Rancher sees the unfinished create job, it looks like it ignores the related resources.

I think it depends on which approach makes the most sense. If we are going to change how we handle unfinished create jobs when deleting clusters, that's a Rancher issue, but if we want to change how we handle the failed create job, that would be a node driver issue.

joshmeranda on Mar 10, 2022

I was finally able to reproduce this with help from @dweomer. I’ll start drilling to find a cause for this.

joshmeranda on Mar 8, 2022