harvester: [BUG] HarvesterMachines resources are not cleaned up

Describe the bug

HarvesterMachine resources are not cleaned up after removing Harvester RKE2 clusters that contained machines which were not fully provisioned.

This spams the Rancher logs with errors and, because of the constant retries, increases the load on the local cluster's control plane over time.

To Reproduce

  1. Provision a Harvester RKE2 cluster in such a way that provisioning of the VMs will not succeed, e.g. by attaching the VMs to a VLAN without DHCP.
  2. Delete the “failed” cluster.
  3. Check the HarvesterMachine resources in the local cluster.
  4. Check the Rancher logs:
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/thorenexttry-pool1-fb8e624e-zmmkt': handler machine-provision-remove: cannot delete machine thorenexttry-pool1-fb8e624e-zmmkt because create job has not finished, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-btjtx': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-5cqzx" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-hl22q': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-zv2sc" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-z6snq': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-rxbhx" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/cs2canal-pool1-a1661b5e-x9dwd': handler machine-provision-remove: cannot delete machine cs2canal-pool1-a1661b5e-x9dwd because create job has not finished, requeuing
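
To get an overview of which resources are stuck and why, the requeue errors above can be grouped by failure reason. The following is a minimal Python sketch, not part of Rancher or Harvester. It assumes each log entry is on a single line (terminal wrapping and box-drawing characters already removed), and the regular expression is derived only from the sample lines in this report. The names `LINE_RE`, `classify` and `group_requeues` are hypothetical helpers introduced for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample lines in this report (an assumption):
# ... [ERROR] error syncing '<namespace>/<name>': handler <handler>: <message>, requeuing
LINE_RE = re.compile(
    r"\[ERROR\] error syncing '(?P<namespace>[^/']+)/(?P<name>[^']+)': "
    r"handler (?P<handler>[^:]+): (?P<message>.*), requeuing\s*$"
)


def classify(message):
    """Map an error message to one of the two reasons seen in this report."""
    if "create job has not finished" in message:
        return "create job unfinished"
    if message.endswith("not found"):
        return "machine not found"
    return "other"


def group_requeues(lines):
    """Group '<namespace>/<name>' keys of requeued resources by failure reason."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a requeue error in the expected shape
        key = f"{match['namespace']}/{match['name']}"
        groups[classify(match["message"])].append(key)
    return dict(groups)


# Two entries copied from the log excerpt in this report.
SAMPLE = """\
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/thorenexttry-pool1-fb8e624e-zmmkt': handler machine-provision-remove: cannot delete machine thorenexttry-pool1-fb8e624e-zmmkt because create job has not finished, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-btjtx': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-5cqzx" not found, requeuing
""".splitlines()

if __name__ == "__main__":
    for reason, resources in group_requeues(SAMPLE).items():
        print(reason, resources)
```

Running this over a full Rancher log dump separates the machines blocked by an unfinished create job from those whose Cluster API machine object is already gone, which are the two distinct failure modes visible in the excerpt.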

HarvesterMachine CR:

cs2canal-pool1-a1661b5e-x9dwd.yaml.txt

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (15 by maintainers)

Most upvoted comments

@lanfon72 There is a PR up for this issue but it has not been merged.

@TachunLin @lanfon72 Can we please re-validate this with Harvester v1.0.3 + Rancher v2.6.7 when it is released, since this is considered more of an upstream issue? Thanks.

It's a Rancher issue, as I have seen it with other node drivers, e.g. vSphere. It probably must be addressed upstream, although in my experience so far the issue is more pronounced with Harvester provisioning.

@yasker The Rancher job and the pod on the Harvester cluster are both left without being cleaned up. Once Rancher sees the unfinished create job, it looks like it ignores the related resources.

I think it depends on which approach makes the most sense. If we are going to change how we handle unfinished create jobs when deleting clusters, that's a Rancher issue, but if we want to change how we handle the failed create job, that would be a node driver issue.

I was finally able to reproduce this with help from @dweomer. I’ll start drilling down to find a cause.