harvester: [BUG] HarvesterMachine resources are not cleaned up
Describe the bug
HarvesterMachine resources are not cleaned up after removing Harvester RKE2 clusters that contained machines which were not fully provisioned.
This causes Rancher logs to be spammed with errors and increases the load on local cluster control plane over time due to constant retries.
To Reproduce
- Provision a Harvester RKE2 cluster in such a way that provisioning of the VMs will not succeed, e.g. by attaching the VMs to a VLAN without DHCP.
- Delete the "failed" cluster.
- Check HarvesterMachine resources in the local cluster.
- Check the Rancher logs.
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/thorenexttry-pool1-fb8e624e-zmmkt': handler machine-provision-remove: cannot delete machine thorenexttry-pool1-fb8e624e-zmmkt because create job has not finished, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-btjtx': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-5cqzx" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-hl22q': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-zv2sc" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/ubuntucluster-pool2-dfa5c3cc-z6snq': handler machine-provision-remove: machines.cluster.x-k8s.io "ubuntucluster-pool2-dfc84888-rxbhx" not found, requeuing
2022/01/05 12:17:01 [ERROR] error syncing 'fleet-default/cs2canal-pool1-a1661b5e-x9dwd': handler machine-provision-remove: cannot delete machine cs2canal-pool1-a1661b5e-x9dwd because create job has not finished, requeuing
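The checks in the reproduce steps can be sketched as a few kubectl commands run against the Rancher local cluster. This is a sketch, not part of the original report: the `fleet-default` namespace matches the log output above, and the CRD group `rke-machine.cattle.io` is assumed from Rancher v2.6 provisioning (verify with `kubectl api-resources | grep -i harvestermachine`).

```shell
# List leftover HarvesterMachine CRs in the local cluster
# (namespace taken from the error logs above):
kubectl get harvestermachines.rke-machine.cattle.io -n fleet-default

# Inspect one of them, including the finalizers that block deletion:
kubectl get harvestermachines.rke-machine.cattle.io -n fleet-default -o yaml

# Tail the Rancher logs for the machine-provision-remove errors:
kubectl logs -n cattle-system deployment/rancher --tail=100 \
  | grep machine-provision-remove
```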
HarvesterMachine CR:
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (15 by maintainers)
@lanfon72 There is a PR up for this issue but it has not been merged.
@TachunLin @lanfon72 can we please re-validate this with Harvester v1.0.3 + Rancher v2.6.7 when it is released, since this is considered more of an upstream issue? Thanks.
It's a Rancher issue, as I have seen it with other node drivers too, e.g. vSphere, so it probably needs to be addressed upstream. That said, in my experience so far the issue is more pronounced with Harvester provisioning.
@yasker The Rancher job and the pod on the Harvester cluster are both left behind without being cleaned up. Once Rancher sees the unfinished create job, it appears to ignore the related resources.
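Until a fix lands, the stuck resources described above can usually be removed by hand. The sketch below is an assumption of the editor, not a procedure from this thread: the machine name is an example taken from the logs, and clearing finalizers bypasses whatever cleanup Rancher would normally run, so use it with care.

```shell
# Hedged workaround sketch: force-delete a stuck HarvesterMachine by
# clearing its finalizers. MACHINE is a placeholder; substitute the
# object name shown by `kubectl get harvestermachines...`.
MACHINE=thorenexttry-pool1-fb8e624e-zmmkt   # example name from the logs above

# Remove the finalizers that keep the object in a deleting state:
kubectl patch harvestermachines.rke-machine.cattle.io "$MACHINE" \
  -n fleet-default --type=merge -p '{"metadata":{"finalizers":null}}'

# Then delete the object if it still exists:
kubectl delete harvestermachines.rke-machine.cattle.io "$MACHINE" \
  -n fleet-default --ignore-not-found
```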
I think it depends on what approach makes the most sense. If we are going to change how we handle unfinished create jobs when deleting clusters, that's a Rancher issue; but if we want to change how we handle the failed create job itself, that would be a node driver issue.
Finally was able to reproduce this with help from @dweomer. I'll start drilling down to find the cause.