cluster-api: clusterctl delete everything returns an error intermittently
What steps did you take and what happened:
- Install components via `clusterctl init --infrastructure=aws:v0.5.0`.
- Try to delete all the providers, namespaces, and CRDs, then repeat the process a few times.

Deleted all providers, but an error was returned that left resources behind:
$ clusterctl delete --all --include-namespace --include-crd
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="cluster-api" Version="v0.3.0-rc.2" TargetNamespace="capi-system"
Error: failed to list api resources: unable to retrieve the complete list of server APIs: controlplane.cluster.x-k8s.io/v1alpha3: the server could not find the requested resource
Deleted some providers, but an error was returned that left resources behind:
$ clusterctl delete --all --include-crd --include-namespace
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Error: failed to list api resources: unable to retrieve the complete list of server APIs: bootstrap.cluster.x-k8s.io/v1alpha2: the server could not find the requested resource, bootstrap.cluster.x-k8s.io/v1alpha3: the server could not find the requested resource
Everything deleted successfully!
$ clusterctl delete --all --include-crd --include-namespace
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="cluster-api" Version="v0.3.0-rc.2" TargetNamespace="capi-system"
What did you expect to happen:
Everything to delete successfully.
Anything else you would like to add:
Running the same command a second time cleans everything up.
~Also capi-webhook-system namespace is left around.~
UPDATE: As per the test, capi-webhook-system is intentionally left around.
https://github.com/kubernetes-sigs/cluster-api/blob/2d2c9c86d49edfaeaec70001d66d3feb1211e4e9/cmd/clusterctl/pkg/client/cluster/components_test.go#L236
Environment:
- Cluster-api version: a39618d45eda45400759223a8a73c99e591e2101
- Minikube/KIND version: kind v0.7.0 go1.13.6 darwin/amd64

/kind bug
I’m +1 to retry
This is happening because of a timing issue. We are actively deleting providers, which includes deleting their CRDs. Deleting a CRD removes it from API discovery, but it can take some time between when a CRD is deleted and when it disappears from /apis.

In the example above, we deleted KCP, and then we try to remove another provider (cluster-api). As part of deleting, we use the discovery API client to get the server's list of preferred resources. That code first gets a list of all the API groups, and then iterates through them, making a separate discovery API call for each GroupVersion. It's possible that a CRD's group is present during step one (list groups) and gone by the time the second call happens.
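For illustration, the two-step flow looks roughly like this with client-go's discovery client (a sketch, not clusterctl's actual code; the function name is invented):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// listPreferredResources mirrors the flow above: ServerPreferredResources
// first lists all API groups, then makes a separate discovery call per
// GroupVersion. A group that disappears between the two steps surfaces as
// a *discovery.ErrGroupDiscoveryFailed partial error.
func listPreferredResources(cfg *rest.Config) error {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	resources, err := dc.ServerPreferredResources()
	if discovery.IsGroupDiscoveryFailedError(err) {
		// A stale group was listed in step one but gone by step two,
		// e.g. controlplane.cluster.x-k8s.io/v1alpha3 mid-deletion.
		// The successfully discovered resources are still returned.
		fmt.Printf("partial discovery failure: %v\n", err)
	} else if err != nil {
		return err
	}
	for _, list := range resources {
		fmt.Println(list.GroupVersion)
	}
	return nil
}
```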
The fix here is probably either to retry, or to tolerate `discovery.ErrGroupDiscoveryFailed` errors.

@vincepri I'll re-triage this today.
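For the retry option above, a wrapper could look roughly like this (a sketch; `deleteProvider` is a hypothetical callback standing in for one provider deletion, and the interval and timeout are arbitrary, not clusterctl's real API):

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
)

// deleteWithRetry retries a delete while discovery reports stale groups.
func deleteWithRetry(deleteProvider func() error) error {
	return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		err := deleteProvider()
		if discovery.IsGroupDiscoveryFailedError(err) {
			// A CRD is mid-deletion; discovery will settle, so try again.
			return false, nil
		}
		// Done on success; abort the poll on any other error.
		return err == nil, err
	})
}
```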
This seems to be a race that happens when deleting multiple providers in a row: clusterctl deletes the controlplane.cluster.x-k8s.io/v1alpha3 CRD, but when the next delete operation is executed the type is apparently still around/still in the client discovery cache, and this leads to the error.

Wondering if we need to explicitly wait for CRD deletion to complete before moving on with the next delete; a sketch of that is below.
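A minimal sketch of that explicit wait, assuming an apiextensions clientset is at hand (the helper name, polling interval, and timeout are made up for illustration):

```go
package main

import (
	"context"
	"time"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCRDDeletion polls until the named CRD is fully gone, so the next
// provider delete starts with settled discovery data.
func waitForCRDDeletion(client apiextensionsclient.Interface, name string) error {
	return wait.PollImmediate(500*time.Millisecond, 30*time.Second, func() (bool, error) {
		_, err := client.ApiextensionsV1().CustomResourceDefinitions().
			Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // CRD no longer exists, safe to move on
		}
		return false, err // still present (err == nil) or a real failure
	})
}
```

Calling something like this between each provider deletion would keep the next delete from racing against a CRD that is still being removed from /apis.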