cluster-api: clusterctl delete everything returns an error intermittently
What steps did you take and what happened:
- Install components via `clusterctl init --infrastructure=aws:v0.5.0`.
- Try to delete all the providers, namespaces, and CRDs, then repeat the process a few times.

Deleted all providers, but an error was returned that left resources behind:
$ clusterctl delete --all --include-namespace --include-crd
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="cluster-api" Version="v0.3.0-rc.2" TargetNamespace="capi-system"
Error: failed to list api resources: unable to retrieve the complete list of server APIs: controlplane.cluster.x-k8s.io/v1alpha3: the server could not find the requested resource
Deleted some providers, but an error was returned that left resources behind:
$ clusterctl delete --all --include-crd --include-namespace
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Error: failed to list api resources: unable to retrieve the complete list of server APIs: bootstrap.cluster.x-k8s.io/v1alpha2: the server could not find the requested resource, bootstrap.cluster.x-k8s.io/v1alpha3: the server could not find the requested resource
Everything deleted successfully!
$ clusterctl delete --all --include-crd --include-namespace
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="cluster-api" Version="v0.3.0-rc.2" TargetNamespace="capi-system"
What did you expect to happen:
Everything to delete successfully.
Anything else you would like to add:
Running the same command a second time cleans everything up.
~Also capi-webhook-system namespace is left around.~
UPDATE: As per the test, capi-webhook-system is intentionally left around.
https://github.com/kubernetes-sigs/cluster-api/blob/2d2c9c86d49edfaeaec70001d66d3feb1211e4e9/cmd/clusterctl/pkg/client/cluster/components_test.go#L236
Environment:
- Cluster-api version: a39618d45eda45400759223a8a73c99e591e2101
- Minikube/KIND version: kind v0.7.0 go1.13.6 darwin/amd64

/kind bug
I’m +1 to retry
This is happening because of a timing issue. We are actively deleting providers, which includes deleting their CRDs. Deleting a CRD removes it from API discovery, but it can take some time between when a CRD is deleted and when it disappears from /apis.

In the example above, we deleted KCP, and then we try to remove another provider (cluster-api). As part of deleting, we use the discovery API client to get the server's list of preferred resources. That code first gets a list of all the API groups, and then iterates through them, making a separate discovery API call for each GroupVersion. It's possible that a CRD's group is present during step one (list groups) and gone by the time the second call happens.
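For illustration, the two-step flow looks roughly like this with client-go's discovery client (a sketch, not clusterctl's actual code; the function name is invented):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// listPreferredResources mirrors the flow above: ServerPreferredResources
// first lists all API groups, then makes a separate discovery call per
// GroupVersion. A group that disappears between the two steps surfaces as
// a *discovery.ErrGroupDiscoveryFailed partial error.
func listPreferredResources(cfg *rest.Config) error {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	resources, err := dc.ServerPreferredResources()
	if discovery.IsGroupDiscoveryFailedError(err) {
		// A stale group was listed in step one but gone by step two,
		// e.g. controlplane.cluster.x-k8s.io/v1alpha3 mid-deletion.
		// The successfully discovered resources are still returned.
		fmt.Printf("partial discovery failure: %v\n", err)
	} else if err != nil {
		return err
	}
	for _, list := range resources {
		fmt.Println(list.GroupVersion)
	}
	return nil
}
```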
The fix here is probably either to retry, or to tolerate `discovery.ErrGroupDiscoveryFailed` errors.

@vincepri I'll re-triage this today.
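For the retry option above, a wrapper could look roughly like this (a sketch; `deleteProvider` is a hypothetical callback standing in for one provider deletion, and the interval and timeout are arbitrary, not clusterctl's real API):

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
)

// deleteWithRetry retries a delete while discovery reports stale groups.
func deleteWithRetry(deleteProvider func() error) error {
	return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		err := deleteProvider()
		if discovery.IsGroupDiscoveryFailedError(err) {
			// A CRD is mid-deletion; discovery will settle, so try again.
			return false, nil
		}
		// Done on success; abort the poll on any other error.
		return err == nil, err
	})
}
```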
This seems to be a race that happens when deleting multiple providers in a row: clusterctl deletes the controlplane.cluster.x-k8s.io/v1alpha3 CRD, but when the next delete operation is executed the type is apparently still around/still in the client discovery cache, and this leads to the error.

Wondering if we need to explicitly wait for CRD deletion to complete before moving on with the next delete; a sketch of that is below.
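A minimal sketch of that explicit wait, assuming an apiextensions clientset is at hand (the helper name, polling interval, and timeout are made up for illustration):

```go
package main

import (
	"context"
	"time"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCRDDeletion polls until the named CRD is fully gone, so the next
// provider delete starts with settled discovery data.
func waitForCRDDeletion(client apiextensionsclient.Interface, name string) error {
	return wait.PollImmediate(500*time.Millisecond, 30*time.Second, func() (bool, error) {
		_, err := client.ApiextensionsV1().CustomResourceDefinitions().
			Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // CRD no longer exists, safe to move on
		}
		return false, err // still present (err == nil) or a real failure
	})
}
```

Calling something like this between each provider deletion would keep the next delete from racing against a CRD that is still being removed from /apis.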