cluster-api: Unable to delete a cluster when infrastructureRef is defined incorrectly

What steps did you take and what happened: Creating a cluster definition with an incorrect infrastructureRef results in a Cluster resource that can’t be deleted; the namespace it lives in exhibits the same behaviour.

Example Cluster:

apiVersion: cluster.x-k8s.io/v1alpha2
kind: Cluster
metadata:
  name: blank
  namespace: blank
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["10.96.0.0/12"]
    pods:
      cidrBlocks: ["192.168.0.0/16"]
    serviceDomain: "cluster.local"
    apiServerPort: 6433
  infrastructureRef:
    apiVersion: blank.cluster.k8s.io/v1alpha1
    kind: blankCluster
    name: blankTest
    namespace: blank

Applying this will create a new cluster resource in the namespace blank as expected:

k create namespace blank; k create -f ./blank.yaml

What did you expect to happen:

That deleting this erroneous Cluster resource, or its namespace, would clean it up from the cluster. Instead, at this point the delete hangs indefinitely (even with force):

k get cluster -n blank
NAME    PHASE
blank   provisioning
k delete cluster blank -n blank
cluster.cluster.x-k8s.io "blank" deleted
<hang>

Anything else you would like to add:

As pointed out by @detiber, editing the resource to remove the infrastructureRef allows the pending deletion to complete as expected:

k edit cluster blank -n blank
cluster.cluster.x-k8s.io/blank edited
k delete cluster blank -n blank
Error from server (NotFound): clusters.cluster.x-k8s.io "blank" not found

Environment:

  • Cluster-api version: 0.2.5
  • Minikube/KIND version: N/A (vanilla deployment on VMs)
  • Kubernetes version: (use kubectl version): 1.14.1
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

/kind bug

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 24 (23 by maintainers)

Most upvoted comments

@prankul88 I do not believe we’ll be able to implement what you’ve described. When you issue kubectl delete, it has a --wait flag that defaults to true. If a resource has finalizers, the apiserver sets the deletion timestamp on the resource, and the --wait=true flag causes kubectl to wait for the resource’s finalizers to be removed, and for the resource ultimately to be removed from etcd. If there is still a finalizer on the resource, which is what happens in case 1, there is nothing kubectl can do in its current form to give you any additional information as to what is going on. If you ctrl-c the kubectl delete call, the resource still has its deletion timestamp set, and the apiserver is still waiting for all the finalizers to be removed. This is the standard behavior for all Kubernetes resources, both built-in types and custom resources, and there is no way to alter the behavior of either the apiserver or kubectl without making changes to Kubernetes.
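The deletion semantics described above can be modelled in a few lines of Go. This is a toy in-memory sketch, not real apiserver code: it only illustrates why `kubectl delete --wait=true` blocks — delete merely stamps `deletionTimestamp`, and the object is removed from the store only once its finalizer list is empty.

```go
package main

import "fmt"

// object is a toy stand-in for a Kubernetes resource: deletion is a
// two-step process gated on the finalizers list being emptied.
type object struct {
	deletionTimestamp string
	finalizers        []string
}

type store map[string]*object

// delete does not remove the object; it only sets deletionTimestamp.
func (s store) delete(name string) {
	if o, ok := s[name]; ok {
		o.deletionTimestamp = "2019-11-06T00:00:00Z"
		s.maybeRemove(name)
	}
}

// removeFinalizer mimics a controller finishing its cleanup work.
func (s store) removeFinalizer(name, f string) {
	o := s[name]
	kept := o.finalizers[:0]
	for _, x := range o.finalizers {
		if x != f {
			kept = append(kept, x)
		}
	}
	o.finalizers = kept
	s.maybeRemove(name)
}

// maybeRemove is the only path that actually deletes from the store.
func (s store) maybeRemove(name string) {
	if o := s[name]; o.deletionTimestamp != "" && len(o.finalizers) == 0 {
		delete(s, name)
	}
}

func main() {
	s := store{"blank": {finalizers: []string{"cluster.cluster.x-k8s.io"}}}
	s.delete("blank")
	_, exists := s["blank"]
	fmt.Println("after delete, still present:", exists) // true: finalizer blocks removal
	s.removeFinalizer("blank", "cluster.cluster.x-k8s.io")
	_, exists = s["blank"]
	fmt.Println("after finalizer removed, present:", exists) // false
}
```

In the bug above, the finalizer is never removed because the controller keeps erroring on the dangling infrastructureRef, so the object stays in this "still present" state forever.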

I think it may be sufficient to modify ClusterReconciler.reconcileDelete() to have it skip over 404 not found errors here:

https://github.com/kubernetes-sigs/cluster-api/blob/065eb539766dede097e206a7b549b5902d15f14a/controllers/cluster_controller.go#L256
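A minimal sketch of that change, under stated assumptions: the real `reconcileDelete` would check `apierrors.IsNotFound` from k8s.io/apimachinery against the error returned when fetching the referenced infrastructure object; here a plain sentinel error and a hypothetical `getInfraCluster` helper stand in so the sketch is self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the apiserver's 404; the real controller
// would use apierrors.IsNotFound(err) from k8s.io/apimachinery.
var errNotFound = errors.New("not found")

// getInfraCluster (hypothetical) simulates fetching the external object
// named by infrastructureRef; a dangling ref yields a 404.
func getInfraCluster(name string) (string, error) {
	if name == "blankTest" {
		return "", fmt.Errorf("blankclusters.blank.cluster.k8s.io %q: %w", name, errNotFound)
	}
	return name, nil
}

// reconcileDelete sketches the proposed fix: treat "not found" as
// "already gone" so the finalizer can be dropped and deletion completes.
func reconcileDelete(infraRefName string) error {
	obj, err := getInfraCluster(infraRefName)
	if err != nil {
		if errors.Is(err, errNotFound) {
			// The referenced object never existed or is already deleted:
			// there is nothing to wait for, allow the Cluster to finalize.
			return nil
		}
		return err
	}
	// Otherwise wait for obj to be deleted before removing the finalizer.
	_ = obj
	return errors.New("infrastructure object still exists; requeue")
}

func main() {
	fmt.Println(reconcileDelete("blankTest")) // dangling ref: nil, deletion proceeds
	fmt.Println(reconcileDelete("real"))      // real object: still waiting
}
```

With the dangling `blankTest` reference from the report, the 404 is swallowed and deletion can proceed instead of requeuing forever.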

Hello,

I raised the issue mainly because this is confusing behaviour for end-users who don’t know where to look, or even that they will need to start manually editing various object specs. It’s largely a UX issue: the end-user has no way of being notified that the delete operation is stalled on a misaligned reference.

@wfernandes Yes I am working on it.

/assign
/lifecycle active

@thebsdbox I am facing the same issue. Will work on it.