crossplane: ProviderConfigs should not be deleted until all resources referring to them are deleted

What problem are you facing?

When I delete an infra-stack, all of its custom resources get deletion requests, including the Provider ones. Since there is no finalizer on Provider resources, they are deleted immediately. This leaves resources hanging with a deletion timestamp, because their controllers can no longer make deletion calls once the Provider resource that supplies their credentials is gone.

How could Crossplane help solve your problem?

Crossplane could block the deletion of a Provider resource until all resources that use that Provider are deleted from the cluster.

One mechanism could be that every resource that is successfully created adds its identification info to the Provider's finalizers array, and removes its entry when the resource is deleted.
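Roughly, that first mechanism could look like the sketch below, assuming a controller-runtime client; the `usageFinalizer` scheme and the function names are hypothetical, not existing Crossplane API:

```go
// Finalizer-based tracking: each managed resource adds a unique finalizer
// to the Provider it uses when it connects, and removes it again once the
// managed resource itself is deleted. All identifiers are illustrative.
package tracking

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// usageFinalizer derives a finalizer that identifies the managed resource
// holding the Provider in use, e.g. "in-use.example.org/<uid>".
func usageFinalizer(mg client.Object) string {
	return fmt.Sprintf("in-use.example.org/%s", mg.GetUID())
}

// TrackUsage adds the managed resource's finalizer to the Provider, so
// the API server blocks the Provider's deletion while it remains.
func TrackUsage(ctx context.Context, c client.Client, provider, mg client.Object) error {
	if controllerutil.AddFinalizer(provider, usageFinalizer(mg)) {
		return c.Update(ctx, provider)
	}
	return nil
}

// UntrackUsage removes the finalizer once the managed resource is gone.
func UntrackUsage(ctx context.Context, c client.Client, provider, mg client.Object) error {
	if controllerutil.RemoveFinalizer(provider, usageFinalizer(mg)) {
		return c.Update(ctx, provider)
	}
	return nil
}
```

One caveat with this shape: a Provider used by many resources accumulates many finalizers, and every add/remove is a write to the same object, so heavily shared Providers would see write contention.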

The other alternative would be a Provider controller that scans all the resources that refer to the Provider instance and blocks its deletion until it finds none. We could make use of labels for easy listing instead of parsing providerRef.
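That label-based scan might look something like this - a sketch only, with a made-up label key and the assumption that every managed resource is labelled with the name of the Provider it uses:

```go
// Label-based scan: the Provider controller blocks deletion while any
// labelled resource still refers to the Provider. The label key and the
// set of kinds to scan are illustrative, not existing Crossplane API.
package scan

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// providerLabel is a hypothetical well-known label naming the Provider.
const providerLabel = "example.org/provider"

// InUse reports whether any resource of the given list kinds (e.g. Kind
// "BucketList") still carries a label naming this Provider.
func InUse(ctx context.Context, c client.Client, provider string, listKinds []schema.GroupVersionKind) (bool, error) {
	for _, gvk := range listKinds {
		l := &unstructured.UnstructuredList{}
		l.SetGroupVersionKind(gvk)
		if err := c.List(ctx, l, client.MatchingLabels{providerLabel: provider}); err != nil {
			return false, err
		}
		if len(l.Items) > 0 {
			return true, nil
		}
	}
	return false, nil
}
```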

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I believe this is a close call complexity-wise, and subjective from a UX perspective - especially if we can agree that it’s worth making the consumers of a ProviderConfig easily discoverable even when it is not being deleted.

If we go with the ProviderConfigUsage approach:

  • In the happy path - the vast majority of cases - we make one API call to create a ProviderConfigUsage when a managed resource first connects. Subsequent connects may read from cache and avoid a no-op update (see the sketch after this list).
  • The ProviderConfig controller can determine whether it should remove its finalizer and inform users whether it’s in use by listing one type - ProviderConfigUsage - from cache. The list can be filtered by a well known label.
  • The ProviderConfig controller must watch two types - ProviderConfig and ProviderConfigUsage - in order to react immediately when a ProviderConfig may be safely deleted.
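A minimal sketch of that happy path, using unstructured objects so it stays self-contained; the group, kind, field paths, and label key are all placeholders, not the real ProviderConfigUsage schema:

```go
// Happy path: on first connect, create one small ProviderConfigUsage that
// records the link between a managed resource and its ProviderConfig.
// Group/version/kind, field paths, and the label key are illustrative.
package usage

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// pcLabel is the well-known label the ProviderConfig controller filters on.
const pcLabel = "example.org/provider-config"

// TrackUsage records that managed resource mg uses the named ProviderConfig.
// The usage's name is derived from mg's UID, so a repeat connect hits
// AlreadyExists and becomes a cheap no-op.
func TrackUsage(ctx context.Context, c client.Client, pcName string, mg client.Object) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "example.org", Version: "v1alpha1", Kind: "ProviderConfigUsage",
	})
	u.SetName(fmt.Sprintf("usage-%s", mg.GetUID()))
	u.SetLabels(map[string]string{pcLabel: pcName})
	if err := unstructured.SetNestedField(u.Object, pcName, "spec", "providerConfigRef", "name"); err != nil {
		return err
	}
	if err := unstructured.SetNestedField(u.Object, mg.GetName(), "spec", "resourceRef", "name"); err != nil {
		return err
	}

	err := c.Create(ctx, u)
	if apierrors.IsAlreadyExists(err) {
		return nil // already tracked; nothing to do
	}
	return err
}
```

The ProviderConfig controller’s in-use check then reduces to one filtered list of ProviderConfigUsages by that label, served from cache.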

If we invert the responsibility of discovering usages such that the ProviderConfig controller handles it:

  • The ProviderConfig controller must be aware of the N types that may use a ProviderConfig. We can probably register these types when instantiating the controller to avoid having to use API discovery or similar.
  • The ProviderConfig controller must make N list calls from cache to list N heterogeneous types, and parse the spec.providerConfigRef of each resource of each type to determine whether it uses the provider (sketched after this list).
  • The ProviderConfig controller must watch 1+N types - ProviderConfig and the N types that may use it - in order to react immediately when a ProviderConfig may be safely deleted.
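For contrast, the inverted approach’s in-use check might look like this sketch, listing each registered kind from cache and parsing spec.providerConfigRef; the kinds and the field path are assumptions:

```go
// Inverted approach: list all N registered kinds and parse each item's
// spec.providerConfigRef to decide whether the config is still in use.
// The kinds and the field path are illustrative.
package scan

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// StillInUse returns true while any resource of the registered list kinds
// (e.g. Kind "BucketList") references the named ProviderConfig.
func StillInUse(ctx context.Context, c client.Client, pcName string, listKinds []schema.GroupVersionKind) (bool, error) {
	for _, gvk := range listKinds {
		l := &unstructured.UnstructuredList{}
		l.SetGroupVersionKind(gvk)
		if err := c.List(ctx, l); err != nil {
			return false, err
		}
		for _, item := range l.Items {
			ref, found, err := unstructured.NestedString(item.Object, "spec", "providerConfigRef", "name")
			if err == nil && found && ref == pcName {
				return true, nil
			}
		}
	}
	return false, nil
}
```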

So complexity-wise it seems like the ProviderConfigUsage approach trades increased storage (many instances of a small resource) for reduced compute (fewer types to list and parse, fewer watches).

UX-wise it gets pretty subjective. The ProviderConfigUsage approach is arguably the more idiomatic - it reflects the canonical deployment-to-pod relationship, in which the user determines which pods are part of a Deployment by running kubectl get pod -l some=labelselector, not by looking at an array of pod references in the Deployment’s status. There are also parallels with the RBAC RoleBinding type, which binds a role to one or more subjects.

If the ProviderConfig controller were responsible for discovering which resources were using it, they would presumably be displayed as an array of object references in the status of the ProviderConfig. This is the more common approach in Crossplane controllers (e.g. the relationship between a composite resource and its composed resources, or a package revision and the resources it installs). One variation here is that the composite resource and package revision controllers are responsible for the lifecycle of the resources they track via object references - they create them, rather than discover them after the fact.

My assessment is that neither approach is objectively better from a performance standpoint. I still feel that the ProviderConfigUsage approach is slightly better from a UX and implementation-complexity standpoint.

Ah - I like the potential for symmetry with storage protection. It would be good for providers to similarly transition to a specific status to indicate they’re being removed, perhaps Status: Terminating would work for them too.
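As a sketch of what that could look like - with a made-up status field, not the actual ProviderConfig schema - the controller could report the phase while the finalizer blocks deletion:

```go
// Storage-protection-style UX: while a ProviderConfig has a deletion
// timestamp but is still in use, keep the finalizer and surface a
// Terminating phase. The status field name is illustrative.
package status

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ReportTerminating mirrors how a PVC shows Terminating while the
// kubernetes.io/pvc-protection finalizer blocks its deletion.
func ReportTerminating(ctx context.Context, c client.Client, pc *unstructured.Unstructured, inUse bool) error {
	if pc.GetDeletionTimestamp() == nil || !inUse {
		return nil // not being deleted, or safe to let deletion proceed
	}
	if err := unstructured.SetNestedField(pc.Object, "Terminating", "status", "phase"); err != nil {
		return err
	}
	return c.Status().Update(ctx, pc)
}
```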