crossplane: Consider allowing new package revisions to own objects even if they cannot control
What problem are you facing?
This feature is based on a scenario a community member recently shared with me. This write-up will hopefully illustrate some details of how the package manager works, as well as how to avoid any “catastrophe” situations.
The issue in this particular example was that the upgrade of provider-aws from v0.16.0 to v0.17.0 required manual intervention due to the dropping of support for an alpha CRD version as described in https://github.com/crossplane/crossplane/issues/2165. The inability for the v0.17.0 ProviderRevision to gain control of CRDs meant that none of its CRDs were installed / updated, so the previous ProviderRevision (v0.16.0) was the only owner/controller of the provider-aws CRDs that were present in the cluster.
Note: if a new revision cannot control all of the CRDs it reconciles in the cluster, we don’t let it control any. As you will see in this example, one option we could pursue is letting a new revision go ahead and own the CRDs even if it can’t establish control.
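The all-or-nothing rule in the note above can be sketched roughly as follows. This is a simplified model, not Crossplane's actual code: the Crd class, the establish_all_or_nothing function and the can_control predicate are hypothetical names introduced for illustration, and the reason a CRD cannot be controlled is left abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Crd:
    name: str
    # Maps owner UID -> True if that owner is the controller.
    owners: dict = field(default_factory=dict)


def establish_all_or_nothing(crds, revision_uid, can_control):
    """Take control of every CRD, or leave all of them untouched.

    can_control is a caller-supplied predicate (an assumption of this sketch)
    that says whether the revision is able to control a given CRD.
    """
    if not all(can_control(crd) for crd in crds):
        return False  # one blocked CRD means no references are written at all
    for crd in crds:
        # Demote any previous controller, then record the new one.
        crd.owners = {uid: False for uid in crd.owners}
        crd.owners[revision_uid] = True
    return True


crds = [Crd("a", {"rev-old": False}), Crd("b", {"rev-old": False})]
established = establish_all_or_nothing(crds, "rev-new", lambda crd: crd.name != "b")
print(established, [crd.owners for crd in crds])
```

In this model a single blocked CRD leaves the old revision as the only owner of every CRD, which is the state the rest of the scenario starts from.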
The community member decided at this point to try deleting the old ProviderRevision (v0.16.0) to fix the problem. Because the installed CRDs had only one owner reference (the v0.16.0 ProviderRevision), and that object was now deleted, Kubernetes marked all of the CRDs for garbage collection. CRDs with instances (CRs) existing in the cluster were not immediately deleted because the instances block owner deletion, and the instances themselves were blocked on deletion by the presence of the managed reconciler finalizer. The finalizer was not being removed because the old ProviderRevision controller had been stopped when it transitioned to Inactive, and the new ProviderRevision controller had not been started because it could not gain control of its objects. So the instances had a deletion timestamp set, but could not be cleaned up and were not being deleted externally.
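The two mechanisms at play here, garbage collection of objects whose owners no longer exist and finalizers holding back removal, can be modelled in a few lines. This is a toy sketch under stated assumptions: Resource, collect and is_gone are hypothetical names, deletion_requested stands in for the deletion timestamp, and the finalizer string is a placeholder rather than the real finalizer name.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    owner_uids: list = field(default_factory=list)
    finalizers: list = field(default_factory=list)
    deletion_requested: bool = False  # stands in for the deletion timestamp


def collect(resource, live_uids):
    """Request deletion once the resource has owners and none of them exist."""
    if resource.owner_uids and not any(uid in live_uids for uid in resource.owner_uids):
        resource.deletion_requested = True


def is_gone(resource):
    """A resource is only removed once deletion is requested and no finalizers remain."""
    return resource.deletion_requested and not resource.finalizers


# The CRD's only owner was the old revision, which has been deleted.
crd = Resource("example-crd", owner_uids=["rev-v0.16.0"])
collect(crd, live_uids={"rev-v0.17.0"})

# An instance marked for deletion is stuck while no controller removes its finalizer.
instance = Resource("example-instance", finalizers=["managed-reconciler"], deletion_requested=True)
print(crd.deletion_requested, is_gone(instance))
```

The instance stays in this stuck state until some running controller processes the deletion and removes the finalizer, which is what happens later in the scenario.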
The issue arose when the CRDs that were blocking the upgrade were resolved. This can happen automatically if no instances were present and the previous ProviderRevision was deleted, causing the blocking CRD to be garbage collected, assuming there was any delay before the new revision set its controller reference. The new ProviderRevision was then able to either create or gain control of all CRDs, allowing it to start its controller. At that point, it reconciled all of the CRD instances (managed resources) that had their deletion timestamp set, deleting them externally (assuming deletionPolicy: Delete).
So in summary, it is dangerous to clean up an old ProviderRevision before the new one has become fully installed and healthy. Do note that there are ways to avoid this, namely always making sure the next revision has been established before deleting the old one (this is why the default revisionHistoryLimit is set to 1), but we should do everything we can to make sure it is hard to perform unintentionally destructive operations.
How could Crossplane help solve your problem?
The clear path to alleviating this problem (other than more documentation, which we should also invest in) is allowing a new ProviderRevision to go ahead and become the owner of all objects it is able to in the case that it cannot become the controller of all of them. This would mean that if a user cleaned up a previous revision before the new one became healthy, the CRDs would not be garbage collected and instances of managed resources would not be marked for deletion.
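A rough sketch of the proposed behaviour, under the same kind of simplified model: the revision always records itself as an owner, and only marks itself as controller when it can control everything. Crd, establish, survives_gc and the can_control predicate are hypothetical names for illustration, and for brevity the sketch assumes the revision is able to own every CRD it is given.

```python
from dataclasses import dataclass, field


@dataclass
class Crd:
    name: str
    # Maps owner UID -> True if that owner is the controller.
    owners: dict = field(default_factory=dict)


def establish(crds, revision_uid, can_control):
    """Always add an owner reference; become controller only if every CRD allows it."""
    controllable = all(can_control(crd) for crd in crds)
    for crd in crds:
        if controllable:
            # Demote any previous controller before taking over.
            crd.owners = {uid: False for uid in crd.owners}
        crd.owners[revision_uid] = controllable
    return controllable


def survives_gc(crd, live_uids):
    """A CRD with owners is kept as long as at least one of them still exists."""
    return not crd.owners or any(uid in live_uids for uid in crd.owners)


crds = [Crd("a", {"rev-old": False}), Crd("b", {"rev-old": False})]
healthy = establish(crds, "rev-new", lambda crd: crd.name != "b")

# The user deletes the old revision before the new one is healthy.
live = {"rev-new"}
print(healthy, [survives_gc(crd, live) for crd in crds])
```

Under this model, deleting the old revision early no longer orphans the CRDs, because the new revision is already listed as an owner even though it has not started its controller.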
About this issue
- State: open
- Created 3 years ago
- Reactions: 2
- Comments: 17 (11 by maintainers)
the issue is still relevant, reopening