crossplane: Reconcile blocked by `crossplane.io/external-create-pending` annotation
What happened?
Managed resource stopped reconciling with error event:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning CannotInitializeManagedResource 29m (x19 over 19h) managed/queue.sqs.aws.crossplane.io cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed
This looks like a duplicate of #2843, but to be clear, I am seeing this on an entirely different provider (`aws` vs. `gcp` in that linked issue). I think it would be good to understand the root cause of this state rather than accepting removal of the `crossplane.io/external-create-pending` annotation as the solution, as that is much harder to scale in a large production deployment IMO.
How can we reproduce it?
I don’t have clear steps for reproduction and am opening this issue in hopes others come forward with more details and/or maintainers have some ideas on what/where to try/look for reproducing.
What environment did it happen in?
- Crossplane version: 1.6.4 (but likely the error state started before upgrading to this version)
- Cloud provider or hardware configuration: AWS
- Kubernetes version (use `kubectl version`): v1.22.6-eks-7d68063
- Kubernetes distribution (e.g. Tectonic, GKE, OpenShift): EKS
- OS (e.g. from /etc/os-release): Amazon Linux 2
- Kernel (e.g. `uname -a`): 5.4.181-99.354.amzn2
About this issue
- State: open
- Created 2 years ago
- Reactions: 5
- Comments: 19 (4 by maintainers)
Crossplane (or more specifically, a Crossplane provider) adds the `crossplane.io/external-create-pending` annotation to managed resources right before it attempts to create their corresponding external resource (i.e. make a create call to some cloud API). It then adds another annotation, either `crossplane.io/external-create-failed` or `crossplane.io/external-create-succeeded`, once the creation is observed to have succeeded or failed. Unfortunately this all has to happen in a single reconcile loop iteration. If the reconcile is interrupted before it writes the failed/succeeded annotation, the managed resource ends up in this state. That’s likely what happened for @Jell: the provider deployment was restarted while it was in the middle of trying to create some external resources.
In an ideal world we wouldn’t need to do this. We’d just create the external resource and return from the reconcile loop. On the next reconcile we’d look up the external resource by its identifier and determine whether the create failed or succeeded. This is actually how Crossplane originally worked. The problem is that a surprising (and kind of depressing) number of cloud APIs don’t use deterministic identifiers. For example, when you call the AWS API to create a VPC, the API returns a payload that tells you what the newly created VPC’s ID is. If we don’t successfully record that ID (e.g. because the provider was restarted mid-create), we have no way to tell on the next reconcile whether we successfully created the VPC or not. That’s basically what this annotation does: it ensures that if the result of a creation is ambiguous, we stop and ask a human to intervene rather than potentially creating more infrastructure than you actually asked for (e.g. creating a second VPC and leaking the original one).
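To make the check described above concrete, here is a minimal sketch in Go. It is not the crossplane-runtime implementation. It assumes the three annotations hold RFC 3339 timestamps, and the helper names `parseStamp` and `createIncomplete` are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation keys named in this thread.
const (
	keyPending   = "crossplane.io/external-create-pending"
	keySucceeded = "crossplane.io/external-create-succeeded"
	keyFailed    = "crossplane.io/external-create-failed"
)

// parseStamp returns the zero time when the annotation is absent or unparsable.
// Assumption: annotation values are RFC 3339 timestamps.
func parseStamp(annotations map[string]string, key string) time.Time {
	t, err := time.Parse(time.RFC3339, annotations[key])
	if err != nil {
		return time.Time{}
	}
	return t
}

// createIncomplete reports whether a create was started but its result
// (succeeded or failed) was never recorded after the pending timestamp.
func createIncomplete(annotations map[string]string) bool {
	pending := parseStamp(annotations, keyPending)
	if pending.IsZero() {
		return false // creation was never attempted
	}
	if parseStamp(annotations, keySucceeded).After(pending) {
		return false // result recorded: create succeeded
	}
	if parseStamp(annotations, keyFailed).After(pending) {
		return false // result recorded: create failed
	}
	return true // pending with no later result: ambiguous, needs a human
}

func main() {
	// Reconcile interrupted after writing "pending" but before the result.
	interrupted := map[string]string{keyPending: "2022-04-01T10:00:00Z"}
	// Reconcile that completed and recorded success.
	finished := map[string]string{
		keyPending:   "2022-04-01T10:00:00Z",
		keySucceeded: "2022-04-01T10:00:05Z",
	}
	fmt.Println(createIncomplete(interrupted)) // true
	fmt.Println(createIncomplete(finished))    // false
}
```

In this sketch the interrupted case is the one that would surface as the `cannot determine creation result` event quoted at the top of the issue.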
In some cases it’s possible to work around this by looking things up by labels or other identifying properties instead, but this requires a lot of case-by-case code to be added to managed resource controllers, and even then it isn’t possible in all cases (e.g. for resources without deterministic identifiers that also don’t support tagging/labels).
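As a rough illustration of the lookup-by-tags workaround, the sketch below matches listed external resources against a set of identifying tags. Everything in it is hypothetical: the `externalResource` type, the `findByTags` helper, and the `example/managed-by` tag key are invented for the example and do not come from any provider.

```go
package main

import "fmt"

// externalResource is a hypothetical, simplified view of an item
// returned by a cloud API list call.
type externalResource struct {
	ID   string
	Tags map[string]string
}

// findByTags returns the IDs of resources whose tags contain every
// key/value pair in want.
func findByTags(resources []externalResource, want map[string]string) []string {
	var ids []string
	for _, r := range resources {
		match := true
		for k, v := range want {
			if got, ok := r.Tags[k]; !ok || got != v {
				match = false
				break
			}
		}
		if match {
			ids = append(ids, r.ID)
		}
	}
	return ids
}

func main() {
	listed := []externalResource{
		{ID: "vpc-aaa", Tags: map[string]string{"example/managed-by": "my-vpc"}},
		{ID: "vpc-bbb", Tags: map[string]string{"team": "data"}},
	}
	ids := findByTags(listed, map[string]string{"example/managed-by": "my-vpc"})
	switch len(ids) {
	case 0:
		fmt.Println("no match: the earlier create did not produce a resource")
	case 1:
		fmt.Println("adopt existing resource:", ids[0])
	default:
		fmt.Println("ambiguous, needs human review:", ids)
	}
}
```

Even in this toy form the lookup has three outcomes to handle, and it only works for APIs that support tags, which is the case-by-case cost mentioned above.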
I think the two things we can do to improve this are:
Flash forward almost 1.5 years from my creation of this issue and I’m hitting what I think is a different variation of this problem. Currently running `crossplane:v1.14.4` and `crossplane-contrib/provider-aws:v0.46.0` (`crossplane-contrib/provider-aws:v0.45.2` previously) and while testing creation of `Distribution.cloudfront.aws.crossplane.io/v1alpha1`, I am experiencing this behavior:
You’ll notice a few interesting things:
- `Events` demonstrate the controller is attempting to create a new external resource even while also waiting for the external resource existence to be confirmed (`PendingExternalResource`)?
- `CannotCreateExternalResource` is expected in the sense Crossplane is trying to recreate the same resource again using the same canonical identifier (`caller reference`) for this specific provider API and type (but the resource already exists from the first `CreatedExternalResource` 14m earlier).
- Interesting to see this is somewhat similar to what @ZhiminXiang observed back in May even though I am using an entirely different provider (contrib/provider-aws)
Finally, to rule out the original suspected cause of this GitHub issue, I have confirmed there have been no controller Pod restarts and all Pods are older than the initial creation of my resource above:
Let me know if I can provide anything further or if anyone has any additional ideas.
/fresh
Same issue on crossplane version 1.7.0 and provider-aws 0.26.0. Running on a local kind cluster.