crossplane: Reconcile blocked by `crossplane.io/external-create-pending` annotation

What happened?

Managed resource stopped reconciling with error event:

Events:
  Type     Reason                           Age                 From                                 Message
  ----     ------                           ----                ----                                 -------
  Warning  CannotInitializeManagedResource  29m (x19 over 19h)  managed/queue.sqs.aws.crossplane.io  cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed

This looks like a duplicate of #2843, but to be clear, I am seeing this on an entirely different provider (AWS, versus GCP in the linked issue). I think it would be good to understand the root cause of this state rather than treating removal of the crossplane.io/external-create-pending annotation as the accepted solution, since removing it by hand is much harder to scale in a large production deployment, IMO.

How can we reproduce it?

I don’t have clear steps for reproduction. I am opening this issue in the hope that others come forward with more details, or that maintainers have ideas on where to look and what to try in order to reproduce it.

What environment did it happen in?

Crossplane version: 1.6.4 (but likely the error state started before upgrading to this version)

  • Cloud provider or hardware configuration: AWS
  • Kubernetes version (use kubectl version): v1.22.6-eks-7d68063
  • Kubernetes distribution (e.g. Tectonic, GKE, OpenShift): EKS
  • OS (e.g. from /etc/os-release): Amazon Linux2
  • Kernel (e.g. uname -a): 5.4.181-99.354.amzn2

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 5
  • Comments: 19 (4 by maintainers)

Most upvoted comments

Crossplane (or more specifically, a Crossplane provider) adds the crossplane.io/external-create-pending annotation to a managed resource right before it attempts to create the corresponding external resource (i.e. make a create call to some cloud API). It then adds another annotation, either crossplane.io/external-create-succeeded or crossplane.io/external-create-failed, once the creation is observed to have succeeded or failed. Unfortunately this all has to happen in a single reconcile loop iteration. If the reconcile is interrupted before it writes the succeeded/failed annotation, the managed resource ends up in this state. That’s likely what happened for @Jell - the provider deployment was restarted while it was in the middle of trying to create some external resources.
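The sequence described above can be modeled in a few lines. This is a minimal sketch rather than Crossplane's actual code: `reconcile_create`, the `create_external` callback, the `crash_before_result` flag and the `Interrupted` exception are hypothetical names used only for illustration. Only the three annotation keys come from the discussion.

```python
from datetime import datetime, timezone

PENDING = "crossplane.io/external-create-pending"
SUCCEEDED = "crossplane.io/external-create-succeeded"
FAILED = "crossplane.io/external-create-failed"


def now():
    # RFC 3339 style timestamp, like the annotation values in this thread.
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


class Interrupted(Exception):
    """Stands in for the provider being restarted mid-reconcile."""


def reconcile_create(annotations, create_external, crash_before_result=False):
    # Step 1: record the intent to create before calling the cloud API.
    annotations[PENDING] = now()
    try:
        create_external()
    except Exception:
        # Step 2a: the create call was observed to fail.
        annotations[FAILED] = now()
        return
    if crash_before_result:
        # The process dies after the API call but before the result is saved.
        raise Interrupted()
    # Step 2b: the create call was observed to succeed.
    annotations[SUCCEEDED] = now()


def only_pending(annotations):
    # The stuck state: pending is set, but no outcome was ever recorded.
    return PENDING in annotations and not (
        SUCCEEDED in annotations or FAILED in annotations
    )
```

Calling `reconcile_create` with `crash_before_result=True` leaves only the pending annotation behind, which is the state the CannotInitializeManagedResource event complains about.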

In an ideal world we wouldn’t need to do this - we’d just create the external resource and return from the reconcile loop. Next reconcile we’d look up the external resource by its identifier and determine whether the create failed or succeeded. This is actually how Crossplane originally worked. The problem is that a surprising (and kind of depressing) number of cloud APIs don’t use deterministic identifiers. For example, when you call the AWS API to create a VPC, the API returns a payload that tells you what the newly created VPC’s ID is. If we don’t successfully record that ID (e.g. because the provider was restarted mid-create) we have no way to tell on the next reconcile whether we successfully created the VPC or not. That’s basically what this annotation does - it ensures that, if the outcome of a create was ambiguous, we stop and ask a human to intervene rather than potentially creating more infrastructure than you actually asked for (e.g. creating a second VPC and leaking the original one).
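To make the leak scenario concrete, here is a toy model of an API that assigns identifiers server-side, as in the VPC example. Everything in it is an assumption for illustration: `FakeCloud`, `naive_reconcile` and the `vpc-000N` ID format are made up and do not correspond to any real AWS or Crossplane API.

```python
import itertools


class FakeCloud:
    """Toy API that assigns IDs server-side (hypothetical, not AWS)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.vpcs = {}

    def create_vpc(self, cidr):
        vpc_id = f"vpc-{next(self._ids):04d}"
        self.vpcs[vpc_id] = cidr
        return vpc_id  # the only place the caller learns the ID


def naive_reconcile(cloud, resource, crash_after_create=False):
    # With a recorded ID we can look the resource up and stop here.
    if resource.get("external_id") in cloud.vpcs:
        return
    # Without one there is nothing to look up, so we create (again).
    vpc_id = cloud.create_vpc(resource["cidr"])
    if crash_after_create:
        return  # simulated restart: the returned ID is never saved
    resource["external_id"] = vpc_id


cloud = FakeCloud()
vpc = {"cidr": "10.0.0.0/16"}
naive_reconcile(cloud, vpc, crash_after_create=True)  # first VPC is leaked
naive_reconcile(cloud, vpc)                           # a second VPC is created
print(len(cloud.vpcs), vpc["external_id"])  # prints: 2 vpc-0002
```

With a deterministic, caller-chosen identifier the second reconcile could have looked the VPC up by name instead of creating another one.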

In some cases it’s possible to work around this by looking things up by labels or other identifying properties instead, but this requires a lot of case-by-case code to be added to managed resource controllers, and even then it isn’t possible in all cases (e.g. for resources without deterministic identifiers that also don’t support tagging/labels).
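A tag-based lookup of the kind mentioned above might look roughly like this. It is only a sketch under stated assumptions: `TAG_KEY`, `find_by_tag` and `adopt_or_create` are hypothetical, external resources are modeled as a plain dict of ID to tags, and the tag is assumed to be written atomically with the create call.

```python
TAG_KEY = "example.org/managed-resource-name"  # hypothetical tag key


def find_by_tag(external, key, value):
    """Return the IDs of external resources carrying tag key=value."""
    return [rid for rid, tags in external.items() if tags.get(key) == value]


def adopt_or_create(external, name, create):
    matches = find_by_tag(external, TAG_KEY, name)
    if len(matches) > 1:
        # Ambiguous: a human still has to decide which one to keep.
        raise LookupError(f"{len(matches)} resources tagged {name!r}")
    if matches:
        return matches[0]  # adopt the resource left by an earlier attempt
    return create({TAG_KEY: name})  # nothing found, so creating is safe
```

Even in this toy form, the lookup only helps when the API supports tags and applies them as part of the create call, which is why it cannot be a general solution.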

I think the two things we can do to improve this are:

  1. Document what I wrote here.
  2. Allow resources that do have deterministic identifiers to opt-out of this behavior.

Flash forward almost 1.5 years from my creation of this issue and I’m hitting what I think is a different variation of this problem. Currently running crossplane:v1.14.4 and crossplane-contrib/provider-aws:v0.46.0 (crossplane-contrib/provider-aws:v0.45.2 previously) and while testing creation of Distribution.cloudfront.aws.crossplane.io/v1alpha1, I am experiencing this behavior:

apiVersion: cloudfront.aws.crossplane.io/v1alpha1
kind: Distribution
metadata:
  annotations:
    crossplane.io/composition-resource-name: Distribution
    crossplane.io/external-create-failed: "2023-12-18T21:48:06Z"
    crossplane.io/external-create-pending: "2023-12-18T21:48:06Z"
    crossplane.io/external-create-succeeded: "2023-12-18T21:34:56Z"
...
Events:
  Type     Reason                        Age                 From                                               Message
  ----     ------                        ----                ----                                               -------
  Normal   CreatedExternalResource       14m                 managed/distribution.cloudfront.aws.crossplane.io  Successfully requested creation of external resource
  Warning  CannotUpdateManagedResource   14m                 managed/distribution.cloudfront.aws.crossplane.io  Operation cannot be fulfilled on distributions.cloudfront.aws.crossplane.io "web-www-dev-usw-pc2ks-259cf": the object has been modified; please apply your changes to the latest version and try again
  Normal   PendingExternalResource       14m (x4 over 14m)   managed/distribution.cloudfront.aws.crossplane.io  Waiting for external resource existence to be confirmed
  Warning  CannotCreateExternalResource  22s (x14 over 13m)  managed/distribution.cloudfront.aws.crossplane.io  cannot create Distribution in AWS: DistributionAlreadyExists: The caller reference that you are using to create a distribution is associated with another distribution. Already exists: EW7C2ZJPWS6YK

You’ll notice a few interesting things:

  1. The pending, succeeded and failed external-create annotations are all present simultaneously.
  2. The Events suggest the controller is attempting to create a new external resource even while it is still waiting for the external resource’s existence to be confirmed (PendingExternalResource).
  3. The CannotCreateExternalResource error is expected, in the sense that Crossplane is trying to create the same resource again using the same caller reference, which is the canonical identifier for this specific provider API and type, but the resource already exists from the first CreatedExternalResource event 14m earlier.
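One way to read the three annotations is to put their timestamps on a timeline. The sketch below does that for the values pasted above. The `create_incomplete` rule (creation counts as incomplete only when pending is strictly newer than every recorded outcome) is an assumption made for illustration and may not match the exact comparison crossplane-runtime performs.

```python
from datetime import datetime

PREFIX = "crossplane.io/external-create-"

annotations = {
    PREFIX + "failed": "2023-12-18T21:48:06Z",
    PREFIX + "pending": "2023-12-18T21:48:06Z",
    PREFIX + "succeeded": "2023-12-18T21:34:56Z",
}


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def timeline(annotations):
    """Return (timestamp, state) pairs, oldest first."""
    return sorted(
        (parse(value), key[len(PREFIX):])
        for key, value in annotations.items()
        if key.startswith(PREFIX)
    )


def create_incomplete(annotations):
    # Assumed rule: incomplete if pending is newer than every outcome.
    pending = annotations.get(PREFIX + "pending")
    if pending is None:
        return False
    outcomes = [
        annotations[PREFIX + state]
        for state in ("succeeded", "failed")
        if PREFIX + state in annotations
    ]
    return all(parse(pending) > parse(outcome) for outcome in outcomes)


states = timeline(annotations)
gap = (states[-1][0] - states[0][0]).total_seconds()
print([state for _, state in states], gap, create_incomplete(annotations))
```

Under that assumed rule the resource is not stuck in the "creation result unknown" state: the succeeded timestamp is about 13 minutes older than the pending/failed pair, which is consistent with a first create that succeeded followed by repeated create attempts that each overwrite pending and failed.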

Interesting to see this is somewhat similar to what @ZhiminXiang observed back in May, even though I am using an entirely different provider (contrib/provider-aws).

Finally, to rule out the original suspected cause of this GitHub issue, I have confirmed there have been no controller Pod restarts and all Pods are older than the initial creation of my resource above:

crossplane-794cbcb9c8-hn6xd                            1/1     Running   0          6d2h
crossplane-rbac-manager-7d6875d4b8-wzz8f               1/1     Running   0          6d2h
function-auto-ready-ad9454a37aa7-f467cb448-l8rkn       1/1     Running   0          5d21h
function-go-templating-f34ec030415a-858c56b6cb-h56wz   1/1     Running   0          5d21h
provider-aws-b4eafc5192c9-7c89c954c4-sxnjx             1/1     Running   0          31m

Let me know if I can provide anything further or if anyone has any additional ideas.

/fresh

Same issue on crossplane version 1.7.0 and provider-aws 0.26.0. Running on a local kind cluster.