aws-cdk: aws-s3-deployment - intermittent cloudfront "Waiter InvalidationCompleted failed" error

https://github.com/aws/aws-cdk/blob/beb01b549abc5a5c825e951649786589d2150a72/packages/%40aws-cdk/aws-s3-deployment/lib/lambda/index.py#L150-L163

I’ve come across a deployment where CloudFront was invalidated but the lambda timed out with cfn_error: Waiter InvalidationCompleted failed: Max attempts exceeded. ~I suspect a race condition, and that reversing the order of cloudfront.create_invalidation() and cloudfront.get_waiter() would fix this race condition.~

edit: the proposed fix of reversing create_invalidation() and get_waiter() is invalid; see https://github.com/aws/aws-cdk/issues/15891#issuecomment-898413309
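
For context, the handler’s invalidation step boils down to roughly the following (a minimal boto3 sketch, not the handler’s exact code; the distribution ID, caller reference, and paths are placeholders):

    import boto3

    cloudfront = boto3.client('cloudfront')

    # Create the invalidation (placeholder distribution ID, paths and caller reference).
    invalidation = cloudfront.create_invalidation(
        DistributionId='EXAMPLE_DISTRIBUTION_ID',
        InvalidationBatch={
            'Paths': {'Quantity': 1, 'Items': ['/*']},
            'CallerReference': 'example-deployment-id',
        },
    )

    # Poll GetInvalidation until the status is "Completed". With boto3's default
    # waiter settings (roughly 20-second polls for up to 30 attempts at the time
    # of writing), this gives up after about 10 minutes with "Max attempts exceeded".
    cloudfront.get_waiter('invalidation_completed').wait(
        DistributionId='EXAMPLE_DISTRIBUTION_ID',
        Id=invalidation['Invalidation']['Id'],
    )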

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 24
  • Comments: 61 (15 by maintainers)

Most upvoted comments

Started to see this problem when using s3 bucket deployments with CDK

Can we re-open this issue? It’s still a problem with the underlying lambda even if it’s related to another service. What if we provide an option to not fail the custom resource if the invalidation fails?
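
In handler terms, that option could amount to catching the waiter failure and logging it instead of failing the resource; a purely hypothetical sketch (no such flag exists in aws-s3-deployment today, and the IDs are placeholders):

    import logging

    import boto3
    from botocore.exceptions import WaiterError

    cloudfront = boto3.client('cloudfront')

    try:
        cloudfront.get_waiter('invalidation_completed').wait(
            DistributionId='EXAMPLE_DISTRIBUTION_ID',
            Id='EXAMPLE_INVALIDATION_ID',
        )
    except WaiterError as err:
        # Hypothetical opt-in behaviour: warn and let the custom resource succeed
        # instead of failing the whole stack when the waiter gives up.
        logging.warning('Invalidation wait gave up, continuing anyway: %s', err)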

Is it possible to re-open this issue? We’re experiencing this problem as well.

It seems there’s currently a problem with AWS CloudFront; I get the same timeout errors.

From the CloudFront team:

The CreateInvalidation API suffers from a high fault rate during the daily traffic peaks. It will return faults for up to 50% of requests. This is primarily due to the limited capacity of the API.

and

We have evidence that some requests failed even after six retries, during the peak. We are working on improving this, but there is no quick fix for this and we are expecting it will get better by the end of Q1 2022.

I raised this issue internally with the CloudFront team. I’ll keep you guys updated in this conversation.
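
For anyone calling CreateInvalidation from their own code in the meantime, one way to absorb some of that fault rate is to give the SDK a bigger retry budget; a sketch with illustrative settings (not what the bundled handler uses):

    import boto3
    from botocore.config import Config

    # Illustrative retry settings; adaptive mode adds client-side rate limiting
    # on top of retries with backoff.
    cloudfront = boto3.client(
        'cloudfront',
        config=Config(retries={'max_attempts': 10, 'mode': 'adaptive'}),
    )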

My team are also seeing this error regularly!

This issue got worse for us so this is our solution for now:

    // Imports assumed below (aws-cdk-lib v2):
    import { Aws, Duration, Stack } from 'aws-cdk-lib';
    import * as events from 'aws-cdk-lib/aws-events';
    import * as eventsTargets from 'aws-cdk-lib/aws-events-targets';
    import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
    import * as sfnTasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

    // Inside the stack's constructor (`distribution` is the CloudFront distribution
    // created elsewhere in the stack): call CreateInvalidation from a Step Functions
    // SDK task instead of relying on the bucket deployment's custom resource.
    const createInvalidation = new sfnTasks.CallAwsService(this, 'CreateInvalidation', {
      service: 'cloudfront',
      action: 'createInvalidation',
      parameters: {
        DistributionId: distribution.distributionId,
        InvalidationBatch: {
          CallerReference: sfn.JsonPath.entirePayload,
          Paths: {
            Items: ['/*'],
            Quantity: 1,
          },
        },
      },
      iamResources: [
        `arn:aws:cloudfront::${Aws.ACCOUNT_ID}:distribution/${distribution.distributionId}`,
      ],
    });

    // Retry CloudFront faults with exponential backoff (5s, 10s, 20s, ...).
    const createInvalidationStateMachine = new sfn.StateMachine(
      this,
      'CreateInvalidationStateMachine',
      {
        definition: createInvalidation.addRetry({
          errors: ['CloudFront.CloudFrontException'],
          backoffRate: 2,
          interval: Duration.seconds(5),
          maxAttempts: 10,
        }),
      }
    );

    // Kick off the invalidation only after the CloudFormation stack update completes.
    new events.Rule(this, 'DeploymentComplete', {
      eventPattern: {
        source: ['aws.cloudformation'],
        detail: {
          'stack-id': [`${Stack.of(this).stackId}`],
          'status-details': {
            status: ['UPDATE_COMPLETE'],
          },
        },
      },
    }).addTarget(
      new eventsTargets.SfnStateMachine(createInvalidationStateMachine, {
        input: events.RuleTargetInput.fromEventPath('$.id'),
      })
    );

Still happening in 2024… Not sure why I’m using Cloudfront at this point…

This is happening to us frequently now also.

Reopening because additional customers have been impacted by this issue. @naseemkullah are you still running into this issue?

From another customer experiencing the issue: Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded

This issue is intermittent, and when we redeploy it works. Our pipelines are automated and we deploy 3-5 times every day in production. When our stack fails due to this error, CloudFront is unable to roll back, which creates high-severity issues in prod, and there is downtime until we rerun the pipeline. The error happens during the invalidation step, but somehow CloudFront is not able to get the files from the S3 origin when this error occurs. We have enabled versioning on the S3 bucket so that CloudFront can serve the older version in case of rollback, but it’s still unable to fetch files until we redeploy.

customer’s code:

  new s3deploy.BucketDeployment(this, 'DeployWithInvalidation', {
    sources: [s3deploy.Source.asset(`../packages/dist`)],
    destinationBucket: bucket,
    distribution,
    distributionPaths: [`/*`],
    retainOnDelete: false,
    prune: false,
  });

This deploys the files to the S3 bucket and creates a CloudFront invalidation, which is when the stack fails on the waiter error.

@otaviomacedo Did you ever get an update from them? Just ran into this (also once at deploy, once at rollback), and it’s a major PITA.

We have evidence that some requests failed even after six retries, during the peak. We are working on improving this, but there is no quick fix for this and we are expecting it will get better by the end of Q1 2022.

We no longer experience this issue after increasing the memory limit of the bucket deployment.

new BucketDeployment(this, 'website-deployment', {
  ...config,
  memoryLimit: 2048
})

The default memory limit is 128 MiB. (docs)

Encountered the same issue; some CloudFormation event log timestamps:

| Timestamp | Logical ID | Status | Status reason |
| -- | -- | -- | -- |
| 2023-06-09 08:15:11 UTC-0700 | AgenticConsoleawsgammauseast1consolestackbucketdeploymentCustomResource9C0F1745 | UPDATE_FAILED | Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: 3b01a325-6c24-45f0-8f6c-86638f2e282b) |
| 2023-06-09 08:04:38 UTC-0700 | AgenticConsoleawsgammauseast1consolestackbucketdeploymentCustomResource9C0F1745 | UPDATE_IN_PROGRESS | - |

It took 10 minutes to fail the CDK stack, and the invalidation was created 1 minute after the failure:

| Invalidation ID | Status | Date |
| -- | -- | -- |
| IEKSZWOI5U3Q6GNNNQMQLJ11WH | Completed | June 9, 2023 at 3:16:20 PM UTC |

We are also experiencing this issue intermittently with our cloudfront invalidations (once every two weeks or so) 😞

In my case, two invalidations were kicked off and both were in progress for a long time, eventually timing out. (Screenshot: Screen Shot 2021-09-08 at 9 15 14 AM)

In this case, the most plausible hypothesis is that CloudFront is actually taking longer than 10 min to invalidate the files in some cases. We can try to reduce the chance of this happening by increasing the waiting time, but Lambda has a maximum timeout of 15 min. Beyond that, it’s not clear to me what else we can do. In any case, contributions are welcome!
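
For anyone experimenting with a longer wait in their own code, the polling budget can be pushed closer to that 15-minute ceiling by overriding the waiter’s WaiterConfig; a sketch with illustrative values and placeholder IDs:

    import boto3

    cloudfront = boto3.client('cloudfront')

    cloudfront.get_waiter('invalidation_completed').wait(
        DistributionId='EXAMPLE_DISTRIBUTION_ID',
        Id='EXAMPLE_INVALIDATION_ID',
        # Roughly 13 minutes of polling instead of the default ~10; values are illustrative.
        WaiterConfig={'Delay': 20, 'MaxAttempts': 40},
    )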

It has happened twice in recent days; next time it occurs I will try to confirm this. IIRC, the first time this happened I checked and saw that the invalidation event had occurred almost immediately, yet the waiter did not see that (that’s why I thought it might be a race condition). Will confirm though!
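
One way to confirm this out-of-band is to check the invalidation statuses directly with boto3; a sketch (the distribution ID is a placeholder):

    import boto3

    cloudfront = boto3.client('cloudfront')

    # List recent invalidations and their statuses to see whether one the waiter
    # gave up on had in fact completed.
    resp = cloudfront.list_invalidations(DistributionId='EXAMPLE_DISTRIBUTION_ID')
    for item in resp['InvalidationList'].get('Items', []):
        inv = cloudfront.get_invalidation(
            DistributionId='EXAMPLE_DISTRIBUTION_ID',
            Id=item['Id'],
        )['Invalidation']
        print(item['Id'], inv['Status'], inv['CreateTime'])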

I think the risk involved in this change is quite low. Please submit the PR and I’ll be happy to review it.