aws-cdk: aws-s3-deployment - intermittent cloudfront "Waiter InvalidationCompleted failed" error

https://github.com/aws/aws-cdk/blob/beb01b549abc5a5c825e951649786589d2150a72/packages/%40aws-cdk/aws-s3-deployment/lib/lambda/index.py#L150-L163

I’ve come across a deployment where CloudFront was invalidated but the lambda timed out with cfn_error: Waiter InvalidationCompleted failed: Max attempts exceeded. ~I suspect a race condition, and that reversing the order of cloudfront.create_invalidation() and cloudfront.get_waiter() would fix this race condition.~

edit: the proposed fix of reversing create_invalidation() and get_waiter() is invalid; see https://github.com/aws/aws-cdk/issues/15891#issuecomment-898413309
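
For context, the handler’s invalidation step boils down to roughly the following (a minimal boto3 sketch, not the handler’s exact code; the distribution ID, caller reference, and paths are placeholders):

    import boto3

    cloudfront = boto3.client('cloudfront')

    # Create the invalidation (placeholder distribution ID, paths and caller reference).
    invalidation = cloudfront.create_invalidation(
        DistributionId='EXAMPLE_DISTRIBUTION_ID',
        InvalidationBatch={
            'Paths': {'Quantity': 1, 'Items': ['/*']},
            'CallerReference': 'example-deployment-id',
        },
    )

    # Poll GetInvalidation until the status is "Completed". With boto3's default
    # waiter settings (roughly 20-second polls for up to 30 attempts at the time
    # of writing), this gives up after about 10 minutes with "Max attempts exceeded".
    cloudfront.get_waiter('invalidation_completed').wait(
        DistributionId='EXAMPLE_DISTRIBUTION_ID',
        Id=invalidation['Invalidation']['Id'],
    )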

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 24
  • Comments: 61 (15 by maintainers)

Most upvoted comments

Started to see this problem when using s3 bucket deployments with CDK

Can we re-open this issue? It’s still a problem with the underlying lambda even if it’s related to another service. What if we provide an option to not fail the custom resource if the invalidation fails?
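
In handler terms, that option could amount to catching the waiter failure and logging it instead of failing the resource; a purely hypothetical sketch (no such flag exists in aws-s3-deployment today, and the IDs are placeholders):

    import logging

    import boto3
    from botocore.exceptions import WaiterError

    cloudfront = boto3.client('cloudfront')

    try:
        cloudfront.get_waiter('invalidation_completed').wait(
            DistributionId='EXAMPLE_DISTRIBUTION_ID',
            Id='EXAMPLE_INVALIDATION_ID',
        )
    except WaiterError as err:
        # Hypothetical opt-in behaviour: warn and let the custom resource succeed
        # instead of failing the whole stack when the waiter gives up.
        logging.warning('Invalidation wait gave up, continuing anyway: %s', err)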

Is it possible to re-open this issue? We’re experiencing this problem as well.

It seems there’s currently a problem with AWS CloudFront; I get the same timeout errors.

From the CloudFront team:

The CreateInvalidation API suffers from a high fault rate during the daily traffic peaks. It will return faults for up to 50% of requests. This is primarily due to the limited capacity of the API.

and

We have evidence that some requests failed even after six retries, during the peak. We are working on improving this, but there is no quick fix for this and we are expecting it will get better by the end of Q1 2022.

I raised this issue internally with the CloudFront team. I’ll keep you guys updated in this conversation.
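
For anyone calling CreateInvalidation from their own code in the meantime, one way to absorb some of that fault rate is to give the SDK a bigger retry budget; a sketch with illustrative settings (not what the bundled handler uses):

    import boto3
    from botocore.config import Config

    # Illustrative retry settings; adaptive mode adds client-side rate limiting
    # on top of retries with backoff.
    cloudfront = boto3.client(
        'cloudfront',
        config=Config(retries={'max_attempts': 10, 'mode': 'adaptive'}),
    )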

My team are also seeing this error regularly!

This issue got worse for us so this is our solution for now:

    // Imports assumed below (aws-cdk-lib v2):
    import { Aws, Duration, Stack } from 'aws-cdk-lib';
    import * as events from 'aws-cdk-lib/aws-events';
    import * as eventsTargets from 'aws-cdk-lib/aws-events-targets';
    import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
    import * as sfnTasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

    // Inside the stack's constructor (`distribution` is the CloudFront distribution
    // created elsewhere in the stack): call CreateInvalidation from a Step Functions
    // SDK task instead of relying on the bucket deployment's custom resource.
    const createInvalidation = new sfnTasks.CallAwsService(this, 'CreateInvalidation', {
      service: 'cloudfront',
      action: 'createInvalidation',
      parameters: {
        DistributionId: distribution.distributionId,
        InvalidationBatch: {
          CallerReference: sfn.JsonPath.entirePayload,
          Paths: {
            Items: ['/*'],
            Quantity: 1,
          },
        },
      },
      iamResources: [
        `arn:aws:cloudfront::${Aws.ACCOUNT_ID}:distribution/${distribution.distributionId}`,
      ],
    });

    // Retry CloudFront faults with exponential backoff (5s, 10s, 20s, ...).
    const createInvalidationStateMachine = new sfn.StateMachine(
      this,
      'CreateInvalidationStateMachine',
      {
        definition: createInvalidation.addRetry({
          errors: ['CloudFront.CloudFrontException'],
          backoffRate: 2,
          interval: Duration.seconds(5),
          maxAttempts: 10,
        }),
      }
    );

    // Kick off the invalidation only after the CloudFormation stack update completes.
    new events.Rule(this, 'DeploymentComplete', {
      eventPattern: {
        source: ['aws.cloudformation'],
        detail: {
          'stack-id': [`${Stack.of(this).stackId}`],
          'status-details': {
            status: ['UPDATE_COMPLETE'],
          },
        },
      },
    }).addTarget(
      new eventsTargets.SfnStateMachine(createInvalidationStateMachine, {
        input: events.RuleTargetInput.fromEventPath('$.id'),
      })
    );

Still happening in 2024… Not sure why I’m using Cloudfront at this point…

This is happening to us frequently now also.

Reopening because additional customers have been impacted by this issue. @naseemkullah are you still running into this issue?

From another customer experiencing the issue: Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded

This issue is intermittent, and when we redeploy it works. Our pipelines are automated and we deploy 3-5 times every day in production. When our stack fails due to this error, CloudFront is unable to roll back, which creates high-severity issues in prod, and there is downtime until we rerun the pipeline. The error happens during the invalidation step, but somehow CloudFront is not able to get the files from the S3 origin when this error occurs. We have enabled versioning on the S3 bucket so that CloudFront can serve the older version in case of rollback, but it’s still unable to fetch files until we redeploy.

customer’s code:

  new s3deploy.BucketDeployment(this, 'DeployWithInvalidation', {
    sources: [s3deploy.Source.asset(`../packages/dist`)],
    destinationBucket: bucket,
    distribution,
    distributionPaths: [`/*`],
    retainOnDelete: false,
    prune: false,
  });

This deploys the files to the S3 bucket and creates a CloudFront invalidation, which is when the stack fails on the waiter error.

@otaviomacedo Did you ever get an update from them? Just ran into this (also once at deploy, once at rollback), and it’s a major PITA.

We have evidence that some requests failed even after six retries, during the peak. We are working on improving this, but there is no quick fix for this and we are expecting it will get better by the end of Q1 2022.

We no longer experience this issue after increasing the memory limit of the bucket deployment.

new BucketDeployment(this, 'website-deployment', {
  ...config,
  memoryLimit: 2048
})

The default memory limit is 128 MiB. (docs)

Encountered the same issue; some CloudFormation event log timestamps:

| Timestamp | Logical ID | Status | Status reason |
| -- | -- | -- | -- |
| 2023-06-09 08:15:11 UTC-0700 | AgenticConsoleawsgammauseast1consolestackbucketdeploymentCustomResource9C0F1745 | UPDATE_FAILED | Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: 3b01a325-6c24-45f0-8f6c-86638f2e282b) |
| 2023-06-09 08:04:38 UTC-0700 | AgenticConsoleawsgammauseast1consolestackbucketdeploymentCustomResource9C0F1745 | UPDATE_IN_PROGRESS | - |

It took 10 minutes to fail the CDK stack, and the invalidation was created 1 minute after the failure:

| Invalidation ID | Status | Date |
| -- | -- | -- |
| IEKSZWOI5U3Q6GNNNQMQLJ11WH | Completed | June 9, 2023 at 3:16:20 PM UTC |

We are also experiencing this issue intermittently with our cloudfront invalidations (once every two weeks or so) 😞

In my case, two invalidations were kicked off and both were in progress for a long time, eventually timing out. (Screenshot: Screen Shot 2021-09-08 at 9 15 14 AM)

In this case, the most plausible hypothesis is that CloudFront is actually taking longer than 10 min to invalidate the files in some cases. We can try to reduce the chance of this happening by increasing the waiting time, but Lambda has a maximum timeout of 15 min. Beyond that, it’s not clear to me what else we can do. In any case, contributions are welcome!
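
For anyone experimenting with a longer wait in their own code, the polling budget can be pushed closer to that 15-minute ceiling by overriding the waiter’s WaiterConfig; a sketch with illustrative values and placeholder IDs:

    import boto3

    cloudfront = boto3.client('cloudfront')

    cloudfront.get_waiter('invalidation_completed').wait(
        DistributionId='EXAMPLE_DISTRIBUTION_ID',
        Id='EXAMPLE_INVALIDATION_ID',
        # Roughly 13 minutes of polling instead of the default ~10; values are illustrative.
        WaiterConfig={'Delay': 20, 'MaxAttempts': 40},
    )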

It has happened twice in recent days; next time it occurs I will try to confirm this. IIRC, the first time this happened I checked and saw that the invalidation event had occurred almost immediately, yet the waiter did not see that (that’s why I thought it might be a race condition). Will confirm though!
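
One way to confirm this out-of-band is to check the invalidation statuses directly with boto3; a sketch (the distribution ID is a placeholder):

    import boto3

    cloudfront = boto3.client('cloudfront')

    # List recent invalidations and their statuses to see whether one the waiter
    # gave up on had in fact completed.
    resp = cloudfront.list_invalidations(DistributionId='EXAMPLE_DISTRIBUTION_ID')
    for item in resp['InvalidationList'].get('Items', []):
        inv = cloudfront.get_invalidation(
            DistributionId='EXAMPLE_DISTRIBUTION_ID',
            Id=item['Id'],
        )['Invalidation']
        print(item['Id'], inv['Status'], inv['CreateTime'])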

I think the risk involved in this change is quite low. Please submit the PR and I’ll be happy to review it.