aws-cdk: (aws-ecs): hanging on deleting a stack with ASG capacity provider

What is the problem?

Deleting a stack that contains an AsgCapacityProvider hangs unexpectedly.

This is surprising: we had no such issue with the now-deprecated addCapacity, and no ECS tasks are running in the ASG when we delete the stack.

The behaviour seems to be caused by the default enableManagedTerminationProtection = true.

See the discussion in the original closed issue and my unaddressed comment: https://github.com/aws/aws-cdk/issues/14732#issuecomment-991402770.

Reproduction Steps

Please see https://github.com/aws/aws-cdk/issues/14732.

In short, try to delete a stack with an ECS cluster that uses the AsgCapacityProvider defaults.

What did you expect to happen?

Either:

CloudFormation does not hang but fails as fast as possible with an error message about the termination protection, or
the stack is deleted successfully, since there are no running ECS tasks anymore.

What actually happened?

The CF stack got stuck in DELETE_IN_PROGRESS.

CDK CLI Version

2.3.0

Framework Version

2.3.0

Node.js Version

v16.8.0

OS

macOS

Language

Java

Language Version

11.0.8

Other information

Workaround

My current workaround: set AsgCapacityProvider enableManagedTerminationProtection = false.
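
For TypeScript users, the workaround might look roughly like the fragment below. It is only a sketch and assumes CDK v2 (aws-cdk-lib); the stack name, construct IDs, and instance type are hypothetical placeholders, not taken from the affected project.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Construct } from 'constructs';

// Hypothetical stack used only to illustrate the workaround.
class ClusterStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    const autoScalingGroup = new autoscaling.AutoScalingGroup(this, 'Asg', {
      vpc,
      instanceType: new ec2.InstanceType('t3.micro'), // placeholder instance type
      machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
    });

    const capacityProvider = new ecs.AsgCapacityProvider(this, 'AsgCapacityProvider', {
      autoScalingGroup,
      // The workaround: opt out of the default (true) so that instances
      // are not protected from scale-in when the stack is deleted.
      enableManagedTerminationProtection: false,
    });
    cluster.addAsgCapacityProvider(capacityProvider);
  }
}
```

enableManagedScaling is left at its default here, which matches the combination asked about in the documentation questions below.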

Documentation questions/enhancement requests

From https://docs.aws.amazon.com/cdk/api/latest/docs/aws-ecs-readme.html (emphasis mine):

By default, an Auto Scaling Group Capacity Provider will manage the Auto Scaling Group’s size for you. It will also enable managed termination protection, in order to prevent EC2 Auto Scaling from terminating EC2 instances that have tasks running on them. If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.

It’s not fully clear from the description that the flag simply prevents deletion of the ASG. I got the incorrect impression that it somehow detects that no ECS tasks are running and allows deletion in that case.
What are the risks of turning this protection off? E.g. we don’t want ECS tasks to shut down at random times.
Is it OK to set enableManagedTerminationProtection=false + enableManagedScaling=true? It seems to work but contradicts the documentation (“If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.”).

About this issue

State: open
Created 3 years ago
Reactions: 9
Comments: 16 (5 by maintainers)

Most upvoted comments

Hi everyone, we have the same issue, not just when deleting a cluster but also when updating the AMI ID used for the cluster. Updating the MachineImage in the ASG leads to a new LaunchConfiguration and therefore a new Auto Scaling group. Is there any way around this? Or do we have to write a custom resource to enable and disable termination protection on demand?

fschollmeyer on Feb 14, 2022

The solution suggested by @gshpychka works great for us. In our case, we were experiencing the same problem, not with a capacity provider but with a custom termination policy lambda.

Normally, the CDK wants to delete the ASG, which triggers a scale-in that waits for instances to terminate, but while that happens the CDK is dismantling the roles and permissions of the custom termination policy lambda, so it can no longer tell the ASG that any instances are safe to terminate.

In this case you can create the custom resource, then make it depend on the ASG. That forces your CR to be deleted before the ASG, which force-deletes the ASG, preventing it from calling the custom termination policy.

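    // Assumes CDK v2: import * as cr from 'aws-cdk-lib/custom-resources';
    // Custom resource with no create/update action; on stack deletion it
    // force-deletes the ASG.
    const asgForceDelete = new cr.AwsCustomResource(this, 'AsgForceDelete', {
      onDelete: {
        service: 'AutoScaling',
        action: 'deleteAutoScalingGroup',
        parameters: {
          AutoScalingGroupName: this.autoScalingGroup.autoScalingGroupName,
          // Delete the group together with its instances.
          ForceDelete: true
        }
      },
      policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
        resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE
      })
    });
    // The CR depends on the ASG, so the CR is deleted (and onDelete runs) first.
    asgForceDelete.node.addDependency(this.autoScalingGroup);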

elliot-nelson on Jun 9, 2022

@ryparker I think in “Related to but does not fix: https://github.com/aws/aws-cdk/issues/18179” the bot may have captured “fix: https://github.com/aws/aws-cdk/issues/18179” ^^ Issue should probably be reopened.

Ten0 on Jan 18, 2023

Hey all, I’ve created a reference CloudFormation template that demonstrates how to avoid this issue. The end-to-end solution for the capacity provider with working teardown can be found here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling

You can also refer directly to the sample code for the Lambda function here: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123

In short, this solution implements a custom ASG destroyer resource, which is used to force kill the ASG so that it does not block the CloudFormation stack teardown.

nathanpeck on Jan 2, 2024

Yes, the order is not the issue. The issue is that CloudFormation doesn’t force-delete the ASG, so it fails to delete it if there are instances protected from scale-in. The custom resource force-deletes the ASG, terminating all instances, including protected ones.

My solution doesn’t require retrying the deletion, it works in a single pass.

gshpychka on Mar 18, 2022

A solution that seems to work for me is to create a custom resource that calls deleteAutoScalingGroup on delete (noop on create or on update), and make the capacity provider depend on the custom resource.

Shouldn’t it be the opposite, that resource depending on the capacity provider association, so that the ASG gets removed before the capacity provider is de-associated from the cluster?

The ASG can’t be removed if it’s still in use by a capacity provider attached to a cluster. The service gets deleted, then the capacity provider, then the association, then the CR force-deletes the ASG.
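
The ordering above follows from a general rule of stack teardown: a resource is deleted only after everything that depends on it is gone. A toy model of that rule is sketched below. The resource names and the dependency graph are hypothetical simplifications chosen for illustration, not taken from a real template.

```typescript
// Maps each resource to the resources it depends on.
type DependencyGraph = { [resource: string]: string[] };

// Returns a deletion order in which a resource is removed only after
// every resource that depends on it has been removed.
function deletionOrder(dependsOn: DependencyGraph): string[] {
  let remaining = Object.keys(dependsOn);
  const order: string[] = [];
  while (remaining.length > 0) {
    // A resource is deletable once no remaining resource depends on it.
    const deletable = remaining.filter((candidate) =>
      remaining.every((other) => dependsOn[other].indexOf(candidate) === -1)
    );
    if (deletable.length === 0) {
      throw new Error('dependency cycle');
    }
    deletable.sort().forEach((resource) => order.push(resource));
    remaining = remaining.filter((resource) => deletable.indexOf(resource) === -1);
  }
  return order;
}

// Hypothetical graph: the capacity provider depends on the force-delete
// custom resource, which in turn references the ASG by name.
const resources: DependencyGraph = {
  Service: ['CapacityProvider'],
  CapacityProvider: ['AutoScalingGroup', 'AsgForceDelete'],
  AsgForceDelete: ['AutoScalingGroup'],
  AutoScalingGroup: [],
};

const order = deletionOrder(resources);
console.log(order.join(' -> '));
// prints "Service -> CapacityProvider -> AsgForceDelete -> AutoScalingGroup"
```

In this model the custom resource's delete handler runs after the capacity provider is gone and just before the ASG itself would be deleted.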

gshpychka on Mar 18, 2022