aws-cdk: (aws-ecs): hanging on deleting a stack with ASG capacity provider
What is the problem?
Deleting a stack that contains an AsgCapacityProvider hangs unexpectedly.
This is surprising because we had no such issue with the now-deprecated addCapacity, and there are no ECS tasks in the ASG when we delete the stack.
The behaviour seems to be caused by the default enableManagedTerminationProtection = true.
See the discussion in the original closed issue and my unaddressed comment: https://github.com/aws/aws-cdk/issues/14732#issuecomment-991402770.
Reproduction Steps
Please see https://github.com/aws/aws-cdk/issues/14732.
In short, try to delete a stack with an ECS cluster that uses an AsgCapacityProvider with default settings.
What did you expect to happen?
Either:
- CloudFormation does not hang but fails as fast as possible with an error message about the termination protection.
- The stack is successfully deleted as there are no running ECS tasks anymore.
What actually happened?
The CF stack got stuck in DELETE_IN_PROGRESS.
CDK CLI Version
2.3.0
Framework Version
2.3.0
Node.js Version
v16.8.0
OS
macOS
Language
Java
Language Version
11.0.8
Other information
Workaround
My current workaround: set AsgCapacityProvider enableManagedTerminationProtection = false.
Documentation questions/enhancement requests
From https://docs.aws.amazon.com/cdk/api/latest/docs/aws-ecs-readme.html (emphasis mine):
By default, an Auto Scaling Group Capacity Provider will manage the Auto Scaling Group’s size for you. It will also enable managed termination protection, in order to prevent EC2 Auto Scaling from terminating EC2 instances that have tasks running on them. If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.
- It’s not fully clear from the description that the flag simply blocks deletion of the ASG. I got the incorrect impression that it somehow detects that no ECS tasks are running and allows deletion in that case.
- What are the risks of turning this protection off? E.g. we don’t want ECS tasks to shut down at random times.
- Is it OK to set enableManagedTerminationProtection=false together with enableManagedScaling=true? It seems to work but goes against the documentation (“If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.”).
About this issue
- State: open
- Created 3 years ago
- Reactions: 9
- Comments: 16 (5 by maintainers)
Hi everyone, we have the same issue, not just when deleting a cluster but also when trying to update the AMI ID used for the cluster. Updating the MachineImage in the ASG leads to a new LaunchConfiguration and therefore a new Auto Scaling group. Is there any way around this? Or do we have to write a custom resource to enable and disable termination protection on demand?
The solution suggested by @gshpychka works great for us. In our case, we were experiencing the same problem, not with a capacity provider but with a custom termination policy lambda.
Normally, the CDK wants to delete the ASG, which triggers a scale-in that waits for instances to terminate, but while that happens the CDK is dismantling the roles and permissions of the custom termination policy lambda, so it can no longer tell the ASG that any instances are safe to terminate.
In this case you can create the custom resource, then make it depend on the ASG. That forces your CR to be deleted before the ASG, which force-deletes the ASG, preventing it from calling the custom termination policy.
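The dependency direction can look backwards at first, so here is a minimal sketch of why it works. It is only a simplified model, assuming CloudFormation deletes resources in the reverse of their dependency order (dependents first); the resource names `AutoScalingGroup` and `AsgDestroyer` and the helper `deletion_order` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def deletion_order(depends_on):
    """Return resource names in the order a reverse-dependency teardown visits them.

    `depends_on` maps each resource to the set of resources it depends on.
    """
    # static_order() yields dependencies before their dependents (creation order).
    creation_order = list(TopologicalSorter(depends_on).static_order())
    # Teardown walks the same order backwards: dependents go first.
    return list(reversed(creation_order))


# Hypothetical resource names: the custom resource depends on the ASG.
depends_on = {
    "AutoScalingGroup": set(),
    "AsgDestroyer": {"AutoScalingGroup"},
}

print(deletion_order(depends_on))  # ['AsgDestroyer', 'AutoScalingGroup']
```

Because the custom resource is the dependent, its delete handler runs while the ASG still exists, which is the moment it can force-delete the group.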
@ryparker I think in “Related to but does not fix: https://github.com/aws/aws-cdk/issues/18179” the bot may have captured “fix: https://github.com/aws/aws-cdk/issues/18179” ^^ Issue should probably be reopened.
Hey all, I’ve created a reference CloudFormation template that demonstrates how to avoid this issue. The end-to-end solution for the capacity provider with working teardown can be found here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling
You can also refer directly to the sample code for the Lambda function here: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123
In short, this solution implements a custom ASG destroyer resource, which is used to force kill the ASG so that it does not block the CloudFormation stack teardown.
Yes, the order is not the issue. The issue is that CloudFormation doesn’t force-delete the ASG, so it fails to delete it if there are instances protected from scale-in. The custom resource force-deletes the ASG, terminating all instances, including protected ones.
My solution doesn’t require retrying the deletion; it works in a single pass.
The ASG can’t be removed if it’s still in use by a capacity provider attached to a cluster. The service gets deleted, then the capacity provider, then the association, then the CR force-deletes the ASG.
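For anyone wondering what such a force-deleting custom resource might look like, below is a minimal sketch of a Lambda handler in Python. It is not the code from the linked pattern or from the solution discussed in this thread. The property name `AutoScalingGroupName`, the function names, and the fallback physical ID `asg-destroyer` are assumptions made for illustration. The boto3 call `delete_auto_scaling_group` with `ForceDelete=True` is the actual Auto Scaling API.

```python
import json
import urllib.request


def force_delete_asg(event, autoscaling):
    """Handle one custom-resource event; return (status, reason) for CloudFormation."""
    # Only act when the custom resource itself is being deleted.
    if event["RequestType"] != "Delete":
        return "SUCCESS", "Nothing to do on create or update"

    # "AutoScalingGroupName" is a hypothetical property name chosen for this sketch.
    name = event["ResourceProperties"]["AutoScalingGroupName"]
    try:
        # ForceDelete=True deletes the group together with its instances instead of
        # waiting for them to terminate first (per this thread, protected ones too).
        autoscaling.delete_auto_scaling_group(
            AutoScalingGroupName=name, ForceDelete=True
        )
    except Exception as error:  # report the failure rather than leave the stack hanging
        return "FAILED", str(error)
    return "SUCCESS", f"Force-deleted {name}"


def handler(event, context):
    """Lambda entry point: run the logic, then report the result to CloudFormation."""
    import boto3  # imported lazily so force_delete_asg can be exercised without AWS

    status, reason = force_delete_asg(event, boto3.client("autoscaling"))
    body = json.dumps({
        "Status": status,
        "Reason": reason,
        "PhysicalResourceId": event.get("PhysicalResourceId", "asg-destroyer"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }).encode()
    # CloudFormation waits for a PUT to the pre-signed ResponseURL.
    request = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        headers={"Content-Type": ""},
    )
    urllib.request.urlopen(request)
```

Passing the client into `force_delete_asg` keeps the deletion logic separate from the Lambda plumbing, so it can be checked with a stub client before deploying. The function's role would also need permission for `autoscaling:DeleteAutoScalingGroup`.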