terraform-provider-aws: aws_ecs_cluster with capacity_providers cannot be destroyed

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Relates to #5278, #11351, #11531, #22672, #22754

Maintainer Note

  • Fixing this problem involves several pieces:
    1. Creating a new resource that avoids the spurious capacity provider dependency chain (and allows associating capacity providers with existing clusters), which is #22672
    2. Deprecating the capacity_providers and default_capacity_provider_strategy arguments of aws_ecs_cluster (#22754)
    3. Removing the capacity_providers and default_capacity_provider_strategy arguments from aws_ecs_cluster, which is a breaking change
  • While the complete solution includes a breaking change, that doesn’t prevent us from moving forward with steps 1 and 2 above (v4.0) and then keeping step 3 in mind for v5.0.

Terraform Version

Terraform v0.12.18

  • provider.aws v2.43.0

Affected Resource(s)

  • aws_ecs_cluster
  • aws_ecs_capacity_provider
  • aws_autoscaling_group

Terraform Configuration Files

resource "aws_ecs_cluster" "indestructable" { 
  name = "show_tf_cp_flaw"

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }
}

resource "aws_ecs_capacity_provider" "cp" {
  name = "show_tf_cp_flaw"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.asg.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 80
    }
  }
}

resource "aws_autoscaling_group" "asg" {
  min_size = 2
  ....
}

Expected Behavior

terraform destroy should be able to destroy an aws_ecs_cluster which has capacity_providers set.

Actual Behavior

Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

The problem is that the new capacity_providers argument on aws_ecs_cluster introduces a new dependency chain: aws_ecs_cluster -> aws_ecs_capacity_provider -> aws_autoscaling_group.

This causes Terraform to destroy the ECS cluster before the autoscaling group, which is the wrong way around: the autoscaling group must be destroyed first, because the cluster can only be deleted once it contains zero container instances.

A possible solution may be to introduce a new resource type representing the attachment of a capacity provider to a cluster (inspired by aws_iam_role_policy_attachment which is the attachment of an IAM policy to a role).

This would allow the following dependency graph, which would work beautifully: aws_ecs_capacity_provider_cluster_attachment depends on both aws_ecs_cluster and aws_ecs_capacity_provider; aws_ecs_capacity_provider depends on aws_autoscaling_group, which depends on aws_launch_template, which in turn depends on aws_ecs_cluster (e.g. via the user_data property, which needs to set the ECS_CLUSTER environment variable to the name of the cluster).
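
For illustration, such an attachment resource might be used roughly like this (the resource name and arguments below are hypothetical, modelled on aws_iam_role_policy_attachment; they are not an existing provider API):

resource "aws_ecs_capacity_provider_cluster_attachment" "attachment" {
  # Hypothetical resource; the name and both arguments are illustrative only.
  cluster           = aws_ecs_cluster.cluster.name
  capacity_provider = aws_ecs_capacity_provider.cp.name
}

With this shape, the cluster no longer depends on the capacity provider, so Terraform can destroy the attachment first, drain the autoscaling group, and delete the cluster last.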

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

References

  • The problematic capacity_providers argument on aws_ecs_cluster was added recently in #11150

  • Using aws_ecs_capacity_provider with managed_termination_protection = "ENABLED" requires that the aws_autoscaling_group has protect_from_scale_in enabled, which has a separate issue with destroy: #5278
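
For context, a minimal sketch of that combination (both argument names exist in the provider; the remaining required aws_autoscaling_group arguments are elided):

resource "aws_autoscaling_group" "asg" {
  min_size = 2

  # Required whenever the capacity provider sets
  # managed_termination_protection = "ENABLED".
  protect_from_scale_in = true

  # ... (other required arguments elided)
}

resource "aws_ecs_capacity_provider" "cp" {
  name = "example"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.asg.arn
    managed_termination_protection = "ENABLED"
  }
}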

Most upvoted comments

Meanwhile, here is a nasty workaround using a destroy provisioner that worked for me, allowing the aws_ecs_cluster to be destroyed:

resource "aws_ecs_cluster" "cluster" {
  name = local.cluster_name

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }

  # We need to terminate all instances before the cluster can be destroyed.
  # (Terraform would handle this automatically if the autoscaling group depended
  #  on the cluster, but we need to have the dependency in the reverse
  #  direction due to the capacity_providers field above).
  provisioner "local-exec" {
    when = destroy

    command = <<CMD
      # Get the list of capacity providers associated with this cluster
      CAP_PROVS="$(aws ecs describe-clusters --clusters "${self.arn}" \
        --query 'clusters[*].capacityProviders[*]' --output text)"

      # Now get the list of autoscaling groups from those capacity providers
      ASG_ARNS="$(aws ecs describe-capacity-providers \
        --capacity-providers "$CAP_PROVS" \
        --query 'capacityProviders[*].autoScalingGroupProvider.autoScalingGroupArn' \
        --output text)"

      if [ -n "$ASG_ARNS" ] && [ "$ASG_ARNS" != "None" ]
      then
        for ASG_ARN in $ASG_ARNS
        do
          ASG_NAME=$(echo $ASG_ARN | cut -d/ -f2-)

          # Set the autoscaling group size to zero
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name "$ASG_NAME" \
            --min-size 0 --max-size 0 --desired-capacity 0

          # Remove scale-in protection from all instances in the asg
          INSTANCES="$(aws autoscaling describe-auto-scaling-groups \
            --auto-scaling-group-names "$ASG_NAME" \
            --query 'AutoScalingGroups[*].Instances[*].InstanceId' \
            --output text)"
          aws autoscaling set-instance-protection --instance-ids $INSTANCES \
            --auto-scaling-group-name "$ASG_NAME" \
            --no-protected-from-scale-in
        done
      fi
CMD
  }
}
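
(One caveat: since Terraform 0.13, destroy-time provisioners may only reference the resource’s own attributes via self, which is why the script discovers the capacity providers and autoscaling groups through the AWS CLI instead of referencing aws_autoscaling_group.asg directly.)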

Hi all 👋 Just letting you know that this issue is featured on this quarter’s roadmap. If a PR exists to close the issue, a maintainer will review and either make changes directly or work with the original author to get the contribution merged. If you have written a PR to resolve the issue, please ensure the “Allow edits from maintainers” box is checked. Thanks for your patience; we are looking forward to getting this merged soon!

Thank you for the input on this issue! We are carefully considering work on this in the near future. (No guarantees on an exact date.) In order to facilitate the implementation, I’ve outlined some thoughts below.

After looking through this, I agree with the suggested way forward:

  • Create a new resource (something along the lines of aws_ecs_capacity_provider_cluster_attachment)
  • The new resource could overlap existing functionality. However, removing the capacity_providers and default_capacity_provider_strategy arguments would be a breaking change that would need to wait until a major release. Please comment below on whether these should ideally stay or go.
  • It would be highly desirable for this implementation to also fix #5278, #11351, and #11531, if possible.

Please provide any feedback, yay or nay.

Ah, turns out this is precisely the issue described in https://github.com/hashicorp/terraform-provider-aws/issues/11531. In short, the design of capacity providers is broken in Terraform right now, as it creates an invalid dependency chain: aws_ecs_cluster -> aws_ecs_capacity_provider -> aws_autoscaling_group. This chain isn’t valid, because on destroy, Terraform will try to delete aws_ecs_cluster first, but it can’t, because the aws_autoscaling_group hasn’t been deleted. So we need an aws_ecs_capacity_provider_attachment to use capacity providers without such a dependency chain.

@edmundcraske-bjss Yes, you are absolutely correct! Thank you.

At this point, the best way forward looks like #22672. That will implement the OP’s recommended attachment-resource solution (though named aws_ecs_cluster_capacity_providers instead). It will solve the main problem here: using capacity providers while still being able to destroy the cluster. It will also solve the problem of not being able to associate capacity providers with existing clusters.
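
For reference, usage of that resource would look roughly like this (a sketch based on #22672 as proposed; details may change before release):

resource "aws_ecs_cluster_capacity_providers" "example" {
  cluster_name = aws_ecs_cluster.cluster.name

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    base              = 1
    weight            = 100
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }
}

Because the association lives in its own resource, the cluster itself no longer depends on the capacity provider or the autoscaling group, restoring a destroy order that works.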

Is it possible that your test did not hit the issue because the EC2 instances were not actually registering with the ECS cluster?

Same as #4852. Someone should consolidate all of these; this is really noisy.

Having this issue too. On destroy, I get the error:

Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

This started around Terraform 0.12, and we added retries to work around it. We’re now upgrading to 0.15, and the retries no longer seem to help, so this is a blocker.

Any news? I’m still waiting for this issue to be fixed.

Any updates here? This is terribly annoying to deal with. (The workaround does not work in my particular case.)