terraform-provider-aws: aws_ecs_cluster with capacity_providers cannot be destroyed

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Relates to #5278, #11351, #11531, #22672, #22754

Maintainer Note

  • Fixing this problem involves several pieces:
    1. Creating a new resource that avoids the spurious capacity provider dependency chain (and allows associating capacity providers with existing clusters), which is #22672
    2. Deprecating the capacity_providers and default_capacity_provider_strategy arguments of aws_ecs_cluster (#22754)
    3. Removing the capacity_providers and default_capacity_provider_strategy arguments from aws_ecs_cluster, which is a breaking change
  • While the complete solution includes a breaking change, that doesn’t prevent us from moving forward with steps 1 and 2 above (v4.0) and then keeping step 3 in mind for v5.0.

Terraform Version

Terraform v0.12.18

  • provider.aws v2.43.0

Affected Resource(s)

  • aws_ecs_cluster
  • aws_ecs_capacity_provider
  • aws_autoscaling_group

Terraform Configuration Files

resource "aws_ecs_cluster" "indestructable" { 
  name = "show_tf_cp_flaw"

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }
}

resource "aws_ecs_capacity_provider" "cp" {
  name = "show_tf_cp_flaw"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.asg.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 80
    }
  }
}

resource "aws_autoscaling_group" "asg" {
  min_size = 2
  ....
}

Expected Behavior

terraform destroy should be able to destroy an aws_ecs_cluster which has capacity_providers set.

Actual Behavior

Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

The problem is that the new capacity_providers argument on aws_ecs_cluster introduces a new dependency chain: aws_ecs_cluster -> aws_ecs_capacity_provider -> aws_autoscaling_group.

This causes Terraform to destroy the ECS cluster before the autoscaling group, which is the wrong way around: the autoscaling group must be destroyed first, because the cluster can only be deleted once it contains zero container instances.

A possible solution may be to introduce a new resource type representing the attachment of a capacity provider to a cluster (inspired by aws_iam_role_policy_attachment which is the attachment of an IAM policy to a role).

This would allow the following dependency graph, which would work beautifully: aws_ecs_capacity_provider_cluster_attachment depends on both aws_ecs_cluster and aws_ecs_capacity_provider; aws_ecs_capacity_provider depends on aws_autoscaling_group, which depends on aws_launch_template, which in turn depends on aws_ecs_cluster (e.g. via the user_data property, which needs to set the ECS_CLUSTER environment variable to the name of the cluster).
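
For illustration, such an attachment resource might be used roughly like this (the resource name and arguments below are hypothetical, modelled on aws_iam_role_policy_attachment; they are not an existing provider API):

resource "aws_ecs_capacity_provider_cluster_attachment" "attachment" {
  # Hypothetical resource; the name and both arguments are illustrative only.
  cluster           = aws_ecs_cluster.cluster.name
  capacity_provider = aws_ecs_capacity_provider.cp.name
}

With this shape, the cluster no longer depends on the capacity provider, so Terraform can destroy the attachment first, drain the autoscaling group, and delete the cluster last.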

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

References

  • The problematic capacity_providers argument on aws_ecs_cluster was added recently in #11150

  • Using aws_ecs_capacity_provider with managed_termination_protection = "ENABLED" requires that the aws_autoscaling_group has protect_from_scale_in enabled, which has a separate issue with destroy: #5278
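
For context, a minimal sketch of that combination (both argument names exist in the provider; the remaining required aws_autoscaling_group arguments are elided):

resource "aws_autoscaling_group" "asg" {
  min_size = 2

  # Required whenever the capacity provider sets
  # managed_termination_protection = "ENABLED".
  protect_from_scale_in = true

  # ... (other required arguments elided)
}

resource "aws_ecs_capacity_provider" "cp" {
  name = "example"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.asg.arn
    managed_termination_protection = "ENABLED"
  }
}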

Most upvoted comments

Meanwhile, here is a nasty workaround using a destroy provisioner that worked for me, allowing the aws_ecs_cluster to be destroyed:

resource "aws_ecs_cluster" "cluster" {
  name = local.cluster_name

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }

  # We need to terminate all instances before the cluster can be destroyed.
  # (Terraform would handle this automatically if the autoscaling group depended
  #  on the cluster, but we need to have the dependency in the reverse
  #  direction due to the capacity_providers field above).
  provisioner "local-exec" {
    when = destroy

    command = <<CMD
      # Get the list of capacity providers associated with this cluster
      CAP_PROVS="$(aws ecs describe-clusters --clusters "${self.arn}" \
        --query 'clusters[*].capacityProviders[*]' --output text)"

      # Now get the list of autoscaling groups from those capacity providers
      ASG_ARNS="$(aws ecs describe-capacity-providers \
        --capacity-providers "$CAP_PROVS" \
        --query 'capacityProviders[*].autoScalingGroupProvider.autoScalingGroupArn' \
        --output text)"

      if [ -n "$ASG_ARNS" ] && [ "$ASG_ARNS" != "None" ]
      then
        for ASG_ARN in $ASG_ARNS
        do
          ASG_NAME=$(echo $ASG_ARN | cut -d/ -f2-)

          # Set the autoscaling group size to zero
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name "$ASG_NAME" \
            --min-size 0 --max-size 0 --desired-capacity 0

          # Remove scale-in protection from all instances in the asg
          INSTANCES="$(aws autoscaling describe-auto-scaling-groups \
            --auto-scaling-group-names "$ASG_NAME" \
            --query 'AutoScalingGroups[*].Instances[*].InstanceId' \
            --output text)"
          aws autoscaling set-instance-protection --instance-ids $INSTANCES \
            --auto-scaling-group-name "$ASG_NAME" \
            --no-protected-from-scale-in
        done
      fi
CMD
  }
}
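
(One caveat: since Terraform 0.13, destroy-time provisioners may only reference the resource’s own attributes via self, which is why the script discovers the capacity providers and autoscaling groups through the AWS CLI instead of referencing aws_autoscaling_group.asg directly.)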

Hi all 👋 Just letting you know that this issue is featured on this quarter’s roadmap. If a PR exists to close the issue, a maintainer will review and either make changes directly or work with the original author to get the contribution merged. If you have written a PR to resolve the issue, please ensure the “Allow edits from maintainers” box is checked. Thanks for your patience; we are looking forward to getting this merged soon!

Thank you for the input on this issue! We are carefully considering work on this in the near future. (No guarantees on an exact date.) In order to facilitate the implementation, I’ve outlined some thoughts below.

After looking through this, I agree with the suggested way forward:

  • Create a new resource (something along the lines of aws_ecs_capacity_provider_cluster_attachment)
  • The new resource could overlap existing functionality. However, removing the capacity_providers and default_capacity_provider_strategy arguments would be a breaking change that would need to wait until a major release. Please comment below on whether these should ideally stay or go.
  • It would be highly desirable for this implementation to also fix #5278, #11351, and #11531, if possible.

Please provide any feedback, yay or nay.

Ah, turns out this is precisely the issue described in https://github.com/hashicorp/terraform-provider-aws/issues/11531. In short, the design of capacity providers is broken in Terraform right now, as it creates an invalid dependency chain: aws_ecs_cluster -> aws_ecs_capacity_provider -> aws_autoscaling_group. This chain isn’t valid, because on destroy, Terraform will try to delete aws_ecs_cluster first, but it can’t, because the aws_autoscaling_group hasn’t been deleted. So we need an aws_ecs_capacity_provider_attachment to use capacity providers without such a dependency chain.

@edmundcraske-bjss Yes, you are absolutely correct! Thank you.

At this point, the best way forward looks like #22672. That will implement the OP’s recommended attachment-resource solution (though named aws_ecs_cluster_capacity_providers instead). It will solve the main problem here: using capacity providers while still being able to destroy the cluster. It will also solve the problem of not being able to associate capacity providers with existing clusters.
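
For reference, usage of that resource would look roughly like this (a sketch based on #22672 as proposed; details may change before release):

resource "aws_ecs_cluster_capacity_providers" "example" {
  cluster_name = aws_ecs_cluster.cluster.name

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    base              = 1
    weight            = 100
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }
}

Because the association lives in its own resource, the cluster itself no longer depends on the capacity provider or the autoscaling group, restoring a destroy order that works.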

Is it possible that your test did not hit the issue because the EC2 instances were not actually registering with the ECS cluster?

Same as #4852. Someone should consolidate all of these; this is really noisy.

Having this issue too. On destroy, I get the error:

Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

This started around Terraform 0.12, and we added retries to work around it. We’re now upgrading to 0.15, and the retries no longer seem to help, so this is a blocker.

Any news? I’m still waiting for this issue to be fixed.

Any updates here? This is terribly annoying to deal with. (The workaround does not work in my particular case.)