terraform-provider-google: instance_group_manager marked tainted if healthcheck failing

Terraform Version

Terraform v1.0.3

Affected Resource(s)

  • google_compute_region_instance_group_manager
  • google_compute_instance_group_manager

Terraform Configuration Files

I’m deploying a typical MIG, but with wait_for_instances = true:

resource "google_compute_instance_template" "my_app" {
  project = google_compute_subnetwork.primary_region.project
  region  = google_compute_subnetwork.primary_region.region

  name_prefix  = "my-app-"
  machine_type = "n1-standard-1"

  disk {
    boot         = true
    source_image = "cos-cloud/cos-stable"
    disk_type    = "pd-ssd"
    disk_size_gb = 40
  }

  network_interface {
    subnetwork = google_compute_subnetwork.primary_region.self_link
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_health_check" "my_app" {
  project = google_compute_instance_template.my_app.project
  name    = "my-app"

  check_interval_sec  = 10
  timeout_sec         = 5
  unhealthy_threshold = 5

  http_health_check {
    port         = 80
    request_path = "/-/health"
  }
}

resource "google_compute_region_instance_group_manager" "my_app" {
  project = google_compute_instance_template.my_app.project
  region  = google_compute_instance_template.my_app.region

  name               = "my-app"
  base_instance_name = "my-app"

  version {
    instance_template = google_compute_instance_template.my_app.id
  }

  target_size        = 1
  wait_for_instances = true

  named_port {
    name = "http"
    port = 80
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }
}

Debug Output

https://gist.github.com/dv-stephen/610fafba3eddd0de9e941ee6fa7e13bd

Expected Behavior

If there is an issue with the MIG, such as a bad health check or a faulty instance configuration, that prevents the MIG from reaching a healthy state, Terraform should still be able to refresh the resource and allow the MIG to be fixed through code changes.

Actual Behavior

Terraform hangs during the refresh phase of the MIG resource, waiting for the MIG to become healthy, which never happens. The only way forward is manual intervention, which breaks a GitOps model where all changes are made through code.
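
Until the provider is changed, one possible workaround (my own suggestion, not something confirmed in this thread) is to temporarily disable the wait so the refresh can complete, apply the fix, and then restore the original setting:

# Hypothetical temporary change to unblock a stuck refresh: with the wait
# disabled, refresh no longer blocks on the MIG reaching a healthy state.
resource "google_compute_region_instance_group_manager" "my_app" {
  # ... all other arguments unchanged from the configuration above ...

  wait_for_instances = false
}

Once the broken health check or instance configuration is corrected and the MIG stabilizes, wait_for_instances can be set back to true.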

Steps to Reproduce

  1. Deploy a MIG with wait_for_instances = true and a health check that will fail
  2. The Terraform run will time out waiting for the MIG to become healthy
  3. Run terraform apply again, which will time out during the refresh phase
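
For step 1, a quick way to get a health check that can never pass (my own illustration, not taken from the reporter's setup) is to probe a port that nothing in the instance template serves:

resource "google_compute_health_check" "always_failing" {
  project = google_compute_instance_template.my_app.project
  name    = "my-app-failing"

  check_interval_sec  = 10
  timeout_sec         = 5
  unhealthy_threshold = 5

  http_health_check {
    # Nothing listens on this port, so instances never report healthy and
    # the apply/refresh hangs until it hits the timeout.
    port         = 81
    request_path = "/-/health"
  }
}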

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 23 (1 by maintainers)

Most upvoted comments

@dv-stephen: While https://github.com/hashicorp/terraform-provider-google/issues/9657 isn’t directly related (the problem there is that the user specified the wrong format and the API is behaving badly), you’re correct that Terraform is incorrectly sending requests to projects/projects/{{project}} despite the correct value being in use in your config.

I’ve spun out https://github.com/hashicorp/terraform-provider-google/issues/9722 to cover investigating that. We hadn’t noticed the issue because the API was behaving correctly despite that; I suspect a change to the client library we use is the root cause. That said, I don’t believe that error has an effect on the instance group manager behaviour here, so we can probably isolate the two discussions / fixes.

@dv-stephen agreed. We shouldn’t be doing this polling during read since it can result in a broken refresh. I’ll make the change to do this polling during create/update with an increased timeout.
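
If the polling does move to create/update, a natural way for users to accommodate a slow-to-stabilize MIG would be the resource's timeouts block; this is only a sketch with values I picked, not part of the maintainer's proposed change:

resource "google_compute_region_instance_group_manager" "my_app" {
  # ... arguments as in the configuration above ...

  wait_for_instances = true

  # With polling happening during create/update instead of read, these
  # timeouts bound how long Terraform waits for the MIG to stabilize.
  timeouts {
    create = "20m"
    update = "20m"
  }
}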