terraform-provider-google: Terraform state leaks when the GCE Operation API(s) rate limit is exceeded

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

$ terraform --version
Terraform v0.14.4
+ provider registry.terraform.io/hashicorp/google v3.59.0

Affected Resource(s)

I reproduced this issue with google_compute_firewall, but it applies to all resources that use the Operations API (which I believe is most of them).

  • google_compute_firewall

Terraform Configuration Files

provider "google" {
  credentials = file("SERVICE_ACCOUNT.json")
  project     = var.PROJECT_ID
  region      = "europe-west1"
}

resource "google_compute_network" "network" {
  name                    = "foo"
  auto_create_subnetworks = "false"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_subnetwork" "subnetwork-nodes" {
  name          = "foo-nodes"
  ip_cidr_range = "10.250.0.0/16"
  network       = google_compute_network.network.name
  region        = "europe-west1"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_firewall" "rule-allow-internal-access" {
  name          = "allow-internal-access"
  network       = "foo"
  source_ranges = ["10.250.0.0/16"]

  allow {
    protocol = "icmp"
  }

  allow {
    protocol = "ipip"
  }

  allow {
    protocol = "tcp"
    ports    = ["1-65535"]
  }

  allow {
    protocol = "udp"
    ports    = ["1-65535"]
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

Debug Output

Not applicable

Panic Output

Not applicable

Expected Behavior

The Operation API has the following API rate limits:

Operation read requests (limits for OperationsService.Get methods):

  • Rate per project: 20 requests/second
  • Rate per user: 20 requests/second

See GCE API rate limits for reference.

So the general flow in terraform-provider-google, and in any tool using the SDK, is:

  1. Send a request for resource creation to GCE
  2. Fetch the operation ID from the response
  3. Poll the Operation API until the operation completes (usually many calls to the Operation API; a sketch follows below)
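
For illustration, a minimal bash sketch of step 3 (the provider does this in Go via the SDK; PROJECT_ID, OPERATION_ID, and BEARER_TOKEN are placeholders set as in the reproduction script further below, and jq is assumed to be installed):

while true; do
  # One Operation read per iteration; each one counts against the read quota.
  STATUS="$(curl -s \
    -H "Authorization: Bearer ${BEARER_TOKEN}" \
    "https://compute.googleapis.com/compute/v1/projects/${PROJECT_ID}/global/operations/${OPERATION_ID}" \
    | jq -r '.status')"
  [ "${STATUS}" = "DONE" ] && break
  sleep 2
done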

As you can see from the flow, the Operation API is called far more frequently than the individual resource APIs: for example, a creation that takes a minute, polled every two seconds, already issues about 30 Operation reads for a single resource write.

The expected behaviour is for terraform-provider-google to be resilient to Operation API rate limits and not to leak state when the very first call to the Operation API fails.

As a side note, we use terraform in a fully automated manner (with the -auto-approve flag), so whenever such a consistency issue occurs, our automation cannot recover from it. As you will see in the Actual Behavior section, the first terraform apply run fails, and every subsequent terraform apply fails with a resource already exists error. There is no way to recover from this without manual intervention.

Actual Behavior

Currently terraform-provider-google is not resilient when the Operation API rate limit is exceeded. It sends the POST request for resource creation and then fails right away if the next poll of the Operation API is rate limited:

$ terraform apply -var-file=var.tfvars --auto-approve
google_compute_network.network: Refreshing state... [id=projects/foo-240012/global/networks/foo]
google_compute_subnetwork.subnetwork-nodes: Refreshing state... [id=projects/foo-240012/regions/europe-west1/subnetworks/foo-nodes]
google_compute_firewall.rule-allow-internal-access: Creating...

Error: Error waiting to create Firewall: Error waiting for Creating Firewall: error while retrieving operation: googleapi: Error 403: Quota exceeded for quota group 'OperationReadGroup' and limit 'Operation read requests per user per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:123'., rateLimitExceeded

The resource is actually created, but it is never saved in terraform.tfstate.
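
You can verify the leak manually: the firewall exists in GCE but is absent from the state (illustrative commands; the names match the config above):

$ gcloud compute firewall-rules describe allow-internal-access --project foo-240012
$ terraform state list | grep google_compute_firewall
(no output)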

A subsequent run to apply the same config fails with:

$ terraform apply -var-file=var.tfvars --auto-approve
google_compute_network.network: Refreshing state... [id=projects/foo-240012/global/networks/foo]
google_compute_subnetwork.subnetwork-nodes: Refreshing state... [id=projects/foo-240012/regions/europe-west1/subnetworks/foo-nodes]
google_compute_firewall.rule-allow-internal-access: Creating...

Error: Error creating Firewall: googleapi: Error 409: The resource 'projects/foo-240012/global/firewalls/allow-internal-access' already exists, alreadyExists

Steps to Reproduce

  1. Initially apply the following config:
provider "google" {
  credentials = file("SERVICE_ACCOUNT.json")
  project     = var.PROJECT_ID
  region      = "europe-west1"
}

resource "google_compute_network" "network" {
  name                    = "foo"
  auto_create_subnetworks = "false"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_subnetwork" "subnetwork-nodes" {
  name          = "foo-nodes"
  ip_cidr_range = "10.250.0.0/16"
  network       = google_compute_network.network.name
  region        = "europe-west1"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}
  2. Add only the google_compute_firewall resource to main.tf (the + lines below):
provider "google" {
  credentials = file("SERVICE_ACCOUNT.json")
  project     = var.PROJECT_ID
  region      = "europe-west1"
}

resource "google_compute_network" "network" {
  name                    = "foo"
  auto_create_subnetworks = "false"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_subnetwork" "subnetwork-nodes" {
  name          = "foo-nodes"
  ip_cidr_range = "10.250.0.0/16"
  network       = google_compute_network.network.name
  region        = "europe-west1"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

+resource "google_compute_firewall" "rule-allow-internal-access" {
+  name          = "allow-internal-access"
+  network       = "foo"
+  source_ranges = ["10.250.0.0/16"]
+
+  allow {
+    protocol = "icmp"
+  }
+
+  allow {
+    protocol = "ipip"
+  }
+
+  allow {
+    protocol = "tcp"
+    ports    = ["1-65535"]
+  }
+
+  allow {
+    protocol = "udp"
+    ports    = ["1-65535"]
+  }
+
+  timeouts {
+    create = "5m"
+    update = "5m"
+    delete = "5m"
+  }
+}
  3. Start spamming the GCE Operation API to reproduce the rate limit exceeded error.

I used the following sample script, which spams the GCE Operation API in a while true loop.

#!/bin/bash

set -o errexit
set -o nounset
set -o pipefail

# Placeholders: fill in your project ID and any in-flight operation ID.
PROJECT_ID="<project-id>"
OPERATION_ID="<sample-operation-id>"

# Authenticate with the same service account that terraform uses.
export GOOGLE_APPLICATION_CREDENTIALS="SERVICE_ACCOUNT.json"
BEARER_TOKEN="$(gcloud auth application-default print-access-token)"

# Hammer the Operation API as fast as possible to exhaust the read quota.
while true; do
  curl -X GET \
    -H "Authorization: Bearer $BEARER_TOKEN" \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://compute.googleapis.com/compute/v1/projects/${PROJECT_ID}/global/operations/${OPERATION_ID}"
done

I had to run the script in roughly 10 parallel sessions before rate limit exceeded errors started to appear. Wait until the requests start failing with PERMISSION_DENIED:

{
  "error": {
    "code": 403,
    "message": "Quota exceeded for quota group 'OperationReadGroup' and limit 'Operation read requests per user per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:123'.",
    "errors": [
      {
        "message": "Quota exceeded for quota group 'OperationReadGroup' and limit 'Operation read requests per user per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:123'.",
        "domain": "usageLimits",
        "reason": "rateLimitExceeded"
      }
    ],
    "status": "PERMISSION_DENIED"
  }
}
  4. Run terraform apply to create the google_compute_firewall resource.

  5. Make sure that the terraform apply fails with a rate limit exceeded error for the Operation API:

$ terraform apply -var-file=creds.tfvars --auto-approve
google_compute_network.network: Refreshing state... [id=projects/foo-240012/global/networks/foo]
google_compute_subnetwork.subnetwork-nodes: Refreshing state... [id=projects/foo-240012/regions/europe-west1/subnetworks/foo-nodes]
google_compute_firewall.rule-allow-internal-access: Creating...

Error: Error waiting to create Firewall: Error waiting for Creating Firewall: error while retrieving operation: googleapi: Error 403: Quota exceeded for quota group 'OperationReadGroup' and limit 'Operation read requests per user per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:123'., rateLimitExceeded
  6. Run terraform apply once again.

  7. Make sure that the second run fails with a resource already exists error:

$ terraform apply -var-file=creds.tfvars --auto-approve
google_compute_network.network: Refreshing state... [id=projects/foo-240012/global/networks/foo]
google_compute_subnetwork.subnetwork-nodes: Refreshing state... [id=projects/foo-240012/regions/europe-west1/subnetworks/foo-nodes]
google_compute_firewall.rule-allow-internal-access: Creating...

Error: Error creating Firewall: googleapi: Error 409: The resource 'projects/foo-240012/global/firewalls/allow-internal-access' already exists, alreadyExists

The firewall resource is created but not saved in the terraform state because of the Operation API error. That is why every subsequent terraform apply fails with a resource already exists error.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 15

Most upvoted comments

I was OOO for several days and I am back today. Do you still need the terraform debug log from my side?

No, I was able to replicate!

It’s possible to add a retry for this operation read quota exceeded error, and I think it is the appropriate solution. Each subsequent retry will increase the wait time before trying again, which should allow the rate to drop. When this is added, I recommend continuing to monitor your resource timeouts to accommodate the longer operation waits.
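
In the same bash terms as the reproduction script, the proposed retry would look roughly like this (an illustrative sketch, not the provider's actual Go implementation; placeholders as before):

DELAY=1
while true; do
  # Capture the HTTP status so a rate limit error can be told apart
  # from a normal poll response.
  HTTP_CODE="$(curl -s -o /tmp/op.json -w '%{http_code}' \
    -H "Authorization: Bearer ${BEARER_TOKEN}" \
    "https://compute.googleapis.com/compute/v1/projects/${PROJECT_ID}/global/operations/${OPERATION_ID}")"
  if [ "${HTTP_CODE}" = "403" ] && grep -q rateLimitExceeded /tmp/op.json; then
    sleep "${DELAY}"        # back off instead of failing the whole apply
    DELAY=$((DELAY * 2))    # each retry waits longer than the last
    continue
  fi
  [ "$(jq -r '.status' /tmp/op.json)" = "DONE" ] && break
  sleep 2
done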

@ialidzhikov I’m sure you may already be aware, but terraform import offers an immediate recovery from the failed creation state. Not an ideal UX, but we’ll work on adding the retry so you won’t get to this point.
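
For the configuration in this issue, the recovery would be a single command (the resource path is the one from the 409 error above):

$ terraform import google_compute_firewall.rule-allow-internal-access projects/foo-240012/global/firewalls/allow-internal-access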

@ialidzhikov sorry, there is not much we can do at the provider level. As I mentioned, this fails on Refreshing state..., which is handled by Terraform code. If you have a chance to review the debug log, you might be able to see that it has not hit the code you referenced above. I am closing the issue now. Please feel free to reopen it with more info. Thanks