terraform-provider-aws: resource/aws_route: route is not saved in the state when it fails to be available (in 2m) on creation

Terraform CLI and Terraform AWS Provider Version

Terraform version: 0.12.31
provider-aws version: 3.54.0

Affected Resource(s)

  • aws_route

Terraform Configuration Files

provider "aws" {
  access_key = var.ACCESS_KEY_ID
  secret_key = var.SECRET_ACCESS_KEY
  region     = "eu-west-1"
}

resource "aws_vpc" "vpc" {
  cidr_block           = "10.222.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public_utility_z0" {
  vpc_id            = aws_vpc.vpc.id
  cidr_block        = "10.222.96.0/26"
  availability_zone = "eu-west-1a"
}

resource "aws_eip" "eip_natgw_z0" {
  vpc = true
}

resource "aws_nat_gateway" "natgw_z0" {
  allocation_id = aws_eip.eip_natgw_z0.id
  subnet_id     = aws_subnet.public_utility_z0.id
}

resource "aws_route_table" "routetable_private_utility_z0" {
  vpc_id = aws_vpc.vpc.id
}

resource "aws_route" "private_utility_z0_nat" {
  route_table_id         = aws_route_table.routetable_private_utility_z0.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.natgw_z0.id
}

Debug Output

N/A

Panic Output

N/A

Expected Behavior

I would expect terraform-provider-aws to save the route in the state and then wait for it to become available. If the route does not become available within 2m, terraform apply should fail, but the route should remain in the state so that nothing is lost or leaked. Alternatively, terraform-provider-aws could skip saving the route in the state, provided it ensures that the route that did not become available within 2m is deleted afterwards. I don’t have much experience with handling such cases; this is just a rough idea from my side.

Other resources most likely already handle this better, and a similar approach is probably applicable to aws_route.

For example, I compared subnet creation with route creation.

On subnet creation, we call d.SetId(subnetId) (which I think is the part that ensures the resource is saved in the state) right after the subnet create call and before waiting for the subnet to become ready. I think that with this handling we don’t lose the subnet state if, for example, the subnet never becomes ready.

https://github.com/hashicorp/terraform-provider-aws/blob/9ff6122a9835dcad14bfc22828cb7a4cf1269b66/aws/resource_aws_subnet.go#L150-L176

On route creation, on the other hand, we call d.SetId(tfec2.RouteCreateID(routeTableID, destination)) only after waiting for the route to become available. So if the route does not become available, we don’t save the resource in the state.

https://github.com/hashicorp/terraform-provider-aws/blob/9ff6122a9835dcad14bfc22828cb7a4cf1269b66/aws/resource_aws_route.go#L233-L253

Actual Behavior

When applying the above partial terraform configuration, we notice that the aws_route is not saved in the state (it “leaks”) when the route fails to become available (within 2m) on creation.

The corresponding failure is:

error waiting for Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0) to become available: timeout while waiting for state to become 'ready' (timeout: 2m0s)
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {

Afterwards, any subsequent terraform apply run fails with reason RouteAlreadyExists:

error creating Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0): RouteAlreadyExists: The route identified by 0.0.0.0/0 already exists.
	status code: 400, request id: <omitted>
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {

Background: we have a lot of automation on top of terraform apply. Such inconsistencies cause us trouble because terraform apply cannot recover on its own from a state where a resource has leaked and a new one cannot be created because of it. Such cases require manual intervention to fix both the terraform state and the state of the infrastructure.

Steps to Reproduce

  1. terraform apply

The first terraform apply has to hit the case where the route does not become available within 2m, so that it fails with:

error waiting for Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0) to become available: timeout while waiting for state to become 'ready' (timeout: 2m0s)
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {

  2. Ensure that the corresponding route is leaked and any subsequent terraform apply fails with reason RouteAlreadyExists.

Important Factoids

N/A

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 39
  • Comments: 36 (6 by maintainers)

Most upvoted comments

Yes, I can confirm that v3.66 has fixed this issue after @ialidzhikov’s changes. Big thanks to @ialidzhikov for updating route retries to 1000. Cheers.

This is still an issue even with 1001 retries. The problem is that although terraform creates the route, it doesn’t update the state and keeps retrying until the timeout and/or the retry limit is reached. Terraform should be able to check the route status on every retry and should therefore update the state correctly.

@cdancy unfortunately it didn’t help

@cdancy but then it’s clear why you are hit: route creation often takes longer than the default 2m timeout. Your vpc timeout is not related to the aws_route timeout. The latter has been configurable since 3.62.0, so you could try that.
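
For anyone who wants to try the configurable timeout mentioned above, a minimal sketch could look like the following. It reuses the aws_route from the reproduction and adds a timeouts block. The 10m value is only an assumption for illustration, not a recommendation from this thread, and per the comment above it needs provider-aws 3.62.0 or later.

```hcl
resource "aws_route" "private_utility_z0_nat" {
  route_table_id         = aws_route_table.routetable_private_utility_z0.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.natgw_z0.id

  # Raise the create timeout above the 2m default seen in the error message.
  # "10m" is an assumed value; pick one that fits how long route creation
  # actually takes in your account.
  timeouts {
    create = "10m"
  }
}
```

Note that a longer timeout only makes the timeout error less likely. It does not change what ends up in the state if the wait still times out, which is the behavior this issue is about.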