terraform-provider-aws: resource/aws_route: route is not saved in the state when it fails to be available (in 2m) on creation
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Terraform CLI and Terraform AWS Provider Version
- Terraform CLI version: 0.12.31
- Terraform AWS Provider version: 3.54.0
Affected Resource(s)
- aws_route
Terraform Configuration Files
provider "aws" {
access_key = var.ACCESS_KEY_ID
secret_key = var.SECRET_ACCESS_KEY
region = "eu-west-1"
}
resource "aws_vpc" "vpc" {
cidr_block = "10.222.0.0/16"
enable_dns_support = true
enable_dns_hostnames = true
}
resource "aws_subnet" "public_utility_z0" {
vpc_id = aws_vpc.vpc.id
cidr_block = "10.222.96.0/26"
availability_zone = "eu-west-1a"
}
resource "aws_eip" "eip_natgw_z0" {
vpc = true
}
resource "aws_nat_gateway" "natgw_z0" {
allocation_id = aws_eip.eip_natgw_z0.id
subnet_id = aws_subnet.public_utility_z0.id
}
resource "aws_route_table" "routetable_private_utility_z0" {
vpc_id = aws_vpc.vpc.id
}
resource "aws_route" "private_utility_z0_nat" {
route_table_id = aws_route_table.routetable_private_utility_z0.id
destination_cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.natgw_z0.id
}
Debug Output
N/A
Panic Output
N/A
Expected Behavior
I would expect terraform-provider-aws to save the route in the state and then wait for it to become ready, so that terraform apply still fails when the route does not become available (in 2m), but the route is kept in the state and nothing is lost/leaked.
Alternatively, terraform-provider-aws could leave the route out of the state, but ensure that the corresponding route (the one that did not become available in 2m) is deleted afterwards.
I don’t have much experience with handling such cases; this is just a rough idea from my side.
Most likely other resources already handle this in a better way, and something similar is probably applicable to aws_route.
For example, I compared the subnet and route creation code.
On subnet creation we call d.SetId(subnetId) (as far as I can tell, the part that ensures the resource is saved in the state) right after the subnet create call and before the wait for the subnet to become ready. With this handling we do not lose the subnet state if, for example, the subnet never becomes ready.
On the other hand, on route creation we call d.SetId(tfec2.RouteCreateID(routeTableID, destination)) only after the wait for the route to become available. So if the route does not become available, the resource is never saved in the state.
Actual Behavior
When applying the above (partial) Terraform configuration, we notice that the aws_route is not saved in the state (“leaks”) when the route fails to become available (in 2m) on creation.
The corresponding failure is:
error waiting for Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0) to become available: timeout while waiting for state to become 'ready' (timeout: 2m0s)
on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
354: resource "aws_route" "private_utility_z0_nat" {
Afterwards, any subsequent terraform apply run fails with reason RouteAlreadyExists:
error creating Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0): RouteAlreadyExists: The route identified by 0.0.0.0/0 already exists.
status code: 400, request id: <omitted>
on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
354: resource "aws_route" "private_utility_z0_nat" {
Background: we have a lot of automation on top of terraform apply. Such inconsistencies cause us trouble because terraform apply itself is not able to recover from a state where a resource is leaked and a new one cannot be created because of it. Such cases require manual intervention to fix the Terraform state and the state in the infrastructure.
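A possible manual recovery (assuming the route was actually created in AWS and only the state entry is missing) is to import the leaked route back into the state; the aws_route import ID is the route table ID and the destination joined by an underscore, for example:

  terraform import aws_route.private_utility_z0_nat rtb-07269a2e5202b1234_0.0.0.0/0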
Steps to Reproduce
- Run terraform apply.
- Observe that the first terraform apply can fail because the route does not become available within 2m:
error waiting for Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0) to become available: timeout while waiting for state to become 'ready' (timeout: 2m0s)
on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
354: resource "aws_route" "private_utility_z0_nat" {
- Ensure that the corresponding route is leaked and any subsequent terraform apply fails with reason RouteAlreadyExists.
Important Factoids
N/A
References
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 39
- Comments: 36 (6 by maintainers)
Commits related to this issue
- Increase route_table Create timeout From 2 minutes to 5 minutes to address https://github.com/hashicorp/terraform-provider-aws/issues/21032 — committed to huguesalary/terraform-provider-aws by huguesalary 3 years ago
Yes, I can confirm that v3.66 has fixed this issue after @ialidzhikov’s changes. Big thanks to @ialidzhikov for updating the route retries to 1000. Cheers.
This is still an issue even with 1001 retries. The problem is that Terraform does create the route, but does not update the state and keeps retrying until the timeout and/or retry limit is reached. Terraform should check the route status on every retry and therefore update the state correctly.
@cdancy unfortunately it didn’t help
@cdancy but then it is clear why you are hit (route creation often takes longer than the default 2m timeout). Your VPC timeout is not related to the aws_route timeout; the latter can be configured since 3.62.0, so you could try that.
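For example, an increased create timeout on the aws_route resource from the configuration above would look roughly like this (the timeouts block is supported for aws_route since provider v3.62.0; the 5m value is just an illustrative choice):

resource "aws_route" "private_utility_z0_nat" {
  route_table_id         = aws_route_table.routetable_private_utility_z0.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.natgw_z0.id

  # Give the route more time to become available before creation gives up.
  timeouts {
    create = "5m"
  }
}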