terraform-provider-aws: Intermittent error using s3 state
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Terraform Version
Terraform 0.11.7 aws 1.17.0
Affected Resource(s)
- s3 backend
Terraform Configuration Files
(names have been changed to protect the innocent…)
provider "aws" {
  region  = "${var.Region}"
  version = "~> 1.17.0"
}

terraform {
  backend "s3" {
    profile              = "TheCloud"
    bucket               = "the-cloud-terraform-state"
    key                  = "instance-terraform.tfstate"
    region               = "us-west-2"
    workspace_key_prefix = "instances"
  }
}

# Retrieve state data from S3
data "terraform_remote_state" "state" {
  backend = "s3"

  config {
    profile              = "TheCloud"
    bucket               = "the-cloud-terraform-state"
    key                  = "instance-terraform.tfstate"
    region               = "us-west-2"
    workspace_key_prefix = "instances"
  }
}
...
Output
...
aws_launch_configuration.Instance: Refreshing state... (ID: instance-launch-config-20180531060133610900000001)
aws_autoscaling_group.Instance: Refreshing state... (ID: Instance-204aaaba-fb33-48e8-88b2-aa190e763b71)
Error: Error refreshing state: 1 error(s) occurred:
* data.terraform_remote_state.state: 1 error(s) occurred:
* data.terraform_remote_state.state: data.terraform_remote_state.state: error loading the remote state: RequestError: send request failed
caused by: Get https://the-cloud-terraform-state.s3.us-west-2.amazonaws.com/?prefix=instances%2F: dial tcp 54.231.177.36:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
...
Expected Behavior
Terraform to work as expected, consistently
Actual Behavior
Intermittent error talking to s3
Steps to Reproduce
It's intermittent 😦
terraform apply
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 81
- Comments: 24 (8 by maintainers)
+1 here, hitting lots of this, even with max_retries set to 10 in the aws provider block.
Also seeing this issue intermittently on an EC2 instance in the same region as the s3 bucket. 0.11.8, 1.36.0. Sometimes it is failing to load the main state file from s3 and other times it is failing to load a terraform_remote_state also from s3.
And occasionally states are failing to be uploaded which results in the state file and dynamo lock table getting out of sync and needing to be manually repaired:
The s3 backend code (which seems to be used by the terraform_remote_state data source as well) appears to do some basic retrying, but this EOF issue is either not getting caught by s3ErrCodeInternalError or is occurring twice in a row. It would be great to make the retry logic more intelligent with an exponential backoff of some sort. If needed I can try to repro this with more verbose logging so we can get the AWS error code.
I have submitted a pull request upstream, which should resolve this: https://github.com/hashicorp/terraform/pull/19951
My case is on Terraform v0.11.10 and it is not intermittent. Every terraform init command failed with the same error.
We’ve been experiencing the same issue in our CI pipeline, and have narrowed it down to two problems.
1. ListObjects doesn’t retry on any error
See https://github.com/hashicorp/terraform/blob/49d62d3a1b99abf65711f5c8fdf2396931044db3/backend/remote-state/s3/backend_state.go#L29
2. GetObject only retries on S3-specific errors
See https://github.com/hashicorp/terraform/blob/49d62d3a1b99abf65711f5c8fdf2396931044db3/backend/remote-state/s3/client.go#L104
/cc @bracki
For issues relating to version 1.34.0+ of the AWS provider plan/apply being much slower, you might be interested in this thread, potentially related to DNS handling and Go 1.11 (v1.34.0 was the first release on Go 1.11): https://github.com/terraform-providers/terraform-provider-aws/issues/5822#issuecomment-424712521
Same here. Yesterday was HORRIBLE with a ton of failures executing terragrunt apply-all across many projects, AWS accounts, and buckets. Things seem improved today, but not completely. We are using terraform 0.11.7 and aws provider 1.37.0. We have backed it down to aws provider 1.33.0 as a precaution. Applies seem faster, but there is no guarantee it solved anything because of the intermittent nature of the issue.
Seeing this very often when using terragrunt with the apply-all command that runs several Terraform instances in parallel across different modules. It’s so bad that I can’t do deployments from my own laptop anymore and have to instead deploy an EC2 instance and git clone my code there. Is it possible AWS/S3 is throttling? Or is there some concurrency issue with Go?