terraform-provider-aws: Intermittent network issues (read: connection reset errors)

Terraform Version

Terraform v0.12.23

We’re running a drift-detection workflow on GitHub-hosted GitHub Actions runners. It simply runs terraform plan and fails if the plan reports any changes, and it runs on a schedule every hour. We’re seeing request errors that cause terraform plan to fail around 2-3 times a day.
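For context, a common variant of that kind of drift check relies on terraform’s -detailed-exitcode flag rather than inspecting the plan output. The real workflow here runs in GitHub Actions, so the following small Go wrapper is only an illustrative sketch of the check, not the reporter’s setup:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// With -detailed-exitcode, terraform plan exits with:
	//   0 = no changes, 1 = error, 2 = changes (i.e. drift) present.
	cmd := exec.Command("terraform", "plan", "-detailed-exitcode", "-input=false")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	err := cmd.Run()
	if err == nil {
		fmt.Println("no drift detected")
		return
	}

	if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == 2 {
		fmt.Println("drift detected: plan is not empty")
		os.Exit(2)
	}

	// Exit code 1 (or failing to run terraform at all) is where the
	// intermittent "connection reset by peer" errors surface.
	fmt.Println("terraform plan failed:", err)
	os.Exit(1)
}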

Some of the request errors we’ve so far encountered:

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/E26H********: read tcp 10.1.0.4:52046->54.239.29.26:443: read: connection reset by peer

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/E1B3D********: read tcp 10.1.0.4:33408->54.239.29.51:443: read: connection reset by peer

Error: error listing tags for CloudFront Distribution (E24R********): RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/tagging?Resource=arn%3Aaws%3Acloudfront%3A%3A*********%3Adistribution%2FE24********: read tcp 10.1.0.4:56918->54.239.29.65:443: read: connection reset by peer

Error: error getting S3 Bucket website configuration: RequestError: send request failed
caused by: Get https://******.s3.amazonaws.com/?website=: read tcp 10.1.0.4:59070->52.216.20.56:443: read: connection reset by peer

Error: error getting S3 Bucket replication: RequestError: send request failed
caused by: Get https://*******.s3.amazonaws.com/?replication=: read tcp 10.1.0.4:60534->52.216.138.67:443: read: connection reset by peer

Most of these seem to involve CloudFront and S3.

Thanks

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 58
  • Comments: 38 (12 by maintainers)

Most upvoted comments

I started seeing these today.

Error: Error reading IAM policy version arn:aws:iam::XXXX:policy/OktaChildAccountPolicy: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52180->52.94.225.3:443: read: connection reset by peer



Error: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52172->52.94.225.3:443: read: connection reset by peer



Error: Error reading IAM Role Okta-Idp-cross-account-role: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52171->52.94.225.3:443: read: connection reset by peer

https://status.aws.amazon.com/

1:50 PM PDT We are investigating increased error rates and latencies affecting IAM. IAM related requests to other AWS services may also be impacted.

There is definitely a problem on the AWS side. If you go to the CloudFront console and hit refresh a few times, you’re now very likely to encounter the same error (screenshot omitted).

We haven’t had this happen for more than a week now. Could it have been fixed on the AWS side?

Hi again 👋 Since it appears that this was handled on the AWS side (based both on this issue and on the lack of Terraform support tickets), our preference is to leave things as they are for now. If this comes up again, especially since CloudFront seems to be affected most prominently when it occurs, we can definitely think more about this network connection handling. 👍

For what it’s worth: every time I’ve seen this issue, it’s been on read calls to either CloudFront distribution configs or CloudFront origin access identities.

It may not be the best way to approach solving the issue, but given that the majority of the connection reset issues seem to be with specific CloudFront read calls plus a few others, it might be worth just adding retries to individual API calls (CloudFront or otherwise) as they become problematic, as sketched below.
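A minimal sketch of what such a per-call retry could look like at the SDK level, assuming aws-sdk-go v1 and retrying only on connection resets. This is illustrative rather than the provider’s actual implementation, and the retryOnConnReset helper and the distribution ID are made up:

package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudfront"
)

// retryOnConnReset is a hypothetical helper: it retries fn a few times when
// the error looks like a TCP connection reset, which the SDK otherwise
// surfaces straight away as "RequestError: send request failed".
func retryOnConnReset(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "connection reset by peer") {
			return err // not a transient network error, give up immediately
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	svc := cloudfront.New(sess)

	var out *cloudfront.GetDistributionOutput
	err := retryOnConnReset(3, func() error {
		var callErr error
		// Placeholder distribution ID; the real IDs are redacted in this issue.
		out, callErr = svc.GetDistribution(&cloudfront.GetDistributionInput{
			Id: aws.String("E1234EXAMPLE"),
		})
		return callErr
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("distribution found:", aws.StringValue(out.Distribution.Id))
}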

Seeing this same issue in Terraform Cloud, specifically with the cloudfront_distribution and cloudfront_origin_access_identity resources - it’s happening almost daily at this point.

We have a support ticket open with AWS for both this issue and https://github.com/terraform-providers/terraform-provider-aws/issues/14797. Especially in the latter case, it would greatly help if TRACE logging showed complete requests and responses, so that we/AWS could understand what is going on.

Or maybe even something separate like HTTP_TRACE that only shows requests and responses, which in most cases is the more interesting part when debugging these kinds of issues.
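For what it’s worth, the underlying aws-sdk-go can already be told to log full HTTP requests and responses. A minimal standalone sketch of that SDK-level switch (not the provider’s actual logging wiring, and using ListDistributions purely as an example call) looks like this:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudfront"
)

func main() {
	// LogDebugWithHTTPBody makes the SDK print every HTTP request and
	// response, including bodies, to the configured logger. This is the
	// SDK-level switch a hypothetical HTTP_TRACE setting would toggle.
	sess := session.Must(session.NewSession(&aws.Config{
		LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody),
		Logger:   aws.NewDefaultLogger(),
	}))

	svc := cloudfront.New(sess)

	// Any call now dumps the wire-level exchange, which is exactly the kind
	// of detail you would want attached to an AWS support case.
	if _, err := svc.ListDistributions(&cloudfront.ListDistributionsInput{}); err != nil {
		log.Fatal(err)
	}
}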

We are experiencing this issue on our Jenkins instance hosted on EC2; we run multiple nodes behind a NAT gateway (so a shared IP for outgoing connections).

Error:

Error: error waiting until CloudFront Distribution (XXXXX) is deployed: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/XXXXX: read tcp 10.x.x.x:35832->54.x.x.x:443: read: connection reset by peer

Terraform Resource: aws_cloudfront_distribution
Terraform Operation: Read
AWS Service: CloudFront
API Call: GetDistribution
Terraform Environment: AWS VPC (EC2, Concourse CI)
Terraform Concurrency: 10 (default)
Known HTTP Proxy: No
How Many Resources: 1 in the same configuration
How Often: ~80% of runs (4 out of 5 in 24h)

Terraform 0.12.28, AWS provider 2.70.0

We’re also experiencing the issue in Terraform Cloud, using v0.12.28 and v0.12.29 with the AWS provider pinned to ~> 2.0.

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/ABCD1234567: read tcp 10.181.43.96:56350->54.239.29.51:443: read: connection reset by peer

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/ABCD1234567: read tcp 10.181.43.96:57570->54.239.29.51:443: read: connection reset by peer

seeing this on v0.12.29 as well

GPG-encrypted logs available at https://gist.github.com/mattburgess/2a00b1e77b00368781360ac8581383b9

analytical-dataset-generation_analytical-dataset-generation-qa_154.log.gpg - this one failed after seeing a single connection reset by peer error; no retries were attempted.

analytical-dataset-generation_analytical-dataset-generation-preprod_136.log.gpg - this one stalled for 15 minutes after seeing a connection reset by peer error, then retried and succeeded on its first retry.

Having the same issues in our CI/CD pipeline