terraform-provider-aws: Expired STS token causes Terraform to hang

If the STS token in ~/.aws/credentials is expired when I invoke terraform apply, Terraform seemingly hangs and becomes unresponsive, requiring two SIGINTs to quit. Trace logs show that it’s repeatedly calling sts:GetCallerIdentity, which results in a 403 Forbidden with an ExpiredToken code.

...
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Action=GetCallerIdentity&Version=2011-06-15
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: -----------------------------------------------------
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: 2017/08/04 21:14:40 [DEBUG] [aws-sdk-go] DEBUG: Response sts/GetCallerIdentity Details:
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: ---[ RESPONSE ]--------------------------------------
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: HTTP/1.1 403 Forbidden
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Connection: close
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Content-Length: 297
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Content-Type: text/xml
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Date: Sat, 05 Aug 2017 04:14:39 GMT
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: X-Amzn-Requestid: 99b535db-7994-11e7-8d9e-e17db6dd7b22
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: 
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: 
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: -----------------------------------------------------
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: 2017/08/04 21:14:40 [DEBUG] [aws-sdk-go] <ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:   <Error>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:     <Type>Sender</Type>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:     <Code>ExpiredToken</Code>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:     <Message>The security token included in the request is expired</Message>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:   </Error>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:   <RequestId>99b535db-7994-11e7-8d9e-e17db6dd7b22</RequestId>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: </ErrorResponse>
...

Terraform Version

Terraform v0.10.0

Affected Resource(s)

N/A

Terraform Configuration Files

provider "aws" {
  region = "us-east-1"
}

resource "aws_security_group" "default" {
  # doesn't matter which resource(s) are used
  name = "foo"
}

Debug Output

See above. I can generate a full trace log if necessary.

Panic Output

N/A

Expected Behavior

What should have happened?

The authentication process should check for an ExpiredToken response code and either return an error or emit some message. I found that if I swap in a fresh, unexpired token while terraform apply is in this (seemingly unresponsive) loop, it’ll work. If the token expires in the middle of a command, it would be nice to allow the user to replace the token (i.e., by polling like it does now), but if the very first auth results in an ExpiredToken, then perhaps it would be appropriate to abort the command?
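To illustrate the fail-fast idea outside of Terraform itself, here is a minimal sketch of a pre-flight check. It assumes the AWS CLI is installed; the function names is_expired_token and preflight_check are hypothetical, and the matching relies on the ExpiredToken code appearing in the error text, as it does in the XML body logged above. This is not how the provider handles the error internally.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if the given error text carries the
# ExpiredToken code, either as the raw XML body or as a CLI message.
is_expired_token() {
  case "$1" in
    *"<Code>ExpiredToken</Code>"* | *"(ExpiredToken)"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical pre-flight check: make one identity call and abort
# immediately (exit code 2) if the very first auth attempt is rejected
# because the token has expired.
preflight_check() {
  local err
  # Capture stderr only; the identity JSON on stdout is discarded.
  if err=$(aws sts get-caller-identity 2>&1 >/dev/null); then
    return 0
  fi
  if is_expired_token "$err"; then
    echo "STS token is expired; refresh it and re-run the command" >&2
    return 2
  fi
  # Any other failure is passed through unchanged.
  echo "$err" >&2
  return 1
}
```

Something like `preflight_check && terraform apply` would then stop before Terraform starts polling.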

Actual Behavior

What actually happened?

Terraform seemed to hang, requiring two Ctrl-Cs to abort.

Steps to Reproduce

  1. terraform apply

Important Factoids

References

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 49
  • Comments: 21 (14 by maintainers)

Most upvoted comments

It would be really nice if this was handled simply by exiting and displaying the expired token message. Currently I have to kill the process and forcibly unlock the Terraform cloud workspace which is a lot of manual work for an error which could trivially be detected with no side-effects.

Reported as https://support.hashicorp.com/hc/en-us/requests/24845

aws sts get-caller-identity may be more lightweight in that context and for that purpose.

I also experience this issue, and as a result use the following workaround in my ~/.bash_profile:

terraform () {
	aws sts get-caller-identity > /dev/null && /usr/local/bin/terraform "$@"
}

Gotcha. I was confused because the main workaround/bug fix I see being discussed right now is having Terraform error out when the credentials become invalid instead of just hanging. That doesn’t increase credential lifetime; it just improves error messages. How would you otherwise recover from a hanging TF process? (I agree the state is kept in RAM there, but that doesn’t seem very useful as it continues to bash its head against the AWS SDK with an expired credential 😃)

I too see this issue very often. I think Terraform should just exit when it receives a token expiry error. When we send SIGINT, the state files are left incomplete: some resources are never recorded in the state file because Terraform has not yet received the creation-complete response.

In my opinion this is the expected behaviour. Any workaround would open a security risk. Also, even if TF fails, the state is kept in RAM until you rerun apply, and then it’s pushed to S3.

Another way is to use aws-vault with --server mode, which will renew the credentials for you.

I’m honestly not sure - the Go SDK appears to support the AWS_PROFILE variable that the command line tools use, but I couldn’t see how to make Terraform use it, or whether the SDK itself would be able to handle renewal that way.

I’ve only used the AssumeRole method, so I’m not sure of the others - Looking at the ARNs returned, there may be some way to handle this if there is a consistent form that could be interpreted, eg *:sts:<account>:<mechanism>/<parameters> where <mechanism> controls the parameters:

* 'assume-role' => `<role>/<label>` (and the original user is given by the tail of the user-id)

though how you might recover the original credentials, I’m not sure - for me, at the command line, I revert to the default profile, then use the extracted parameters to re-request a token, but that won’t work if the token came from elsewhere.
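For what it’s worth, the extraction step can be sketched in bash. The ARN that sts:GetCallerIdentity returns for an assumed role has the form arn:aws:sts::<account>:assumed-role/<role>/<session> (note assumed-role rather than assume-role). The function name parse_assumed_role_arn is hypothetical, and this only splits the string; it does nothing to recover the original credentials.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split an assumed-role ARN into its parts and print
# "<account> <role> <session>". Returns 1 for any other kind of ARN.
parse_assumed_role_arn() {
  local arn=$1 prefix partition service region account resource
  local kind role session
  # Fields are colon-separated; the region field is empty for STS ARNs.
  IFS=: read -r prefix partition service region account resource <<<"$arn"
  # The resource part is "<mechanism>/<role>/<session>".
  IFS=/ read -r kind role session <<<"$resource"
  if [[ $service != sts || $kind != assumed-role || -z $role || -z $session ]]; then
    echo "not an assumed-role ARN: $arn" >&2
    return 1
  fi
  printf '%s %s %s\n' "$account" "$role" "$session"
}
```

The input would typically come from `aws sts get-caller-identity --query Arn --output text`.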

The more I look at this (and say to myself ‘huh, I’ve had a quite naive approach’), the more I think that the only way to usefully deal with renewing the token is if TF itself gained it (using the necessary parameters) such that it can do so again when an expiry error happens. Maybe trying to get some support off Amazon to find out what they might expect for this sort of thing - presumably they have similar constraints when it comes to CloudFormation. I haven’t spoken to Amazon about such things, though.

I wholeheartedly agree that requiring the user to interact in the middle of the session probably isn’t a workable solution - one place where I see TF winning its place is in the CI/CD workflow, where the same mechanism is used to test the system as to deploy it. In such cases, just as you hope for production use, you hope to leave it alone and let it get on with the job 😦

One other ‘interesting’ real world case of problems that I had was that the token expired after requesting the building of a number of expensive EC2 systems… and then because it couldn’t store the results in the S3 bucket (and I didn’t think to look at the backup files), it never recorded the instance ids. There were just a few extra EC2 systems running when I came to review the state of an account a few days later. Because the expiry happened before the tags were asserted (as the tag assignment is done after the creation for some operations), finding the reason those systems had been created was slightly more difficult - they didn’t have names, or associated ‘purpose’ or other tags that we use for tracking.

I hope it’s not felt that I’ve hijacked a thread with a variation on the issue, but seeing that someone else suffered from the same things I have spurred me to give my own experience in the hope that someone knows a way to go.

I think this behaviour is kind of intended, because Terraform does not have a way to lease a new (valid) token; it expects the user to do that, and carries on once they have.

With that said we could (and probably should) give the user some feedback in the UI when this occurs instead of blindly retrying behind the scenes.
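As a rough illustration of retrying with feedback rather than silently, here is a bash sketch that polls sts:GetCallerIdentity through the AWS CLI and prints a message on every rejected attempt. The function name wait_for_valid_credentials and its two parameters (maximum attempts, delay in seconds) are hypothetical; this runs outside Terraform and is not the provider’s retry logic.

```shell
#!/usr/bin/env bash

# Hypothetical helper: poll until the current credentials are accepted,
# telling the user what is happening on every failed attempt.
#   $1 = maximum number of attempts (default 5)
#   $2 = delay between attempts in seconds (default 10)
wait_for_valid_credentials() {
  local max_attempts=${1:-5} delay=${2:-10} attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if aws sts get-caller-identity >/dev/null 2>&1; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts}: credentials rejected; replace the token to continue" >&2
    # No point sleeping after the final attempt.
    if ((attempt < max_attempts)); then
      sleep "$delay"
    fi
  done
  echo "giving up after ${max_attempts} attempts" >&2
  return 1
}
```

For example, `wait_for_valid_credentials 6 30 && terraform apply` would give the user roughly three minutes to swap in a fresh token before giving up.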