configure-aws-credentials: Failures occur with OIDC Method During Parallel Requests

I have a repo in which I am testing out the new GitHub OIDC provider functionality. The repo contains about 20 workflows. Typically these workflows run individually, since they are triggered by different events. However, in my particular scenario I have a PR open that changes all of these workflows, so they all trigger at the same time. When this occurs I see the following error in some, but not all, of the workflows:

Couldn’t retrieve verification key from your identity provider, please reference AssumeRoleWithWebIdentity documentation for requirements


Simply clicking the "Re-run all jobs" button resolves the error when the workflow runs a second time.

Are there any known limits with how many workflows can be run in parallel with the GitHub OIDC provider? Or is this a bug?

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 27
  • Comments: 26 (3 by maintainers)

Most upvoted comments

The OIDC fix has been released; if you have any questions, feel free to reach out to us. Thank you for your patience 🙏🏼

This has been a huge blocker for us; we regularly need to deploy a number of stacks in parallel. The previous solution was not robust enough, so we rolled our own credentials action with retry and exponential backoff. If you’re interested, I’m including it below. Note that we used a hard-coded upper limit on retries.

name: AWS OIDC Credentials
description: |
  Gets OIDC credentials from AWS, implementing retry + exponential backoff
inputs:
  role-to-assume:
    description: 'Role to assume'
    required: true
  session-name:
    description: 'Session name'
    required: true
    default: GithubOIDC
  aws-region:
    description: 'AWS region'
    default: eu-west-1
    required: true
runs:
  using: composite
  steps:
      - name: Grab credentials
        shell: bash
        run: |
          # Request an OIDC token for the AWS STS audience from GitHub's token endpoint
          WEB_TOKEN=$(curl -s -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')
          # Exchange the token for temporary credentials, retrying with
          # exponential backoff (at most six attempts in total)
          count=0
          backoff=1
          until ids=$(aws sts assume-role-with-web-identity \
            --role-arn "${{ inputs.role-to-assume }}" \
            --role-session-name "${{ inputs.session-name }}" \
            --web-identity-token "$WEB_TOKEN" \
            --region "${{ inputs.aws-region }}") || (( count++ >= 5 )); do echo "Retrying in ${backoff}s"; sleep $backoff; (( backoff*=2 )); done

          # Mask the temporary credentials in the logs and export them as
          # environment variables for subsequent steps
          ids=$(echo "$ids" | jq -r .Credentials)
          echo "::add-mask::$(echo "$ids" | jq -r .AccessKeyId)"
          echo "::add-mask::$(echo "$ids" | jq -r .SessionToken)"
          echo "::add-mask::$(echo "$ids" | jq -r .SecretAccessKey)"
          echo "AWS_ACCESS_KEY_ID=$(echo "$ids" | jq -r .AccessKeyId)" >> $GITHUB_ENV
          echo "AWS_SESSION_TOKEN=$(echo "$ids" | jq -r .SessionToken)" >> $GITHUB_ENV
          echo "AWS_SECRET_ACCESS_KEY=$(echo "$ids" | jq -r .SecretAccessKey)" >> $GITHUB_ENV
          region=${{ inputs.aws-region }}
          echo "AWS_REGION=$region" >> $GITHUB_ENV
          echo "AWS_DEFAULT_REGION=$region" >> $GITHUB_ENV

I’m currently testing backoff retry logic with the async-retry npm package. It seems to be working … https://github.com/blz-ea/configure-aws-credentials/blob/feature/role-ec2-identity/index.js#L433-L461

I tried spreading the OIDC login across 10 identical IAM roles. It did not help; it looks like the rate limit is per AWS account rather than per role.

This is the minimal reproducer:

name: test1

on:
  push:

permissions:
  id-token: write

jobs:

  test1:
    strategy:
      matrix:
        foo: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
      fail-fast: false

    runs-on: ubuntu-latest
    steps:

      - name: Login to AWS via Github OIDC
        uses: aws-actions/configure-aws-credentials@v1
        with:
          role-to-assume: <role-arn>
          aws-region: eu-west-1
          mask-aws-account-id: no

      - run: aws sts get-caller-identity

Of the 40 jobs, only about 10-20 succeed.

I hit the same issue and worked around it by reusing the AWS auth within the workflow. This is another approach that avoids the retry attempts taking too long or failing outright.

Here is how I added the reuse: the AWS auth action runs before any other jobs, and then the jobs that require AWS auth can run in parallel.

https://gist.github.com/guitarrapc/bb279d0a0be2b229501a673980f96280
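
To illustrate the pattern, here is a rough sketch of one job authenticating once and parallel jobs reusing the result. It assumes (as a later comment notes) that the credentials are passed along via the Actions cache; the cache key, file paths, and role ARN are placeholders, the temporary credentials still expire after the role's session duration (one hour by default), and the caveat further down about caching sensitive data applies:

name: reuse-aws-auth

on:
  push:

permissions:
  id-token: write

jobs:
  aws-auth:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          role-to-assume: <role-arn>
          aws-region: eu-west-1
      - name: Write credentials file
        # configure-aws-credentials exports AWS_* environment variables;
        # persist them to a profile file so other jobs can restore it
        run: |
          mkdir -p ~/.aws
          {
            echo "[default]"
            echo "aws_access_key_id=$AWS_ACCESS_KEY_ID"
            echo "aws_secret_access_key=$AWS_SECRET_ACCESS_KEY"
            echo "aws_session_token=$AWS_SESSION_TOKEN"
          } > ~/.aws/credentials
      - name: Cache credentials for downstream jobs
        uses: actions/cache@v2
        with:
          path: ~/.aws
          key: aws-credentials-${{ github.run_id }}

  deploy:
    needs: aws-auth           # only one job performs the OIDC/STS exchange
    strategy:
      matrix:
        stack: [1, 2, 3]
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
      - name: Restore cached credentials
        uses: actions/cache@v2
        with:
          path: ~/.aws
          key: aws-credentials-${{ github.run_id }}
      - run: aws sts get-caller-identity
        env:
          AWS_DEFAULT_REGION: eu-west-1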

I’ve come up with a workaround for this issue that is fairly reliable: adding conditional retry steps to my workflows. Ideally this is something that would be handled directly by this action.

Here is how I add the conditional retries. Between each invocation of the aws-actions/configure-aws-credentials action is a step which sleeps for a random amount of time, roughly between 15 and 65 seconds. The idea behind the random sleep is that it will hopefully cause the various workflows running in parallel to retry at different points in time.

- name: Configure AWS Credentials
  id: aws-auth
  continue-on-error: true
  uses: aws-actions/configure-aws-credentials@e21f7333e801ca751f058cc52de17f0ee6e1da6f
  with:
    aws-region: us-west-2
    mask-aws-account-id: false
    role-to-assume: arn:aws:iam::123456789012:role/myrole

- name: Sleep
  if: steps.aws-auth.outcome != 'success'
  run: sleep $(( (RANDOM % 50) + 15 ))s

- name: Configure AWS Credentials Retry
  id: aws-auth2
  continue-on-error: true
  if: steps.aws-auth.outcome != 'success'
  uses: aws-actions/configure-aws-credentials@e21f7333e801ca751f058cc52de17f0ee6e1da6f
  with:
    aws-region: us-west-2
    mask-aws-account-id: false
    role-to-assume: arn:aws:iam::123456789012:role/myrole

- name: Sleep
  if: steps.aws-auth.outcome != 'success' && steps.aws-auth2.outcome != 'success'
  run: sleep $(( (RANDOM % 50) + 15 ))s

- name: Configure AWS Credentials Retry
  id: aws-auth3
  if: steps.aws-auth.outcome != 'success' && steps.aws-auth2.outcome != 'success'
  uses: aws-actions/configure-aws-credentials@e21f7333e801ca751f058cc52de17f0ee6e1da6f
  with:
    aws-region: us-west-2
    mask-aws-account-id: false
    role-to-assume: arn:aws:iam::123456789012:role/myrole

This example only tries three times to authenticate. In theory any number of attempts could be made, but as you can see, adding additional retries gets quite verbose.

Great solution! I would recommend adding a check to validate that the aws sts assume-role-with-web-identity command actually succeeded. In my workflow, even with 5 retries (the way your example is set up), I still get occasional failures, and without that validation the step succeeds even though the role was not assumed. Something like this works for me:

name: AWS OIDC Credentials
description: |
  Gets OIDC credentials from AWS, implementing retry + exponential backoff
inputs:
  role-to-assume:
    description: 'Role to assume'
    required: true
  session-name:
    description: 'Session name'
    required: true
    default: GithubOIDC
  aws-region:
    description: 'AWS region'
    default: eu-west-1
    required: true
runs:
  using: composite
  steps:
      - name: Grab credentials
        shell: bash
        run: |
          # Request an OIDC token for the AWS STS audience from GitHub's token endpoint
          WEB_TOKEN=$(curl -s -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')
          # Exchange the token for temporary credentials, retrying with
          # exponential backoff (at most six attempts in total)
          count=0
          backoff=1
          until ids=$(aws sts assume-role-with-web-identity \
            --role-arn "${{ inputs.role-to-assume }}" \
            --role-session-name "${{ inputs.session-name }}" \
            --web-identity-token "$WEB_TOKEN" \
            --region "${{ inputs.aws-region }}") || (( count++ >= 5 )); do echo "Retrying in ${backoff}s"; sleep $backoff; (( backoff*=2 )); done

          if [ -z "$ids" ]; then
            echo "failed to assume IAM role after retries"
            exit 1
          fi

          # Mask the temporary credentials in the logs and export them as
          # environment variables for subsequent steps
          ids=$(echo "$ids" | jq -r .Credentials)
          echo "::add-mask::$(echo "$ids" | jq -r .AccessKeyId)"
          echo "::add-mask::$(echo "$ids" | jq -r .SessionToken)"
          echo "::add-mask::$(echo "$ids" | jq -r .SecretAccessKey)"
          echo "AWS_ACCESS_KEY_ID=$(echo "$ids" | jq -r .AccessKeyId)" >> $GITHUB_ENV
          echo "AWS_SESSION_TOKEN=$(echo "$ids" | jq -r .SessionToken)" >> $GITHUB_ENV
          echo "AWS_SECRET_ACCESS_KEY=$(echo "$ids" | jq -r .SecretAccessKey)" >> $GITHUB_ENV
          region=${{ inputs.aws-region }}
          echo "AWS_REGION=$region" >> $GITHUB_ENV
          echo "AWS_DEFAULT_REGION=$region" >> $GITHUB_ENV

Just as another possible workaround (one that does not deal with the cross-account issue we suspect is happening), let me add this here. For a large job matrix, you can add a sleep based on the index of the job, i.e.:

      - name: Sleep, Data, Sleep
        run: sleep $(( (${{ strategy.job-index }} + 1) * 3 ))s
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
     ...

YMMV, but it may be an easy enough solution to get us through until this issue is dealt with.

I’m convinced that it is on AWS, or between AWS and GitHub. It is IMO not a problem of getting the JWT from GitHub.

You are probably right; I am leaning towards it being between AWS and GitHub. IIRC the flow used here does have a server-to-server validation step. Adding the stack trace output (by setting SHOW_STACK_TRACE: 'true' in the env variables) adds the additional information that the STS API returns an InvalidIdentityToken error (see e.g. here). In the issue you linked, there’s mention of a timeout of 5 seconds (presumably on GitHub’s OIDC server response in the validation step), though I’m a little skeptical that this is what’s happening here, because then I would expect an error like IDPCommunicationError.
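
For anyone who wants to reproduce that extra output, the variable goes in the step's env block; a minimal sketch (role ARN and region are placeholders):

- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v1
  env:
    SHOW_STACK_TRACE: 'true'   # surfaces the underlying STS error, e.g. InvalidIdentityToken
  with:
    role-to-assume: <role-arn>
    aws-region: eu-west-1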

I think the best course of action is to implement retries in the configure-aws-credentials action. I’m not sure whether you can distinguish “denied by rate limit” from “denied by some other cause”, so the action should fail on the first error as it does now, and only retry failed calls when retries are explicitly enabled by the caller.
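
Purely as an illustration of what such an opt-in could look like from the caller's side (these inputs are hypothetical and did not exist in the action at the time of this thread):

- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    role-to-assume: <role-arn>
    aws-region: eu-west-1
    # hypothetical inputs, shown only to sketch the proposal above
    retry-on-error: true
    retry-max-attempts: 5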

According to the docs referenced above, a number of errors could be classified as retryable. These could be used to decide whether or not to retry (perhaps with exponential backoff).

For us, though, the only viable path forward for now (other than dropping back to using tokens in GH Secrets) is to obtain the credentials once in a step, and export them using job outputs/needs, etc…

You are right. Although the token will be disabled after 900 seconds (minimum) or 1 hour (default), caching sensitive data is not recommended. Looking into https://github.com/actions/runner/issues/1466#issuecomment-966122092, if it is a JS SDK issue, I hope configure-aws-credentials will fix the 10+ parallel requests limitation.

I hit the same issue and worked around it by reusing the AWS auth within the workflow. This is another approach that avoids the retry attempts taking too long or failing outright.

Here is how I added the reuse: the AWS auth action runs before any other jobs, and then the jobs that require AWS auth can run in parallel.

https://gist.github.com/guitarrapc/bb279d0a0be2b229501a673980f96280

I like this solution for the scenario where multiple jobs are running in parallel and each job needs to authenticate. However, I don’t know whether it is considered good practice to store credentials in the GitHub Actions cache. There is a disclaimer here stating that sensitive data should NOT be stored in the cache for public repos; there is no recommendation for private repos.