terraform-provider-github: Slow performance when managing dozens of repositories

Terraform Version

0.12.6

Affected Resource(s)

  • github_repository
  • github_branch_protection
  • github_team_repository
  • github_actions_secret

Terraform Configuration Files

Here’s our repo module (slightly redacted):

terraform {
  required_providers {
    github = ">= 3.1.0"
  }
}

locals {
  # Terraform modules must be named `terraform-<provider>-<module name>`
  # so we can extract the provider easily
  provider = element(split("-", var.repository), 1)
}

data "github_team" "****" {
  slug = "****"
}

data "github_team" "****" {
  slug = "****"
}

resource "github_repository" "main" {
  name        = var.repository
  description = var.description

  visibility = var.visibility

  topics = [
    "terraform",
    "terraform-module",
    "terraform-${local.provider}"
  ]

  has_issues   = var.has_issues
  has_projects = var.has_projects
  has_wiki     = var.has_wiki

  vulnerability_alerts   = true
  delete_branch_on_merge = true

  archived = var.archived

  dynamic "template" {
    for_each = var.fork ? [] : [var.fork]

    content {
      owner      = "waveaccounting"
      repository = "****"
    }
  }
}

resource "github_branch_protection" "main" {
  repository_id = github_repository.main.node_id
  pattern       = github_repository.main.default_branch

  required_status_checks {
    strict = true
    contexts = [
      "Terraform",
      "docs",
    ]
  }

  required_pull_request_reviews {
    dismiss_stale_reviews      = true
    require_code_owner_reviews = true
  }
}

resource "github_team_repository" "****" {
  team_id    = data.github_team.****.id
  repository = github_repository.main.name
  permission = "admin"
}

resource "github_team_repository" "****" {
  team_id    = data.github_team.****.id
  repository = github_repository.main.name
  permission = "admin"
}

resource "github_actions_secret" "secrets" {
  for_each = var.secrets

  repository      = github_repository.main.name
  secret_name     = each.key
  plaintext_value = each.value
}

Actual Behavior

We are managing approximately 90 repositories using this module via Terraform Cloud remote operations (which means we can’t disable refresh or change parallelism, as far as I know). I timed a refresh + plan at 9m22s (562s), which works out to roughly 6.2s per repository.

Are there any optimizations we can make on our side or in the github provider / API to try to improve this? We’re discussing breaking up our repos into smaller workspaces, but that feels like a bit of a hack.

Steps to Reproduce

  1. terraform plan on large numbers of repositories / branch protection configs

Important Factoids

  • Running on Terraform Cloud Remote Operation

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 60
  • Comments: 40 (15 by maintainers)

Most upvoted comments

This is very much still a thing, we recently rev’d to latest in our infra and it seems to have only gotten worse 😦

👋 Hey Friends, this issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Please add the Status: Pinned label if you feel that this issue needs to remain open/active. Thank you for your contributions and help in keeping things tidy!

Can we restore the original codepath that did not use GraphQL? Many of us are going to be abandoned on v2.9.2 as a result of this performance problem. I’m not sure what benefit GraphQL has for the end user; I’m sure it’ll be faster someday without the hundreds of REST calls, but it’s a no-op at best from my perspective as a user, and GitHub’s problem, not “mine”, as long as we stay under the API call limit.

Sorry to be grumpy but this is a pretty serious issue. Terraform sometimes times out waiting.

One thing that worked for me and reduced the plan/apply time from 10 min to 1.5 min (50 repos, 3 teams/repo) was to remove the data “github_team” lookups.

@mwarkentin use the team slug directly instead of declaring a data source and using its output.

I know it’s a good approach to use the data source as a safety measure, but in this case dropping it reduces the plan/apply time by ~90%:

resource "github_team_repository" "****" {
  team_id    = <team_slug>
  repository = github_repository.main.name
  permission = "admin"
}

+1 – thanks to everyone taking this issue seriously. We also manage 55 repos, and I was hoping to one day manage about 400, but this has slowed us down too much to consider upgrading past 2.9.2.

FWIW a targeted plan of a single repo’s resources, with five branch protections, WITHOUT refresh and WITHOUT lock, took 15 seconds of real time:

Executed in   15.68 secs   fish           external
   usr time    3.01 secs  111.00 micros    3.01 secs
   sys time    1.36 secs  1250.00 micros    1.36 secs
took 15s

I’d love for GraphQL to cut down the thousands and thousands of REST calls we do per plan, but not at the cost of such an increase in time. I suspect some of this work may be moot / premature optimization given the refresh improvements in 0.14.0 (as I understand them). Perhaps we could go back to REST entirely for these calls until the issues are sorted?

A full plan without refresh and lock took:

Executed in  207.09 secs   fish           external
   usr time   10.18 secs  120.00 micros   10.18 secs
   sys time    5.01 secs  1060.00 micros    5.01 secs
took 3m27s

And a full plan with refresh and lock:

Executed in  524.93 secs   fish           external
   usr time   17.41 secs  163.00 micros   17.41 secs
   sys time   10.02 secs  1487.00 micros   10.02 secs
took 8m44s

For comparison, a full plan with refresh and lock in 2.9.2:

Executed in  309.87 secs   fish           external
   usr time   15.15 secs  107.00 micros   15.15 secs
   sys time    8.19 secs  893.00 micros    8.19 secs
took 5m9s

So on my machine, 3.0+ is 42% slower in a big plan, and in my testing, 300% slower in a single repo with five branches to protect. For reference, I’m on a 2019 MacBook Pro, maxed out, on a 75MB symmetric connection.

We’re seeing it take ~10 minutes for our 331 repos plus associated Buildkite pipelines:

github v4.10.1

$ terraform state list | cut -d. -f3 | sort | uniq -c
    21
    166 buildkite_pipeline
     27 buildkite_pipeline_schedule
    169 github_branch_default
    331 github_repository
    159 github_repository_webhook
      5 github_team
     18 github_team_membership
    819 github_team_repository

Adding -parallelism=50 to terraform plan didn’t seem to make a difference.

Edit: That’s ~1.81s per repo, which doesn’t seem like much, but it adds up

Edit 2: I’ve realised we’re on an older version of Terraform, v0.14.8. I’ll try on the latest

Edit 3: No dice; just as slow on Terraform v0.15.5

@jcudit github_branch_protection and github_repository_file are where I’ve seen the massive speed and rate limit issues

We have thousands of repositories (split into hundreds of separate builds), and this is by far the biggest bottleneck in our automation. This has impacts such as:

  • reduced productivity - we spend a lot of time waiting for builds, which significantly reduces our engineering velocity.
  • monetary cost - we pay for the time spent doing builds, and this incurs tens of thousands of additional build minutes per month.

Right now, I cannot think of a bigger improvement (provider-wise) than addressing this performance issue. I really hope this issue gets prioritised higher - for our organisation, being able to easily automate managing our repositories was a big driver for moving from Bitbucket to GitHub. Now that we’re on GitHub, it’s better, but this issue is a big thorn for us.

Issue opened on Oct 15, 2020, just sayin’ =P

Update: using github_organization_teams with some transformation logic in my locals provides a similar reduction in my build time (assuming it’s only called once), compared to many individual github_team data sources.

Map indexed by team slug:

data "github_organization_teams" "all_teams" {}

locals {
  github_teams = {
    for team in data.github_organization_teams.all_teams.teams : team.slug => team
  }
}
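
A hypothetical consumer of that map would then replace each per-repository data “github_team” lookup; a minimal sketch, assuming a team with the slug "platform" exists in the organization (the slug and resource name are made up):

resource "github_team_repository" "platform" {
  # look the team up in the locally built map instead of a per-repo data "github_team" call
  team_id    = local.github_teams["platform"].id
  repository = github_repository.main.name
  permission = "admin"
}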

Ya this has been a challenge with the provider from the very beginning. I had to design my states around things like aggressive rate limits, poorly optimized queries, lack of cache, etc on top of larger organization scale. I think the best solution at the moment is to split into smaller states based on some criteria such as team ownership.

Given the results above, here are some recommendations:

  • Refine the GraphQL query to be less costly (current query shown below)
  • Revert the github_branch_protection back to the REST implementation, but handle unsupported configurations with the GraphQL implementation
  • Continue as-is and have users maintain smaller state files allowing higher parallelization
  • Revert the github_branch_protection back to the REST implementation, forfeiting the features available via the GraphQL implementation

I will attempt to get the first option into the upcoming bugfix release.

The second option is larger than I have time for but welcome PRs from contributors.

The last two options will need feedback / sign-off from the community. Interested to hear more opinions here if the first two options do not play out well.


current query

{
  "query": "query($id:ID!){node(id: $id){... on BranchProtectionRule{repository{id,name},pushAllowances(first: 100){nodes{actor{... on Team{id,name},... on User{id,name}}}},reviewDismissalAllowances(first: 100){nodes{actor{... on Team{id,name},... on User{id,name}}}},dismissesStaleReviews,id,isAdminEnforced,pattern,requiredApprovingReviewCount,requiredStatusCheckContexts,requiresApprovingReviews,requiresCodeOwnerReviews,requiresCommitSignatures,requiresStatusChecks,requiresStrictStatusChecks,restrictsPushes,restrictsReviewDismissals}}}",
  "variables": {
   "id": "MDIwOkJyYW5jaFByb3RlY3Rpb25SdWxlMTgwMjU3Njc="
  }
}

Thanks for confirming. Adding github_branch_protection to the test above reveals this poor performance.

for i in $(seq 1 100); do cat <<EOF >> main.tf
  resource "github_repository" "repo${i}" {
    name      = "repo${i}"
    auto_init = true
  }
  resource "github_branch_protection" "repo${i}" {
    repository_id = github_repository.repo${i}.node_id
    pattern       = github_repository.repo${i}.default_branch
    required_status_checks {
      strict = true
      contexts = [
        "Terraform",
        "docs",
      ]
    }
    required_pull_request_reviews {
      dismiss_stale_reviews      = true
      require_code_owner_reviews = true
    }
  }
EOF
done

Will focus efforts there and will track github_repository_file in https://github.com/terraform-providers/terraform-provider-github/issues/568 🙇

I see your point @schans; there are indeed dozens of workarounds, but the point is that this provider should just work. As @restless-orca explained, this issue defeats the whole purpose of the provider itself. I should not be put in the position of implementing workarounds in order to make it work in a satisfying way. Keep in mind we are talking about an issue that impacts the core of the provider, not a specific feature.

We have 300+ repos with rulesets and it takes ages to work with them. About 15 minutes for the plan to run…

In our case we have 11 different github_release data sources. After a few terraform plan/apply runs it takes at least 15 minutes to pull data for these 11 data sources. TF_LOG=trace reveals these messages from the provider:

2023-09-29T20:37:50.612+0300 [DEBUG] provider.terraform-provider-github_v5.38.0: 2023/09/29 20:37:50 [DEBUG] Rate limit 60 reached, sleeping for 8m26.392054534s (until 2023-09-29 20:46:17.000000616 +0300 EEST m=+507.646272717) before retrying

So I guess we could just look for this string in the source and rip it out, together with any wait/retry code around it?

Or should we use GitHub auth for the plugin to increase these limits?
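
For what it’s worth, that “Rate limit 60” message is GitHub’s unauthenticated REST limit (60 requests per hour); authenticated requests get 5,000 per hour, so configuring the provider with a token is usually the first fix. A minimal sketch, with illustrative owner and variable names:

provider "github" {
  owner = "my-org"           # organization that owns the repositories (illustrative)
  token = var.github_token   # authenticated requests raise the REST limit from 60 to 5,000/hour
}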

Adding on: a 26-minute run to manage 150 repos. Using summary_only = true reduced this to 14 minutes, and switching to github_branch_protection_v3 reduced it to a few minutes.
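
For anyone trying the same switch, here is a minimal sketch of the REST-based v3 resource, mirroring the branch protection settings from the module at the top of this issue (the branch name "main" is assumed):

resource "github_branch_protection_v3" "main" {
  repository = github_repository.main.name
  branch     = "main"   # the v3 resource takes a branch name rather than a pattern

  required_status_checks {
    strict   = true
    contexts = ["Terraform", "docs"]
  }

  required_pull_request_reviews {
    dismiss_stale_reviews      = true
    require_code_owner_reviews = true
  }
}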

Just adding my 2c in here: I had to do a full (untargeted) plan to catch some drift on our very large Terraform workspace. Normally we get around the performance problem by doing a targeted apply based on what changed, which comes with limitations around traversing modules etc.

In the case yesterday, I thought “I’ll have a beer while this plan chugs along” and it was me who ended up doing all the chugging. ~30-45 mins to plan 3500 resources.

I manage about 100 repos in my personal account, and I keep hitting rate limit issues. I’m using:

github_repository, github_branch_default, github_branch_protection, and github_actions_secret only…

Has anybody figured out workarounds / optimizations for the rate limit issues? This was never an issue before the GraphQL refactor of this provider…

@RuiSMagalhaes THANK YOU ❤️ I can confirm this is the case.

I don’t get why the data source is so slow but this helps a lot 😍 🚤 🐎

Hi all,

To share our current status, as it might work for others as well: we have implemented a workaround (or, better labelled, a trade-off), and for us daily operations are now “fast” without the need to split up into multiple projects.

What we did:

  • split the refresh out into a separate task and run it through a cron job a couple of times per day (3 or 4). This is the task that currently takes over 10 minutes for us, and we’re planning to add another couple of hundred repos
  • run the plan and apply with “-refresh=false”. These jobs now take around 10 to 30 seconds

The trade-off here is that there is a risk that the state is not up to date and the plan and/or apply might run into errors. For us this is acceptable since:

  • we do not allow direct changes in GitHub
  • we use an S3 state backend
  • we have an automated process with CI/CD to run the refresh, plan and apply tasks

HTH

We’re currently close to 500 repos and at the point of moving away from using Terraform to manage them, because runs take over 10 minutes.

The only way I can see this dramatically improving is if all data is fetched at once from the GitHub (GraphQL) API (and cached), because the large number of API calls will always be slow. We have a simple Python script that checks for repos not managed by Terraform which does exactly that, and it runs in 16s. But I don’t know if this approach breaks the “terraform model” 🤷

I’m really hoping #395 will be prioritized, as aside from the security improvements it would allow for a combinatoric decrease in the number of resources needed to manage a large number of repos.

@jcudit if it’s still helpful:

❯ terraform state list | cut -d. -f3 | sort | uniq -c
 194 data
  96 github_actions_secret
  96 github_branch_protection
  96 github_repository
 194 github_team_repository