terraform-provider-github: Slow performance when managing dozens of repositories
Terraform Version
0.12.6
Affected Resource(s)
- github_repository
- github_branch_protection
- github_team_repository
- github_actions_secret
Terraform Configuration Files
Here’s our repo module (slightly redacted):
```hcl
terraform {
  required_providers {
    github = ">= 3.1.0"
  }
}

locals {
  # Terraform modules must be named `terraform-<provider>-<module name>`
  # so we can extract the provider easily
  provider = element(split("-", var.repository), 1)
}

data "github_team" "****" {
  slug = "****"
}

data "github_team" "****" {
  slug = "****"
}

resource "github_repository" "main" {
  name        = var.repository
  description = var.description
  visibility  = var.visibility

  topics = [
    "terraform",
    "terraform-module",
    "terraform-${local.provider}",
  ]

  has_issues   = var.has_issues
  has_projects = var.has_projects
  has_wiki     = var.has_wiki

  vulnerability_alerts   = true
  delete_branch_on_merge = true
  archived               = var.archived

  dynamic "template" {
    for_each = var.fork ? [] : [var.fork]
    content {
      owner      = "waveaccounting"
      repository = "****"
    }
  }
}

resource "github_branch_protection" "main" {
  repository_id = github_repository.main.node_id
  pattern       = github_repository.main.default_branch

  required_status_checks {
    strict = true
    contexts = [
      "Terraform",
      "docs",
    ]
  }

  required_pull_request_reviews {
    dismiss_stale_reviews      = true
    require_code_owner_reviews = true
  }
}

resource "github_team_repository" "****" {
  team_id    = data.github_team.****.id
  repository = github_repository.main.name
  permission = "admin"
}

resource "github_team_repository" "****" {
  team_id    = data.github_team.****.id
  repository = github_repository.main.name
  permission = "admin"
}

resource "github_actions_secret" "secrets" {
  for_each        = var.secrets
  repository      = github_repository.main.name
  secret_name     = each.key
  plaintext_value = each.value
}
```
Actual Behavior
We are managing approximately 90 repositories using this module via Terraform Cloud remote operations (which means we can’t disable refresh or change parallelization, afaik). I timed a refresh + plan: 9m22s (562s), or about 6.2s per repository.
Are there any optimizations we can make on our side or in the github provider / API to try to improve this? We’re discussing breaking up our repos into smaller workspaces, but that feels like a bit of a hack.
Steps to Reproduce
1. Run `terraform plan` against a workspace with a large number of repositories / branch protection configs
Important Factoids
- Running on Terraform Cloud Remote Operation
References
- Similar issue to https://github.com/terraform-providers/terraform-provider-github/issues/565, although things weren’t particularly fast before the update either
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 60
- Comments: 40 (15 by maintainers)
This is very much still a thing, we recently rev’d to latest in our infra and it seems to have only gotten worse 😦
👋 Hey Friends, this issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Please add the Status: Pinned label if you feel that this issue needs to remain open/active. Thank you for your contributions and help in keeping things tidy!

Can we restore the original codepath that did not use GraphQL? Many of us are going to be abandoned on v2.9.2 as a result of this performance problem. I’m not sure what benefit GraphQL has to the end user; I’m sure it’ll be faster someday without the hundreds of REST calls, but it’s a noop at best from my perspective as a user, and GitHub’s problem, not “mine”, as long as we stay under the API call limit.
Sorry to be grumpy but this is a pretty serious issue. Terraform sometimes times out waiting.
One thing that worked for me and reduced the plan/apply time from 10 min to 1.5 min (50 repos, 3 teams/repo) was to remove the `data "github_team"` lookups.

@mwarkentin use the team slug directly instead of doing a data lookup and using its output.

I know it’s a good approach to use the data source as a safety measure, but in this case it will reduce the plan/apply time by 90%.
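To illustrate the suggestion, here is a minimal sketch (the `platform` team name is hypothetical; as I understand it, recent provider versions accept a team slug in `team_id`, which avoids one `GET /orgs/{org}/teams/{slug}` call per module instance):

```hcl
# Before: a data lookup per module instance, i.e. one API round-trip each.
# data "github_team" "platform" {
#   slug = "platform"
# }

# After: pass the slug straight through; no lookup round-trip on refresh.
resource "github_team_repository" "platform" {
  team_id    = "platform" # team slug, instead of data.github_team.platform.id
  repository = github_repository.main.name
  permission = "admin"
}
```

The trade-off is that a typo in the slug is only caught at apply time rather than at plan time.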
+1 – thanks to everyone taking this issue seriously. We also manage 55 repos, and I was hoping to one day manage about 400, but this has slowed us down too much to consider upgrading past 2.9.2.
FWIW a targeted plan of a single repo’s resources, with five branch protections, WITHOUT refresh and WITHOUT lock, took 15 seconds of real time:
I’d love for graphql to cut the thousands and thousands of REST calls we do per plan down, but not for such an increase in time. I suspect some of this work may be moot/pre-optimization given the refresh improvements in 0.14.0 (as I understand them)–perhaps we could go back to REST entirely for these calls until the issues are sorted?
A full plan without refresh and lock took:
And a full plan with refresh and lock:
For comparison, a full plan with refresh and lock in 2.9.2:
So on my machine, 3.0+ is 42% slower in a big plan, and in my testing, 300% slower in a single repo with five branches to protect. For reference, I’m on a 2019 MacBook Pro, maxed out, on a 75MB symmetric connection.
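For anyone wanting to reproduce those timings locally, the targeted plan described above can be run with flags like the following (the module address is a placeholder, not from the original config):

```shell
# Skip the refresh walk and state locking, and limit the graph to a single
# module instance, so only the plan itself is timed.
time terraform plan -refresh=false -lock=false -target='module.repo["example"]'
```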
We’re seeing it take ~10 minutes for our 331 repos plus associated Buildkite pipelines:
github v4.10.1
Adding `-parallelism=50` to `terraform plan` didn’t seem to make a difference.

Edit: That’s ~1.81s per repo, which doesn’t seem like much, but it adds up
Edit 2: I’ve realised we’re on an older version of Terraform, v0.14.8. I’ll try on the latest
Edit 3: No dice; just as slow on Terraform v0.15.5
@jcudit `github_branch_protection` and `github_repository_file` are where I’ve seen the massive speed and rate limit issues.

We have thousands of repositories (split into hundreds of separate builds), and this is by far the biggest bottleneck in our automation. This has impacts such as:
Right now, I cannot think of a bigger improvement (provider-wise) than addressing this performance issue. I really hope this issue will be prioritised higher. For our organisation, being able to easily automate managing our repositories was a big driver in moving from Bitbucket to GitHub. Now we’re here on GitHub, and it’s better, but this issue is a big thorn for us.
Issue opened on Oct 15, 2020, just sayin’ =P
Update: using `github_organization_teams` with some transformation logic in my `locals` provides a similar reduction in my build time (assuming that it’s only called once) compared to many individual `github_team` objects. Map indexed by team slug:
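The elided snippet presumably looked something like this sketch (attribute names follow my reading of the `github_organization_teams` data source, which exports a `teams` list with `id`, `node_id`, `slug`, etc.):

```hcl
# One API call fetches every team in the organization...
data "github_organization_teams" "all" {}

locals {
  # ...and indexing the result by slug makes later lookups purely local,
  # instead of one data-source call per team per module instance.
  teams_by_slug = {
    for team in data.github_organization_teams.all.teams : team.slug => team.id
  }
}

# Hypothetical usage: local.teams_by_slug["platform"] in place of
# data.github_team.platform.id
```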
Ya this has been a challenge with the provider from the very beginning. I had to design my states around things like aggressive rate limits, poorly optimized queries, lack of cache, etc on top of larger organization scale. I think the best solution at the moment is to split into smaller states based on some criteria such as team ownership.
Given the results above, here are some recommendations:

- Revert `github_branch_protection` to the REST implementation, but handle unsupported configurations with the GraphQL implementation
- Revert `github_branch_protection` to the REST implementation, forfeiting the features available via the GraphQL implementation

I will attempt to get the first option into the upcoming bugfix release.
The second option is larger than I have time for but welcome PRs from contributors.
The last two options will need feedback / sign-off from the community. Interested to hear more opinions here if the first two options do not play out well.
Thanks for confirming. Adding `github_branch_protection` to the test above reveals this poor performance. Will focus efforts there, and will track `github_repository_file` in https://github.com/terraform-providers/terraform-provider-github/issues/568 🙇

I see your point @schans; actually there are tens of workarounds, but the point is that this provider should work. This issue, as explained by @restless-orca, is defeating the whole purpose of the provider itself. I should not be put in a position to implement workarounds in order to make it work in a satisfying way. Keep in mind we are talking about an issue that impacts the core of the provider, not a specific feature.
We have 300+ repos with rulesets and it takes ages to work with it. About 15mins for the plan to run…
In our case we have 11 different `github_release` data sources. After a few terraform plan/apply runs it takes at least 15 minutes to pull data for these 11 data sources. TF_LOG=trace reveals these messages from the provider:

So I guess we can just look for this string in the source and throw it away, together with any wait/retry code around it?.. Or should we use GitHub auth for the plugin to increase these limits?..
Adding on: a 26 minute run to manage 150 repos. Using `summary_only = true` reduced this to 14 minutes, and using `github_branch_protection_v3` reduced it to a few minutes.

Just adding my 2c in here: I had to do a fully targeted plan to catch some drift on our very large Terraform workspace. Normally we overcome the performance by doing a targeted apply based on what changes. This comes with limitations of traversing modules etc.
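For reference, here is a hedged sketch of how the module’s `github_branch_protection` resource from the issue body might translate to the REST-based `github_branch_protection_v3` (argument names follow my reading of the v3 resource docs; it takes the repository and branch names rather than a node ID and pattern):

```hcl
# REST-based variant: avoids the GraphQL calls blamed for the slowdown above.
resource "github_branch_protection_v3" "main" {
  repository = github_repository.main.name
  branch     = github_repository.main.default_branch

  required_status_checks {
    strict   = true
    contexts = ["Terraform", "docs"]
  }

  required_pull_request_reviews {
    dismiss_stale_reviews      = true
    require_code_owner_reviews = true
  }
}
```

Note that v3 matches a single named branch, not a glob pattern, so some GraphQL-only features are lost in the translation.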
In the case yesterday, I thought “I’ll have a beer while this plan chugs along” and it was me who ended up doing all the chugging. ~30-45 mins to plan 3500 resources.
I manage about ~100 repos in my personal account, and I keep hitting rate limit issues. I’m using only `github_repository`, `github_branch_default`, `github_branch_protection`, and `github_actions_secret`… has anybody figured out workarounds / optimizations for the rate limit issues? This was never an issue before the GraphQL refactor of this provider…
@RuiSMagalhaes THANK YOU ❤️ I can confirm this is the case.
I don’t get why the data source is so slow but this helps a lot 😍 🚤 🐎
Hi all,
To share our current status, as this might work for others as well: we have implemented a workaround (or, maybe better labelled, a trade-off), and for us this solution is now “fast” for daily operations without the need to split up into multiple projects.
What we did:
The trade-off here is that there is a risk that the state is not up to date and the plan and/or apply might run into errors. For us this is acceptable since:
HTH
We’re currently close to 500 repos and on the point of moving away from using Terraform to manage them, because runs take over 10 minutes.
The only way I can see this dramatically improving is if all data is fetched at once from the GitHub (GraphQL) API (and cached), because the large number of API calls will always be slow. We have a simple Python script that checks for repos not managed by Terraform that does this, and it runs in 16s. But I don’t know if this approach breaks the “terraform model” 🤷
I’m really hoping #395 will be prioritized, as aside from the security improvements it would allow for a combinatoric decrease in the number of resources needed to manage a large number of repos.
@jcudit if it’s still helpful: