terraform-provider-aws: [Bug]: Memory consumption has increased since v4.67.0
Terraform Core Version
1.4.5
AWS Provider Version
4.67.0, 5.0.1, 5.1.0
Affected Resource(s)
This does not appear to be related to specific resources. For example, memory usage increases even when only the provider is declared, without any resources.
Expected Behavior
Memory usage stays roughly the same after updating the provider.
Actual Behavior
In our actual environment, which sets up GuardDuty and EventBus in all available regions (17) across 5 AWS accounts, max RSS during a refresh was measured with `/usr/bin/time -v terraform plan -refresh-only`:
| provider version | v4.66.1 | v4.67.0 | v5.0.1 |
|---|---|---|---|
| Max RSS | 2,645M | 3,654M | 3,634M |
| Increase | - | +38% | +37% |
Relevant Error/Panic Output Snippet
No response
Terraform Configuration Files
https://github.com/at-wat/terraform-provider-aws-v4.67-memory-usage/
Steps to Reproduce
See https://github.com/at-wat/terraform-provider-aws-v4.67-memory-usage/blob/main/README.md
This measures max RSS when refreshing `aws_iam_policy_document` data sources in 17 regions. It shows an increase rate similar to that of our actual infrastructure code.
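For reference, the same max-RSS measurement as `/usr/bin/time -v` can be scripted. A minimal Go sketch (Unix-only, assuming `terraform` is on `PATH`; the `Maxrss` field is reported in kilobytes on Linux):

```go
// Run `terraform plan -refresh-only` and report the child's max RSS,
// roughly what `/usr/bin/time -v` prints as "Maximum resident set size".
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("terraform", "plan", "-refresh-only")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		log.Fatalf("terraform plan failed: %v", err)
	}

	// SysUsage returns *syscall.Rusage on Unix platforms; Maxrss is in
	// kilobytes on Linux (bytes on macOS).
	ru := cmd.ProcessState.SysUsage().(*syscall.Rusage)
	fmt.Printf("Max RSS: %d KB\n", ru.Maxrss)
}
```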
Debug Output
No response
Panic Output
No response
Important Factoids
No response
References
No response
Would you like to implement a fix?
None
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 40
- Comments: 17 (12 by maintainers)
Commits related to this issue
- internal/tags: Used shared memory allocation for shared tag schemas Reference: https://github.com/hashicorp/terraform-provider-aws/issues/31722 Rather than allocating memory for each shared tag sche... — committed to hashicorp/terraform-provider-aws by bflad a year ago
- service/wafv2: Allocate memory once for emptySchema() Reference: https://github.com/hashicorp/terraform-provider-aws/issues/31722 Happened to notice this was generating a decent number of memory all... — committed to hashicorp/terraform-provider-aws by bflad a year ago
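Both commits apply the same allocate-once pattern: build a schema value a single time and share the pointer, instead of re-allocating an identical structure for every resource that uses it. A minimal sketch of the idea (names and fields here are illustrative, not the provider's actual code):

```go
// Illustrative allocate-once pattern; the real changes live in
// internal/tags and service/wafv2 of terraform-provider-aws.
package example

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Before: every caller gets a freshly allocated schema, so a schema shared
// by hundreds of resources is allocated hundreds of times.
func tagsSchemaPerCall() *schema.Schema {
	return &schema.Schema{
		Type:     schema.TypeMap,
		Optional: true,
		Elem:     &schema.Schema{Type: schema.TypeString},
	}
}

// After: allocate once at package init and hand out the same pointer.
// Safe only as long as callers treat the schema value as immutable.
var sharedTagsSchema = &schema.Schema{
	Type:     schema.TypeMap,
	Optional: true,
	Elem:     &schema.Schema{Type: schema.TypeString},
}

func tagsSchemaShared() *schema.Schema {
	return sharedTagsSchema
}
```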
Hi all! 👋 Wanted to give an update on our current status. The Terraform AWS Provider, Terraform DevEx, and Terraform Core teams are all engaged in working towards a solution. The main problem is peak memory usage, and this peak occurs when Terraform makes calls to configured providers to load their resource/data source schemas. The Terraform protocol contains a single RPC which asks for all schemas, regardless of the specific resources configured. This means that as the AWS provider grows, the memory requirements for using it also grow. Some resources, such as QuickSight and WAFv2, have extremely large nested schemas which, as we have seen, can have an outsized effect on memory.
Changing Terraform to load only the schemas for configured resources is likely feasible, but it is a very large undertaking requiring changes to the protocol (and thus changes to Terraform, terraform-plugin-framework, terraform-plugin-sdk, and the provider). This change should decrease the memory footprint significantly for most users, with memory requirements varying based on the particular resources configured. At this point we are investigating the feasibility of this change, but cannot yet give any indication of timing.
In the short term, however, there are a number of smaller optimizations we can make, which we hope will alleviate some of the pressure while we research longer-term improvements. Some of these modifications will ship in the next version of the AWS provider. We will also be more intentional about accepting PR submissions for resources/data sources with particularly massive schemas, at least until their memory requirements are more closely understood. In some cases we may need to leave these out until we can resolve the root issue. More updates to come when we have details we can share.
Hi @at-wat 👋, thank you for the legwork in profiling this issue! Just to let you know we are actively investigating this with high priority in order to find a resolution. Will update this thread once we have a way forward.
Updated the reproduction repo: https://github.com/at-wat/terraform-provider-aws-v4.67-memory-usage/
and confirmed that memory usage is significantly reduced with terraform-provider-aws v5.20.0 and Terraform 1.6.0.
Happy Friday everyone!
I do have an update for you. Terraform 1.6 (currently in alpha) will include new functionality that allows a cached provider schema to be used, which should significantly reduce memory consumption for configurations including multiple instances of the same provider. In the near future terraform-plugin-go, terraform-plugin-sdk, and terraform-plugin-framework will all be updated to take advantage of this, and provider authors (including the AWS Provider maintainers) will need to update these dependencies for it to take effect. I can't give timings for when these changes will all be available, but this should be a significant step forward for those experiencing memory issues with multiple AWS providers configured.
Additionally, we have introduced a regex cache to the provider (released in v5.14.0), which in testing appears to reduce memory consumption significantly.
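The core idea of such a cache is to compile each pattern once and share the resulting `*regexp.Regexp` across all callers, instead of recompiling (and re-allocating) per validation call. A minimal sketch, not the provider's actual implementation:

```go
// Minimal concurrent regex cache sketch: pattern string -> *regexp.Regexp.
package regexcache

import (
	"regexp"
	"sync"
)

var cache sync.Map

// MustCompile returns a shared compiled regexp for pattern, compiling it
// on first use. Like regexp.MustCompile, it panics on an invalid pattern.
func MustCompile(pattern string) *regexp.Regexp {
	if re, ok := cache.Load(pattern); ok {
		return re.(*regexp.Regexp)
	}
	re, _ := cache.LoadOrStore(pattern, regexp.MustCompile(pattern))
	return re.(*regexp.Regexp)
}
```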
We are continuing to work with the Terraform Core and DevEx teams on further improvements, and will update this issue with new information as I have it.
Thanks for the research @at-wat, we've had to double the size of our CodeBuild runners to avoid these kinds of errors, which are due to memory exhaustion:
"The plugin.(*GRPCProvider).ValidateDataResourceConfig request was cancelled."
Like @n00borama, we've had to double the size of our runners to avoid these errors, which of course has a negative cost impact.
Also tried to play around with parallelism by lowering it from the default of 10 to 4 (`-parallelism=4`), but no joy.
Hi @breathingdust, hope all is well. Are there any updates on this you could share out of curiosity? Have there been any developments?
https://github.com/hashicorp/terraform-provider-aws/issues/31722#issuecomment-1573255055
A new large object that appeared in the pprof graph of v5.1.0 comes from `internal/service/quicksight/schema.visualsSchema()`. It may be related to the new QuickSight resources:
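For anyone who wants to reproduce this kind of heap analysis, a profile can be captured from a locally built binary with Go's `runtime/pprof` and inspected with `go tool pprof`. A minimal sketch (where and when to capture the profile is up to the investigator):

```go
// Write a heap profile to heap.out; inspect it afterwards with:
//   go tool pprof -http=:8080 heap.out
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func writeHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // run a GC first so the profile reflects live objects
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}

func main() {
	// ... exercise the code under investigation (e.g. build the schemas) ...
	writeHeapProfile("heap.out")
}
```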