pulumi: Persistent 429 Rate limit errors from GCS bucket when using GCS State backend for Pulumi
Since yesterday (31 March 2020), we have been seeing persistent 429 rate limit errors with the GCS state backend for Pulumi. We see this error irrespective of the size of the Pulumi stack (even on newly created stacks), and even when only a single object in the stack is being created or modified. We also tried `pulumi up --parallel 1` for single-threaded execution, but still see the error.
Diagnostics:

```
gcp:kms:CryptoKeyIAMBinding (xxx-permissions):
  error: post-step event returned an error: failed to save snapshot: An IO error occurred during the current operation: blob (key ".pulumi/stacks/<stack-name>.json") (code=Unknown): googleapi: Error 429: The rate of change requests to the object <gcs-bucket-name>/.pulumi/stacks/<stack-name>.json exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded

pulumi:pulumi:Stack (<pulumi-project-name>-<stack-name>):
  error: update failed
```
Sometimes the state file even gets deleted when these 429 errors are received, which is weird.
We may have to give up on using GCS buckets entirely for storing Pulumi state. Does anybody know what could be causing this issue, or any workarounds? Thanks.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 14
- Comments: 43 (19 by maintainers)
I met with some of the team regarding this issue on Friday to see if we could figure out a path forward. Here’s where we ended up:
Our first intention was to try to recreate this behaviour locally, which has been something of a struggle. We decided to change course: instead of attempting to recreate GCS's rate limiting, we will mock a backend that sporadically throws 429 responses. This should help us drive improvements. Unfortunately, any potential improvements here may depend on us rethinking the way we handle state files in the cloud provider backends.
We acknowledge that we need to drive some improvements on the cloud provider backends across the board. There are several limitations to the current implementation which aren’t helping with this issue, such as writing all state objects to a single blob.
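For context, GCS documents a limit of roughly one mutation per second per object, so funnelling every snapshot save into one blob makes 429s easy to hit. Here is a minimal Python sketch (with a hypothetical key name, not the real backend code) of why one update produces many writes to the same object:

```python
# Hypothetical sketch: every completed resource step saves the full snapshot
# to the same blob key, so one update touching N resources means N writes to
# a single object -- easily exceeding GCS's roughly once-per-second-per-object
# mutation limit.
STATE_KEY = ".pulumi/stacks/dev.json"  # illustrative key, not the real code

writes = []

def save_snapshot(snapshot):
    # Stand-in for the backend's full-state write after each resource step.
    writes.append(STATE_KEY)

for step in range(10):  # ten resource steps in one update
    save_snapshot({"resources": step + 1})

print(len(writes), len(set(writes)))  # → 10 1  (ten writes, one distinct key)
```

Sharding state across multiple keys, or batching saves, would spread these mutations out instead of hammering one object.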
To cut a long story short - we realize there is work to do here and that this is a painful experience for everyone using this backend and we’re going to work hard on getting fixes out for this as soon as possible.
I’ve been able to reproduce consistently with:
- a GCS bucket with `pulumi login` to it
- a stack containing `aws.s3.BucketObject` resources

After the failure, `pulumi stack ls` no longer thinks the stack exists (though there is a `.bak` file).

Yeah - this sounds like the most immediately addressable option. The patch below appears to fix this for me. I'm not sure this is the way we want to actually fix this (if you open a PR, it would be good to get @pgavlin's input), but something like this may be a pragmatic step to prevent the current issue while we look into a deeper rearchitecture that targets the GCS API calls specifically.
@AgrimPrasad we just merged this into master, which means it’ll be in our next release. Thanks for your patience!
@confiq I just noticed you're using `2.1.0`; this fix isn't in that release. We're cutting a `2.1.1` with this included.

I see the same issue with Pulumi 2.0.0.
Anecdotally, I've found versions of Pulumi newer than 1.13 to be much faster, so perhaps that's why some have better luck on 1.13 🤷♂️
I’ve done some more investigation - here’s what I know so far.
We use `WriteAll` to save the stack state as part of the step execution. We cancel the context if there is an error, which there technically is from the GCS backend.

The potential solutions I can see here aren't great in the short term. Some ideas:
- Wrap the `WriteAll` call and implement our own retry/backoff logic there. Unfortunately, the error that is thrown has an `Unknown` code (example: `(code=Unknown): googleapi: Error 429: The rate of change requests to the object <pulumi-gcs-bucket-name>/.pulumi/stacks/<pulumi-stack-name>.json exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded`), so we may swallow some important errors…
- … the `saveStack` operation. I'm not sure what the implications of this would be.

Things I still don't know about this issue:
- `1.13.0` vs. newer versions

@lukehoban @leezen - any ideas on how you'd like to proceed here?
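The retry/backoff idea could look something like the sketch below, written in Python for brevity (the real backend code is Go). Since the surfaced error carries `code=Unknown`, the only handle on a 429 is the message text, which is exactly the "may swallow important errors" trade-off noted above. `is_rate_limited` and `write_with_retry` are hypothetical names, not Pulumi APIs.

```python
import time

def is_rate_limited(err: Exception) -> bool:
    # The blob layer reports these errors with code=Unknown, so we fall back
    # to matching the message text (an assumption based on this issue's logs).
    msg = str(err)
    return "429" in msg or "rateLimitExceeded" in msg

def write_with_retry(write, max_attempts=5, base_delay=0.1):
    """Call write() and retry with exponential backoff on 429-style errors."""
    for attempt in range(max_attempts):
        try:
            return write()
        except Exception as err:
            if not is_rate_limited(err):
                raise  # don't swallow non-rate-limit errors
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 429
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...

# Simulate a flaky backend that returns 429 twice before succeeding.
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("googleapi: Error 429: ... rateLimitExceeded")
    return "ok"

result = write_with_retry(flaky_write, base_delay=0.01)
print(result, calls["n"])  # → ok 3
```

String-matching error messages is brittle; a more robust fix would map GCS 429s to a typed error code at the blob layer so the retry logic can match on the code rather than the text.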
I’ve started running into this problem as well.
I downgraded to 1.13.0 and it worked perfectly again.
```
brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/873d3245fb27476ee90551f8d43ad15af384dc48/Formula/pulumi.rb
```