pulumi: Persistent 429 rate limit errors from GCS bucket when using the GCS state backend for Pulumi

Since yesterday (31 March 2020), we have been seeing persistent 429 rate limit errors with the GCS state backend for Pulumi. We see this error irrespective of the size of the Pulumi stack (even on newly created stacks), and even when only a single object in the stack is being created or modified. We also tried pulumi up --parallel 1 to force single-threaded execution, but still see this error.

Diagnostics:
  gcp:kms:CryptoKeyIAMBinding (xxx-permissions):
    error: post-step event returned an error: failed to save snapshot: An IO error occurred during the current operation: blob (key ".pulumi/stacks/<stack-name>.json") (code=Unknown): googleapi: Error 429: The rate of change requests to the object <gcs-bucket-name>/.pulumi/stacks/<stack-name>.json exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded

  pulumi:pulumi:Stack (<pulumi-project-name>-<stack-name>):
    error: update failed

Strangely, the state file sometimes even gets deleted in the process when these 429 errors are received.

We may have to give up on using GCS buckets entirely for storing Pulumi state. Does anybody know what could be causing this issue, or any workarounds? Thanks.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 14
  • Comments: 43 (19 by maintainers)

Most upvoted comments

I met with some of the team regarding this issue on Friday to see if we could figure out a path forward. Here’s where we ended up:

  • The first intention is to try and recreate this behaviour locally, which has been something of a struggle. We decided to change course: instead of attempting to recreate GCS’s rate limiting, we will mock a backend that sporadically throws 429 responses (a rough sketch of such a mock follows this comment). This should help us drive improvements. Unfortunately, any potential improvements here may depend on us rethinking the way we handle state files on the cloud provider backends.

  • We acknowledge that we need to drive some improvements on the cloud provider backends across the board. There are several limitations to the current implementation which aren’t helping with this issue, such as writing all state objects to a single blob.

To cut a long story short - we realize there is work to do here and that this is a painful experience for everyone using this backend, and we’re going to work hard on getting fixes out for this as soon as possible.
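
For illustration only, a mock along those lines could be a thin wrapper around a gocloud.dev bucket whose WriteAll fails a configurable fraction of the time. This is a hedged sketch, not Pulumi’s actual test harness; the FlakyBucket type, the FailureRate field, and returning a raw *googleapi.Error are all assumptions:

package flakybucket

import (
	"context"
	"math/rand"

	"gocloud.dev/blob"
	"google.golang.org/api/googleapi"
)

// FlakyBucket wraps a real gocloud.dev bucket and makes a configurable
// fraction of WriteAll calls fail with a synthetic 429, imitating GCS
// rate limiting without talking to GCS at all.
type FlakyBucket struct {
	Inner       *blob.Bucket
	FailureRate float64 // e.g. 0.5 fails roughly half of the writes
}

// WriteAll mirrors (*blob.Bucket).WriteAll but sporadically returns a
// rate-limit style error before touching the underlying bucket.
func (f *FlakyBucket) WriteAll(ctx context.Context, key string, p []byte, opts *blob.WriterOptions) error {
	if rand.Float64() < f.FailureRate {
		return &googleapi.Error{
			Code:    429,
			Message: "The rate of change requests to the object exceeds the rate limit.",
		}
	}
	return f.Inner.WriteAll(ctx, key, p, opts)
}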

I’ve been able to reproduce consistently with:

  1. Create a “multi-region” GCS bucket in Asia and pulumi login to it
  2. Run a program that creates 100 aws.s3.BucketObject resources (a sketch of such a program follows this list)
  3. About half the resource operations fail with 429 errors
  4. The statefile ends up deleted from storage and pulumi stack ls no longer thinks the stack exists (though there is a .bak file).
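
For concreteness, here is a minimal sketch of the kind of program used in step 2, written against the Go SDK. The SDK versions, resource names, and explicit Key values are assumptions; any program that creates around 100 resources in one update should exercise the same checkpoint-write path:

package main

import (
	"fmt"

	"github.com/pulumi/pulumi-aws/sdk/v2/go/aws/s3"
	"github.com/pulumi/pulumi/sdk/v2/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// One bucket to hold all of the objects.
		bucket, err := s3.NewBucket(ctx, "rate-limit-repro", nil)
		if err != nil {
			return err
		}

		// 100 small objects: every completed step makes the engine rewrite the
		// checkpoint blob in the GCS state backend, which is what trips the
		// per-object rate limit.
		for i := 0; i < 100; i++ {
			_, err := s3.NewBucketObject(ctx, fmt.Sprintf("object-%d", i), &s3.BucketObjectArgs{
				Bucket:  bucket.Bucket,
				Key:     pulumi.String(fmt.Sprintf("object-%d", i)),
				Content: pulumi.String(fmt.Sprintf("payload %d", i)),
			})
			if err != nil {
				return err
			}
		}
		return nil
	})
}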

We could try and catch the error during the WriteAll call and implement our own retry/backoff logic there

Yeah - this sounds like the most immediately addressable option. The patch below appears to fix this for me. I’m not sure this is the way we want to actually fix it (if you open a PR, it would be good to get @pgavlin’s input) - but something like this may be a pragmatic step to mitigate the current issue while we look into a deeper rearchitecture that would let the retries be targeted more specifically at the GCS API calls.

diff --git a/pkg/backend/filestate/backend.go b/pkg/backend/filestate/backend.go
index 1bb4fce5..de3ce3f0 100644
--- a/pkg/backend/filestate/backend.go
+++ b/pkg/backend/filestate/backend.go
@@ -25,6 +25,7 @@ import (
 	"path/filepath"
 	"regexp"
 	"strings"
+	"sync"
 	"time"
 
 	"github.com/pkg/errors"
@@ -72,6 +73,7 @@ type localBackend struct {
 	url         string
 
 	bucket Bucket
+	mutex  sync.Mutex
 }
 
 type localBackendReference struct {
diff --git a/pkg/backend/filestate/state.go b/pkg/backend/filestate/state.go
index be274a1a..adc33cb2 100644
--- a/pkg/backend/filestate/state.go
+++ b/pkg/backend/filestate/state.go
@@ -24,6 +24,8 @@ import (
 	"strings"
 	"time"
 
+	"github.com/pulumi/pulumi/sdk/v2/go/common/util/retry"
+
 	"github.com/pulumi/pulumi/pkg/v2/engine"
 
 	"github.com/pkg/errors"
@@ -162,6 +164,9 @@ func (b *localBackend) getCheckpoint(stackName tokens.QName) (*apitype.Checkpoin
 }
 
 func (b *localBackend) saveStack(name tokens.QName, snap *deploy.Snapshot, sm secrets.Manager) (string, error) {
+	b.mutex.Lock()
+	defer b.mutex.Unlock()
+
 	// Make a serializable stack and then use the encoder to encode it.
 	file := b.stackPath(name)
 	m, ext := encoding.Detect(file)
@@ -183,9 +188,21 @@ func (b *localBackend) saveStack(name tokens.QName, snap *deploy.Snapshot, sm se
 	// Back up the existing file if it already exists.
 	bck := backupTarget(b.bucket, file)
 
-	// And now write out the new snapshot file, overwriting that location.
-	if err = b.bucket.WriteAll(context.TODO(), file, byts, nil); err != nil {
-		return "", errors.Wrap(err, "An IO error occurred during the current operation")
+	_, _, err = retry.Until(context.TODO(), retry.Acceptor{
+		Accept: func(try int, nextRetryTime time.Duration) (bool, interface{}, error) {
+			// And now write out the new snapshot file, overwriting that location.
+			err := b.bucket.WriteAll(context.TODO(), file, byts, nil)
+			if err != nil {
+				if try > 10 {
+					return false, nil, errors.Wrap(err, "An IO error occurred during the current operation")
+				}
+				return false, nil, nil
+			}
+			return true, nil, nil
+		},
+	})
+	if err != nil {
+		return "", err
 	}
 
 	logging.V(7).Infof("Saved stack %s checkpoint to: %s (backup=%s)", name, file, bck)
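
For readers unfamiliar with the retry helper used in the patch, here is a tiny standalone illustration of the retry.Until/Acceptor pattern. The flaky doSomething closure is a placeholder standing in for bucket.WriteAll; the signatures mirror the ones used in the patch above:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/pulumi/pulumi/sdk/v2/go/common/util/retry"
)

func main() {
	attempts := 0
	// doSomething stands in for a flaky call such as bucket.WriteAll.
	doSomething := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("simulated 429")
		}
		return nil
	}

	_, _, err := retry.Until(context.Background(), retry.Acceptor{
		Accept: func(try int, nextRetryTime time.Duration) (bool, interface{}, error) {
			if err := doSomething(); err != nil {
				if try > 10 {
					// Give up: a non-nil error ends the retry loop.
					return false, nil, err
				}
				// Not done yet: returning (false, nil, nil) asks Until to
				// wait and call Accept again.
				return false, nil, nil
			}
			// Success: stop retrying.
			return true, nil, nil
		},
	})
	fmt.Println("attempts:", attempts, "err:", err)
}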

@AgrimPrasad we just merged this into master, which means it’ll be in our next release. Thanks for your patience!

@confiq I just noticed you’re using 2.1.0; this fix isn’t in that release. We’re cutting a 2.1.1 with this included.

I see the same issue with Pulumi 2.0.0

Why some users see a significant improvement on 1.13.0 vs newer versions

Anecdotally, I’ve found versions of Pulumi newer than 1.13 to be much faster, so perhaps that’s why some have better luck on 1.13 🤷‍♂️

I’ve done some more investigation - here’s what I know so far.

  1. The underlying gocloud.dev library performs retries, assuming you don’t cancel the context passed to it.
  2. During a saveStack call, we run WriteAll to save the stack state as part of step execution. We cancel the context if there is an error, which there technically is in this case (the 429 coming back from the GCS backend).

The potential solutions I can see here aren’t great in the short term. Some ideas:

  • We could try and catch the error during the WriteAll call and implement our own retry/backoff logic there. Unfortunately, the error that is thrown has an unknown code (example: (code=Unknown): googleapi: Error 429: The rate of change requests to the object <pulumi-gcs-bucket-name>/.pulumi/stacks/<pulumi-stack-name>.json exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded), so we may swallow some important errors… (one way to narrow the check is sketched after this list)
  • We could avoid cancelling the context if it’s a saveStack operation. I’m not sure what the implications of this would be.
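
As a rough illustration of how the first option might avoid swallowing unrelated errors, one could inspect both the portable error code and the wrapped googleapi error before deciding to retry. isRetryableRateLimit below is a hypothetical helper, and whether errors.As can actually reach the wrapped *googleapi.Error depends on how gocloud.dev wraps driver errors, so treat this as a best-effort sketch:

package retrycheck

import (
	"errors"

	"gocloud.dev/gcerrors"
	"google.golang.org/api/googleapi"
)

// isRetryableRateLimit reports whether err looks like a GCS 429 that is safe
// to retry. The portable gcerrors code is often Unknown for this failure (as
// noted above), so we also try to unwrap the underlying googleapi error and
// check the HTTP status directly.
func isRetryableRateLimit(err error) bool {
	if gcerrors.Code(err) == gcerrors.ResourceExhausted {
		return true
	}
	var apiErr *googleapi.Error
	if errors.As(err, &apiErr) {
		return apiErr.Code == 429
	}
	return false
}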

Things I still don’t know about this issue:

  • Why some users see a significant improvement on 1.13.0 vs newer versions

@lukehoban @leezen - any ideas on how you’d like to proceed here?

I’ve started running into this problem as well.

Downgraded to 1.13.0 and it worked perfectly again.

brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/873d3245fb27476ee90551f8d43ad15af384dc48/Formula/pulumi.rb