pulumi: Persistent 429 rate limit errors from GCS bucket when using the GCS state backend for Pulumi

Since yesterday (31 March 2020), we have been seeing persistent 429 rate limit errors with the GCS state backend for Pulumi. We see this error irrespective of the size of the Pulumi stack (even on newly created stacks), and even when only a single object in the stack is being created or modified. We also tried pulumi up --parallel 1 to force single-threaded execution, but still see this error.

Diagnostics:
  gcp:kms:CryptoKeyIAMBinding (xxx-permissions):
    error: post-step event returned an error: failed to save snapshot: An IO error occurred during the current operation: blob (key ".pulumi/stacks/<stack-name>.json") (code=Unknown): googleapi: Error 429: The rate of change requests to the object <gcs-bucket-name>/.pulumi/stacks/<stack-name>.json exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded

  pulumi:pulumi:Stack (<pulumi-project-name>-<stack-name>):
    error: update failed

Strangely, the state file sometimes even gets deleted in the process when these 429 errors are received.

We may have to give up on using GCS buckets entirely for storing Pulumi state. Does anybody know what could be causing this issue, or any workarounds? Thanks.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 14
  • Comments: 43 (19 by maintainers)

Most upvoted comments

I met with some of the team regarding this issue on Friday to see if we could figure out a path forward. Here’s where we ended up:

  • The first intention is to try and recreate this behaviour locally, which has been something of a struggle. We decided to change course: instead of attempting to recreate GCS’s rate limiting, we will mock a backend that sporadically throws 429 responses (a rough sketch of such a mock follows this comment). This should help us drive improvements. Unfortunately, any potential improvements here may depend on us rethinking the way we handle state files on the cloud provider backends.

  • We acknowledge that we need to drive some improvements on the cloud provider backends across the board. There are several limitations to the current implementation which aren’t helping with this issue, such as writing all state objects to a single blob.

To cut a long story short - we realize there is work to do here and that this is a painful experience for everyone using this backend, and we’re going to work hard on getting fixes out for this as soon as possible.
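
For illustration only, a mock along those lines could be a thin wrapper around a gocloud.dev bucket whose WriteAll fails a configurable fraction of the time. This is a hedged sketch, not Pulumi’s actual test harness; the FlakyBucket type, the FailureRate field, and returning a raw *googleapi.Error are all assumptions:

package flakybucket

import (
	"context"
	"math/rand"

	"gocloud.dev/blob"
	"google.golang.org/api/googleapi"
)

// FlakyBucket wraps a real gocloud.dev bucket and makes a configurable
// fraction of WriteAll calls fail with a synthetic 429, imitating GCS
// rate limiting without talking to GCS at all.
type FlakyBucket struct {
	Inner       *blob.Bucket
	FailureRate float64 // e.g. 0.5 fails roughly half of the writes
}

// WriteAll mirrors (*blob.Bucket).WriteAll but sporadically returns a
// rate-limit style error before touching the underlying bucket.
func (f *FlakyBucket) WriteAll(ctx context.Context, key string, p []byte, opts *blob.WriterOptions) error {
	if rand.Float64() < f.FailureRate {
		return &googleapi.Error{
			Code:    429,
			Message: "The rate of change requests to the object exceeds the rate limit.",
		}
	}
	return f.Inner.WriteAll(ctx, key, p, opts)
}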

I’ve been able to reproduce consistently with:

  1. Create a “multi-region” GCS bucket in Asia and pulumi login to it
  2. Run a program that creates 100 aws.s3.BucketObject resources (a sketch of such a program follows this list)
  3. About half the resource operations fail with 429 errors
  4. The statefile ends up deleted from storage and pulumi stack ls no longer thinks the stack exists (though there is a .bak file).
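
For concreteness, here is a minimal sketch of the kind of program used in step 2, written against the Go SDK. The SDK versions, resource names, and explicit Key values are assumptions; any program that creates around 100 resources in one update should exercise the same checkpoint-write path:

package main

import (
	"fmt"

	"github.com/pulumi/pulumi-aws/sdk/v2/go/aws/s3"
	"github.com/pulumi/pulumi/sdk/v2/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// One bucket to hold all of the objects.
		bucket, err := s3.NewBucket(ctx, "rate-limit-repro", nil)
		if err != nil {
			return err
		}

		// 100 small objects: every completed step makes the engine rewrite the
		// checkpoint blob in the GCS state backend, which is what trips the
		// per-object rate limit.
		for i := 0; i < 100; i++ {
			_, err := s3.NewBucketObject(ctx, fmt.Sprintf("object-%d", i), &s3.BucketObjectArgs{
				Bucket:  bucket.Bucket,
				Key:     pulumi.String(fmt.Sprintf("object-%d", i)),
				Content: pulumi.String(fmt.Sprintf("payload %d", i)),
			})
			if err != nil {
				return err
			}
		}
		return nil
	})
}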

We could try and catch the error during the WriteAll call and implement our own retry/backoff logic there

Yeah - this sounds like the most immediately addressable option. The patch below appears to fix this for me. I’m not sure this is the way we want to actually fix it (if you open a PR, it would be good to get @pgavlin’s input) - but something like this may be a pragmatic step to mitigate the current issue while we look into a deeper rearchitecture that would let the retries be targeted more specifically at the GCS API calls.

diff --git a/pkg/backend/filestate/backend.go b/pkg/backend/filestate/backend.go
index 1bb4fce5..de3ce3f0 100644
--- a/pkg/backend/filestate/backend.go
+++ b/pkg/backend/filestate/backend.go
@@ -25,6 +25,7 @@ import (
 	"path/filepath"
 	"regexp"
 	"strings"
+	"sync"
 	"time"
 
 	"github.com/pkg/errors"
@@ -72,6 +73,7 @@ type localBackend struct {
 	url         string
 
 	bucket Bucket
+	mutex  sync.Mutex
 }
 
 type localBackendReference struct {
diff --git a/pkg/backend/filestate/state.go b/pkg/backend/filestate/state.go
index be274a1a..adc33cb2 100644
--- a/pkg/backend/filestate/state.go
+++ b/pkg/backend/filestate/state.go
@@ -24,6 +24,8 @@ import (
 	"strings"
 	"time"
 
+	"github.com/pulumi/pulumi/sdk/v2/go/common/util/retry"
+
 	"github.com/pulumi/pulumi/pkg/v2/engine"
 
 	"github.com/pkg/errors"
@@ -162,6 +164,9 @@ func (b *localBackend) getCheckpoint(stackName tokens.QName) (*apitype.Checkpoin
 }
 
 func (b *localBackend) saveStack(name tokens.QName, snap *deploy.Snapshot, sm secrets.Manager) (string, error) {
+	b.mutex.Lock()
+	defer b.mutex.Unlock()
+
 	// Make a serializable stack and then use the encoder to encode it.
 	file := b.stackPath(name)
 	m, ext := encoding.Detect(file)
@@ -183,9 +188,21 @@ func (b *localBackend) saveStack(name tokens.QName, snap *deploy.Snapshot, sm se
 	// Back up the existing file if it already exists.
 	bck := backupTarget(b.bucket, file)
 
-	// And now write out the new snapshot file, overwriting that location.
-	if err = b.bucket.WriteAll(context.TODO(), file, byts, nil); err != nil {
-		return "", errors.Wrap(err, "An IO error occurred during the current operation")
+	_, _, err = retry.Until(context.TODO(), retry.Acceptor{
+		Accept: func(try int, nextRetryTime time.Duration) (bool, interface{}, error) {
+			// And now write out the new snapshot file, overwriting that location.
+			err := b.bucket.WriteAll(context.TODO(), file, byts, nil)
+			if err != nil {
+				if try > 10 {
+					return false, nil, errors.Wrap(err, "An IO error occurred during the current operation")
+				}
+				return false, nil, nil
+			}
+			return true, nil, nil
+		},
+	})
+	if err != nil {
+		return "", err
 	}
 
 	logging.V(7).Infof("Saved stack %s checkpoint to: %s (backup=%s)", name, file, bck)
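
For readers unfamiliar with the retry helper used in the patch, here is a tiny standalone illustration of the retry.Until/Acceptor pattern. The flaky doSomething closure is a placeholder standing in for bucket.WriteAll; the signatures mirror the ones used in the patch above:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/pulumi/pulumi/sdk/v2/go/common/util/retry"
)

func main() {
	attempts := 0
	// doSomething stands in for a flaky call such as bucket.WriteAll.
	doSomething := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("simulated 429")
		}
		return nil
	}

	_, _, err := retry.Until(context.Background(), retry.Acceptor{
		Accept: func(try int, nextRetryTime time.Duration) (bool, interface{}, error) {
			if err := doSomething(); err != nil {
				if try > 10 {
					// Give up: a non-nil error ends the retry loop.
					return false, nil, err
				}
				// Not done yet: returning (false, nil, nil) asks Until to
				// wait and call Accept again.
				return false, nil, nil
			}
			// Success: stop retrying.
			return true, nil, nil
		},
	})
	fmt.Println("attempts:", attempts, "err:", err)
}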

@AgrimPrasad we just merged this into master, which means it’ll be in our next release. Thanks for your patience!

@confiq I just noticed you’re using 2.1.0; this fix isn’t in that release. We’re cutting a 2.1.1 with this included.

I see the same issue with Pulumi 2.0.0

Why some users see a significant improvement on 1.13.0 vs newer versions

Anecdotally, I’ve found versions of Pulumi newer than 1.13 to be much faster, so perhaps that’s why some have better luck on 1.13 🤷‍♂️

I’ve done some more investigation - here’s what I know so far.

  1. The underlying gocloud.dev library performs retries, assuming you don’t cancel the context passed to it.
  2. During a saveStack call, we run WriteAll to save the stack state as part of step execution. We cancel the context if there is an error, which there technically is in this case (the 429 coming back from the GCS backend).

The potential solutions I can see here aren’t great in the short term. Some ideas:

  • We could try and catch the error during the WriteAll call and implement our own retry/backoff logic there. Unfortunately, the error that is thrown has an unknown code (example: (code=Unknown): googleapi: Error 429: The rate of change requests to the object <pulumi-gcs-bucket-name>/.pulumi/stacks/<pulumi-stack-name>.json exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded), so we may swallow some important errors… (one way to narrow the check is sketched after this list)
  • We could avoid cancelling the context if it’s a saveStack operation. I’m not sure what the implications of this would be.
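
As a rough illustration of how the first option might avoid swallowing unrelated errors, one could inspect both the portable error code and the wrapped googleapi error before deciding to retry. isRetryableRateLimit below is a hypothetical helper, and whether errors.As can actually reach the wrapped *googleapi.Error depends on how gocloud.dev wraps driver errors, so treat this as a best-effort sketch:

package retrycheck

import (
	"errors"

	"gocloud.dev/gcerrors"
	"google.golang.org/api/googleapi"
)

// isRetryableRateLimit reports whether err looks like a GCS 429 that is safe
// to retry. The portable gcerrors code is often Unknown for this failure (as
// noted above), so we also try to unwrap the underlying googleapi error and
// check the HTTP status directly.
func isRetryableRateLimit(err error) bool {
	if gcerrors.Code(err) == gcerrors.ResourceExhausted {
		return true
	}
	var apiErr *googleapi.Error
	if errors.As(err, &apiErr) {
		return apiErr.Code == 429
	}
	return false
}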

Things I still don’t know about this issue:

  • Why some users see a significant improvement on 1.13.0 vs newer versions

@lukehoban @leezen - any ideas on how you’d like to proceed here?

I’ve started running into this problem as well.

Downgraded to 1.13.0 and it worked perfectly again.

brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/873d3245fb27476ee90551f8d43ad15af384dc48/Formula/pulumi.rb