sst: StackMetadata causes CloudFormation to hang on creation & deletion, v1.18.4

Hi all, we have been trialling Serverless Stack with a lot of excitement, but today we hit a stumbling block.

We were on v1.15.16, and then when I tried to use the bind method, I realised we were too many versions behind, so we bumped the version all the way to latest (v1.18.4).

Then, when we ran npx sst start to continue development, the deployment hung indefinitely (~30 min) on a small stack. One resource, StackMetadata, was permanently stuck on CREATE_IN_PROGRESS.

Eventually I tried to delete the stack from CloudFormation and start again from scratch. But when I tried this, the deletion failed with the following message:

  • The following resource(s) failed to delete: [StackMetadata].

I then noticed that v1.18.4 had an update with the message “Fix throttled error on updating stack metadata”, but I don’t know if the two are related. I tried to find the corresponding AWS resource but wasn’t sure where to look. I had the impression it was somehow related to Stack Manager, but I couldn’t find anything obvious with a name like localdev-myapp-MyStack-.....

Anyway, I was eventually able to delete the stack; somehow the stack metadata could be deleted on the second attempt. I’ve rolled back the app version to v1.18.2 and I am able to deploy again (taking ~3 min).

Thank you for any assistance and information on what might have gone wrong!


Confounding factors:

  • We were trying out EventBus rules at the time; at one point one of the EventBus rules reported to CloudFormation that it needed manual resource creation, which I thought was unusual, but it was eventually created.
  • I deleted one old stack which was removed from the app, and renamed the remaining stack
  • I changed two resource names (1x EventBus and 1x Api construct)
  • It’s Black Friday, maybe AWS was just having a rough day

For what it’s worth, the stack consisted of the following:

import { Api, EventBus, Queue, StackContext } from "@serverless-stack/resources";

export function MainStack({ stack }: StackContext) {
  const bus = new EventBus(stack, "MainBus");

  const queue = new Queue(stack, "TestQueue", {
    consumer: "functions/debuglogger/lambda.handler",
  });

  const api = new Api(stack, "MainApi", {
    defaults: {
      function: {
        environment: {
          MAIN_BUS_NAME: bus.eventBusName,
        },
      },
    },
  });

  bus.addRules(stack, {
    testRule: {
      pattern: { source: ["some.anonymised.event.source"] },
      targets: {
        queue: queue,
        debuglogger: "functions/debuglogger/lambda.handler",
      },
    },
  });

  api.addRoutes(stack, {
    "POST /adaptwest/reportbridge": {
      function: {
        handler: "functions/oneofourclients/reportbridge/lambda.handler",
      },
    },
  });

  stack.addOutputs({
    ApiEndpoint: api.url,
    EventBusName: bus.eventBusName,
  });

  return { bus, api };
}

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

This is fixed in v2 but wondering if we need to backport the fix?

If you compare the diff between 1.18.2 and 1.18.4, it seems that the culprit for the hang is maxRetries: 1000:

  maxRetries: 1000,
  retryDelayOptions: {
    customBackoff: () => 3000,
  },

at least partially anyway.
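
For a rough sense of scale, here is a minimal sketch of the worst-case wait those retry settings allow. The helper worstCaseRetryDelayMs is hypothetical (it is not part of SST or the AWS SDK), and it assumes every retry is used up and counts only the backoff delays, not the time spent on the requests themselves.

```typescript
// Hypothetical helper, not part of SST or the AWS SDK.
// Sums the backoff delays if every allowed retry is used.
type CustomBackoff = (retryCount: number) => number;

function worstCaseRetryDelayMs(
  maxRetries: number,
  customBackoff: CustomBackoff
): number {
  let totalMs = 0;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    totalMs += customBackoff(attempt);
  }
  return totalMs;
}

// Values quoted in the comment above: 1000 retries, 3000 ms between each.
const totalMs = worstCaseRetryDelayMs(1000, () => 3000);
const totalMinutes = totalMs / 60_000;

console.log(totalMinutes); // 50
```

So a call that keeps failing could wait on backoff alone for up to 50 minutes before giving up, which is at least compatible with a deployment that looks hung after ~30 min.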

As mentioned earlier in this thread - we rolled back to version 1.18.2 and the problem has disappeared.

I used the CLI command sst update 1.18.2 to get the correct dependencies. Pretty sure I nuked the node_modules directory (rm -rf node_modules) and the package-lock.json file first though.

@ceigey quick update on the issue. We finally found the root cause. It’s a CDK bug where Lambda zips get corrupted when multiple are uploaded concurrently. We’ve put in a fix in SST v2. Will cut a release candidate soon.

@michaelleeallen I’m not sure if it’s caused by the upload issue. Can you go to the AWS Lambda console, find the db migration handler, and export the code? It’d help clarify whether the upload is corrupted.
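
One quick sanity check on an exported package, sketched under assumptions: the helper looksLikeCompleteZip below is hypothetical (not part of SST, CDK, or the AWS SDK) and only checks the two standard ZIP signatures, the local file header at the start and the end-of-central-directory record near the end. A file that fails is very likely truncated or mangled; a file that passes can still be corrupted in the middle.

```typescript
// Hypothetical helper, not part of SST, CDK, or the AWS SDK.
const LOCAL_FILE_HEADER = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"
const END_OF_CENTRAL_DIR = [0x50, 0x4b, 0x05, 0x06]; // "PK\x05\x06"

function matchesAt(
  bytes: Uint8Array,
  offset: number,
  signature: number[]
): boolean {
  if (offset < 0 || offset + signature.length > bytes.length) return false;
  return signature.every((value, i) => bytes[offset + i] === value);
}

function looksLikeCompleteZip(bytes: Uint8Array): boolean {
  // A non-empty archive starts with a local file header.
  if (!matchesAt(bytes, 0, LOCAL_FILE_HEADER)) return false;

  // The end-of-central-directory record is at least 22 bytes long and may be
  // followed by a comment of up to 65535 bytes, so scan backwards for it.
  const minRecordSize = 22;
  const earliest = Math.max(0, bytes.length - minRecordSize - 65535);
  for (let offset = bytes.length - minRecordSize; offset >= earliest; offset--) {
    if (matchesAt(bytes, offset, END_OF_CENTRAL_DIR)) return true;
  }
  return false;
}
```

In Node, the bytes could come from reading the downloaded file with fs.readFileSync, since a Buffer is a Uint8Array.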

@CodyDunlap are you having the issue with metadata or db migration? Btw, I ran some tests: emptying the CDK assets bucket does not affect deployed apps, so it’s safe to deploy.