aws-cdk: (custom-resources): empty onEvent handler zips being created, failing deploys

Describe the bug

We recently started to see our integration tests failing, even though deploys were succeeding. The failures on the integration tests look like this:

sent 1,788 bytes  received 35 bytes  3,646.00 bytes/sec
total size is 1,680  speedup is 0.92
fatal: Not a valid object name integ
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/x/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/xxx/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.8e18eb5caccd2617fb76e648fa6a35dc0ece98c4681942bc6861f41afdff6a1b.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/xxx/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.e2277687077a2abf9ae1af1cc9565e6715e2ebb62f79ec53aa75a1af9298f642.zip'

 ❌ Deployment failed: Error: Failed to publish asset a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5:current_account-current_region
    at Deployments.publishSingleAsset (/home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:11458)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.publishAsset (/home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:151474)
    at async /home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:136916
Failed to publish asset a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5:current_account-current_region
  FAILED     integ/constructs/xyz/integ.cluster-IntegTest/DefaultTest (undefined/us-east-1) 29.135s
      Integration test failed: TypeError [ERR_STREAM_NULL_VALUES]: May not write null values to stream

When we then look in our S3 bucket, we find a series of 22-byte zip files (22 bytes is exactly the size of a zip archive with zero entries, i.e. an empty zip). These three images are from three separate build attempts, all with fresh, empty cdk.out directories, and all after we had wiped out the S3 cache files:

[Screenshots: S3 console listings from three separate build attempts on 2023-09-28, each showing the 22-byte zip objects]
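
For anyone hitting the same symptom: since an empty zip is exactly 22 bytes, the corrupt objects can be spotted by scanning the staging bucket for zips at or under that size. Here is a minimal sketch using the AWS SDK for JavaScript v3; the bucket name placeholder follows the default CDK bootstrap naming pattern and is an assumption, not something from our actual setup:

```ts
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

// An empty zip is nothing but the 22-byte End of Central Directory record.
const EMPTY_ZIP_SIZE = 22;

async function findEmptyZips(bucket: string): Promise<string[]> {
  const s3 = new S3Client({});
  const suspects: string[] = [];
  let token: string | undefined;
  do {
    // Page through the bucket; each page returns up to 1,000 objects.
    const page = await s3.send(
      new ListObjectsV2Command({ Bucket: bucket, ContinuationToken: token })
    );
    for (const obj of page.Contents ?? []) {
      if (obj.Key?.endsWith(".zip") && (obj.Size ?? 0) <= EMPTY_ZIP_SIZE) {
        suspects.push(obj.Key);
      }
    }
    token = page.NextContinuationToken;
  } while (token);
  return suspects;
}

// Hypothetical bucket name; substitute your bootstrap assets bucket.
findEmptyZips("cdk-hnb659fds-assets-ACCOUNT-REGION").then((keys) =>
  keys.forEach((k) => console.log(`suspect empty asset: ${k}`))
);
```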

When we dug into it, we found that these files are all related to the onEvent handlers for the custom-resources constructs. Going back in time a bit, these hash values appear to show up at or around https://github.com/aws/aws-cdk/commit/a9ed64f2aa8014626857dfdfb33a823cd9cfd1fa#diff-8bf3c7acb1f51f01631ea642163612a520b448b843d7514dc31ccc6f140c0753

Attempts to fix

Roll back to 2.90.0 - success

We tried to roll back to 2.87.0, but our codebase would have required too many changes for that. We were able to roll back to 2.90.0, though, which is (interestingly) before several of the handlers were updated from Node16 to Node18.

When we rolled back to 2.90.0, the integration tests work fine.

Roll forward to 2.91.0 - success

Same as 2.90.0 - the tests work fine.

Roll forward to 2.92.0 - partial success

In https://github.com/aws/aws-cdk/releases/tag/v2.92.0, the custom-resources handler is bumped to use Node18 instead of Node16. That change creates the new asset hash a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5. This mostly worked; however, https://github.com/aws/aws-cdk/issues/26771 prevented us from fully testing the CDK construct for EKS.

Roll forward to 2.93.0 - success

In 2.93.0, we see the asset hash change from 3f579d6c1ab146cac713730c96809dd4a9c5d9750440fb835ab20fd6925e528c.zip -> 9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip. This release seems to work just fine, though the tests are still ongoing right now.

Roll forward to 2.94.0 - failure

It seems that the failure starts as soon as we hit the 2.94.0 release.

INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.e2277687077a2abf9ae1af1cc9565e6715e2ebb62f79ec53aa75a1af9298f642.zip'
INFRA-MYAPP-ClusterTest:  fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/inframyapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.8e18eb5caccd2617fb76e648fa6a35dc0ece98c4681942bc6861f41afdff6a1b.zip'

Roll back to 2.93.0 - success

Rolling back to 2.93.0 after the 2.94.0 failure immediately works… builds and integration tests pass again.

Expected Behavior

A few things here…

  1. Obviously, I don’t expect the zip files to be created empty, causing failed deploys.
  2. I would expect the files to be cleaned up or replaced when they are determined to be corrupt (a sketch of such a guard follows this list).
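
As a stopgap while the root cause is unknown, a pre-deploy guard along these lines could catch a corrupt asset before it ever gets published. This is only a sketch under our assumptions: that a corrupt asset is at or under 22 bytes (an empty archive) or lacks the standard zip local-file-header magic, and that assets are staged under cdk.out:

```ts
import * as fs from "fs";
import * as path from "path";

const EMPTY_ZIP_SIZE = 22; // size of a zip containing zero entries
const ZIP_MAGIC = Buffer.from("PK\x03\x04", "latin1"); // local file header of a non-empty zip

// Recursively walk a directory and return the paths of any .zip files
// that look empty or don't start with the expected magic bytes.
function findCorruptZips(dir: string): string[] {
  const bad: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      bad.push(...findCorruptZips(full));
    } else if (entry.name.endsWith(".zip")) {
      const size = fs.statSync(full).size;
      const head = Buffer.alloc(4);
      const fd = fs.openSync(full, "r");
      fs.readSync(fd, head, 0, 4, 0);
      fs.closeSync(fd);
      if (size <= EMPTY_ZIP_SIZE || !head.equals(ZIP_MAGIC)) {
        bad.push(full);
      }
    }
  }
  return bad;
}

const corrupt = findCorruptZips("cdk.out");
if (corrupt.length > 0) {
  console.error(`Corrupt zip assets detected:\n${corrupt.join("\n")}`);
  process.exit(1);
}
```

Running something like this between synth and deploy would turn the silent cache poisoning into a loud, immediate failure.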

Current Behavior

As far as we can tell, once the corrupt file is created, there are some situations where it is uploaded to S3 (thus poisoning the cache), and other situations where the upload fails outright.
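
Once the cache is poisoned, deleting the bad objects lets the next publish re-upload them, which is how we have been recovering manually. A hypothetical cleanup helper (the keys would come from a scan like the one sketched earlier; the bucket name is again an assumption):

```ts
import { S3Client, DeleteObjectCommand } from "@aws-sdk/client-s3";

// Delete the poisoned 22-byte assets so the next deploy re-publishes them.
async function purgeEmptyZips(bucket: string, keys: string[]): Promise<void> {
  const s3 = new S3Client({});
  for (const key of keys) {
    await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
    console.log(`deleted poisoned asset: ${key}`);
  }
}
```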

Reproduction Steps

Working on this… we don’t yet know exactly how to reproduce it.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.95.0+

Framework Version

No response

Node.js Version

18

OS

Linux and OSX

Language

Typescript

Language Version

No response

Other information

No response

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

@mrgrain, So… first, thank you for taking the time to respond, I really do appreciate it. Reading your comments, you are definitely right that I missed the note in the README, and it’s pretty explicit (though I have an edge case I’ve run into, and I’ll comment separately on it to see if you have any ideas). I think the integ-runner code is really critical in larger CDK environments for executing realistic tests, so we’ve worked really hard to adopt it as a default in most of our CDK projects. In fact, we’ve built a full GitHub Actions/PR-based workflow where integration tests are run by users, via PR comments, when they submit new PRs. With that type of setup, it’s really critical that the tests are reliable, so that failures are truly related to the user’s PR itself.

Improving the UX

Right off the bat… I think that if the integ-runner command were able to proactively verify that all the assets exist before beginning a test, it would dramatically improve things, and it would probably have saved me tens of hours of debugging and troubleshooting. I imagine it could use the lookup role to verify that all of the assets exist, and if any are missing, throw a big red error telling the user that they can’t do an “update workflow test”. Thoughts?
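
For illustration, here is roughly the check I have in mind, expressed as a local-filesystem variant rather than one using the lookup role: read each *.assets.json manifest in the snapshot directory and confirm that every referenced file asset actually exists before starting the test. The manifest shape shown is my reading of the cdk-assets file format, so treat the details as assumptions:

```ts
import * as fs from "fs";
import * as path from "path";

// Minimal slice of the *.assets.json schema we care about.
interface AssetManifest {
  files?: Record<string, { source: { path: string } }>;
}

function missingAssets(snapshotDir: string): string[] {
  const missing: string[] = [];
  const manifests = fs
    .readdirSync(snapshotDir)
    .filter((f) => f.endsWith(".assets.json"));
  for (const manifest of manifests) {
    const parsed: AssetManifest = JSON.parse(
      fs.readFileSync(path.join(snapshotDir, manifest), "utf8")
    );
    for (const [id, file] of Object.entries(parsed.files ?? {})) {
      // Asset source paths are relative to the assembly (snapshot) directory.
      if (!fs.existsSync(path.join(snapshotDir, file.source.path))) {
        missing.push(`${manifest}: asset ${id} -> ${file.source.path}`);
      }
    }
  }
  return missing;
}

// Hypothetical snapshot path, matching the layout from the logs above.
const missing = missingAssets("test/integ/constructs/xyz/integ.cluster.ts.snapshot");
if (missing.length > 0) {
  throw new Error(
    `Cannot run an update workflow test; missing assets:\n${missing.join("\n")}`
  );
}
```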

Snapshots in Git

I’m curious how you’ve seen this used in the past… when it comes to building small, dedicated CDK libraries that do one specific thing, I can imagine that storing the full snapshots isn’t really a big deal… but in our case we’re launching integration tests that spin up real Kubernetes clusters along with a dozen different Lambda functions. These functions and their assets get pretty big:

% du -sch cdk.out test
136M	cdk.out
 71M	test
207M	total

Do you realistically see people storing these in Git and updating them? Virtually every AWS CDK release changes the Lambda handler code in some way, which causes new hashes to be generated, new functions to be built, and new assets to be created.

I’m not complaining … just trying to figure out what the realistic pattern is here. Our NodeJS functions aren’t too big - but we have a couple of Python functions that get big. For example:

 16K    MYREPO-NativeClusterTest.assets.json
 48K    MYREPO-NativeClusterTest.template.json
 16K    MYREPONativeClusterTestCleanupStackA5C06CE2.nested.template.json
 40K    MYREPONativeClusterTestContinuousDeployment28A15EF4.nested.template.json
 80K    MYREPONativeClusterTestCorePluginsBB9AD3A8.nested.template.json
 12K    MYREPONativeClusterTestDns05AFFC71.nested.template.json
8.0K    MYREPONativeClusterTestKubeSystemNodesF64F789A.nested.template.json
 20K    MYREPONativeClusterTestNetworkPrep159B41F9.nested.template.json
 44K    MYREPONativeClusterTestOcean639A0FD8.nested.template.json
4.0K    MYREPONativeClusterTestRemoteManagementD093FD97.nested.template.json
 96K    MYREPONativeClusterTestSupplementalPlugins7C1CEFC9.nested.template.json
 16K    MYREPONativeClusterTestVpc42B5454F.nested.template.json
8.0K    MYREPONativeClusterTestndawseksNdKubectlProvider5DDA391D.nested.template.json
4.0K    IntegTestDefaultTestDeployAssertE3E7D2A4.assets.json
4.0K    IntegTestDefaultTestDeployAssertE3E7D2A4.template.json
 24K    asset.1471fa6f2876749a13de79989efc6651c9768d3173ef5904947e87504f8d7069
1.1M    asset.283efd6aefae7121bcf6bd25901fcb60ecd8b58bcd34cb8b91d8d8fc5322f62c
 16M    asset.3322b7049fb0ed2b7cbb644a2ada8d1116ff80c32dca89e6ada846b5de26f961.zip
 12K    asset.350497850828a0108f064a8cb783dd16d04637d20593411e21cc5b4f9e485cd6
4.0K    asset.4e26bf2d0a26f2097fb2b261f22bb51e3f6b4b52635777b1e54edbd8e2d58c35
4.1M    asset.6d93bc9532045758cbb4e2faa3a244d1154fc78d517cecfb295d2f07889d1259
 20K    asset.7382a0addb9f34974a1ea6c6c9b063882af874828f366f5c93b2b7b64db15c94
8.0K    asset.78b70ad373a624989fdc7740e7aa19700d82dfc386c4bc849803634716c8fa4a
4.4M    asset.aa10d0626ba6f3587e40157ecf0f5e0879088a68b2477bf0ef8eb74045a2439a
 30M    asset.bdb2015ec68b53161d29e5910113dcb0b789ba26659fcfdcddddf8256bde19ef.zip
8.0K    asset.be971704b52836a95da4dc35cbeb928f60b51bd5f7b01f03ac731e05cdfccbaf
8.0K    asset.dd5711540f04e06aa955d7f4862fc04e8cdea464cb590dae91ed2976bb78098e
4.0K    cdk.out
4.0K    integ.json
 88K    manifest.json
592K    tree.json
 57M    total

In this particular case, asset.aa10d0626ba6f3587e40157ecf0f5e0879088a68b2477bf0ef8eb74045a2439a is a 4.4MB NodeJS file, where the majority of that space must be used by imported libraries. The other big one is asset.bdb2015ec68b53161d29e5910113dcb0b789ba26659fcfdcddddf8256bde19ef.zip, which is the Kubectl/Helm package.

General thoughts on the Integ Runner

I think it’s an amazing tool… I wish it got more love. I’ve opened a bunch of issues on it over the last year (https://github.com/aws/aws-cdk/issues/27437, https://github.com/aws/aws-cdk/issues/22804, https://github.com/aws/aws-cdk/issues/22329, https://github.com/aws/aws-cdk/issues/27445, https://github.com/aws/aws-cdk/issues/28549)… they all fall into the theme of better documentation, better examples, and improved errors/warnings that help developers actually understand the root cause of failures.