aws-cdk: (custom-resources): empty onEvent handler zips being created, failing deploys
Describe the bug
We recently started to see our integration tests failing, even though deploys were succeeding. The failures on the integration tests look like this:
sent 1,788 bytes received 35 bytes 3,646.00 bytes/sec
total size is 1,680 speedup is 0.92
fatal: Not a valid object name integ
INFRA-MYAPP-ClusterTest: fail: ENOENT: no such file or directory, open '/home/runner/work/x/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip'
INFRA-MYAPP-ClusterTest: fail: ENOENT: no such file or directory, open '/home/runner/work/xxx/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.8e18eb5caccd2617fb76e648fa6a35dc0ece98c4681942bc6861f41afdff6a1b.zip'
INFRA-MYAPP-ClusterTest: fail: ENOENT: no such file or directory, open '/home/runner/work/xxx/xxx/test/integ/constructs/xyz/integ.cluster.ts.snapshot/asset.e2277687077a2abf9ae1af1cc9565e6715e2ebb62f79ec53aa75a1af9298f642.zip'
❌ Deployment failed: Error: Failed to publish asset a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5:current_account-current_region
at Deployments.publishSingleAsset (/home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:11458)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async Object.publishAsset (/home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:151474)
at async /home/runner/work/xxx/xxx/node_modules/aws-cdk/lib/index.js:446:136916
Failed to publish asset a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5:current_account-current_region
FAILED integ/constructs/xyz/integ.cluster-IntegTest/DefaultTest (undefined/us-east-1) 29.135s
Integration test failed: TypeError [ERR_STREAM_NULL_VALUES]: May not write null values to stream
When we then look in our S3 bucket, we find a series of 22-byte zip files. We saw this across three separate build attempts, all with fresh, empty cdk.out directories, and all after we had wiped the cached asset files from S3.
When we dug into it, it seems that these files are all related to the onEvent handlers for the custom-resource constructs. Going back in time a bit, it looks like these hash values show up at or around https://github.com/aws/aws-cdk/commit/a9ed64f2aa8014626857dfdfb33a823cd9cfd1fa#diff-8bf3c7acb1f51f01631ea642163612a520b448b843d7514dc31ccc6f140c0753…
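For context on what "empty" means here: a 22-byte zip is just the end-of-central-directory record, i.e. an archive with zero entries. A minimal sketch like the one below (my own debugging helper, nothing that exists in the CDK; the bucket name is an assumption based on the default bootstrap qualifier) could be used to surface those objects in the assets bucket:

```ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// A zip with no entries is exactly 22 bytes: just the end-of-central-directory record.
const EMPTY_ZIP_SIZE = 22;

// List every .zip object in the given assets bucket that is small enough to be an empty archive.
async function findEmptyAssetObjects(bucket: string): Promise<string[]> {
  const s3 = new S3Client({});
  const suspect: string[] = [];
  let token: string | undefined;
  do {
    const page = await s3.send(new ListObjectsV2Command({ Bucket: bucket, ContinuationToken: token }));
    for (const obj of page.Contents ?? []) {
      if (obj.Key?.endsWith('.zip') && (obj.Size ?? Infinity) <= EMPTY_ZIP_SIZE) {
        suspect.push(obj.Key);
      }
    }
    token = page.NextContinuationToken;
  } while (token);
  return suspect;
}

// The bucket name is an assumption based on the default bootstrap qualifier;
// substitute your own cdk-<qualifier>-assets-<account>-<region> bucket.
findEmptyAssetObjects('cdk-hnb659fds-assets-111111111111-us-east-1').then(console.log);
```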
Attempts to fix
Roll back to 2.90.0 - success
We tried to roll back to 2.87.0, but our codebase would have required too many changes for that. We were able to roll back to 2.90.0, which, interestingly, is before several of the handlers were updated from Node 16 to Node 18.
When we rolled back to 2.90.0, the integration tests worked fine.
Roll forward to 2.91.0 - success
Same as 2.90.0 - the tests work fine.
Roll forward to 2.92.0 - partial success
In https://github.com/aws/aws-cdk/releases/tag/v2.92.0, the custom-resources handler is bumped from Node 16 to Node 18. That change creates the new asset hash a3f66c60067b06b5d9d00094e9e817ee39dd7cb5c315c8c254f5f3c571959ce5. This mostly worked; however, https://github.com/aws/aws-cdk/issues/26771 prevented us from fully testing the CDK construct for EKS.
Roll forward to 2.93.0 - success
In 2.93.0, we see the asset hash change from 3f579d6c1ab146cac713730c96809dd4a9c5d9750440fb835ab20fd6925e528c.zip to 9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip. This release seems to work just fine, though the tests are still running right now.
Roll forward to 2.94.0 - failure
It seems that the failure starts as soon as we hit the 2.94.0 release.
INFRA-MYAPP-ClusterTest: fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.9202bb21d52e07810fc1da0f6acf2dcb75a40a43a9a2efbcfc9ae39535c6260c.zip'
INFRA-MYAPP-ClusterTest: fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/infra-myapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.e2277687077a2abf9ae1af1cc9565e6715e2ebb62f79ec53aa75a1af9298f642.zip'
INFRA-MYAPP-ClusterTest: fail: ENOENT: no such file or directory, open '/home/runner/work/infra-myapp/inframyapp/test/integ/constructs/aws-eks/integ.xx-cluster.ts.snapshot/asset.8e18eb5caccd2617fb76e648fa6a35dc0ece98c4681942bc6861f41afdff6a1b.zip'
Roll back to 2.93.0 - success
Rolling back to 2.93.0 after the 2.94.0 failure immediately works… builds and integration tests pass again.
Expected Behavior
A few things here…
- I obviously don’t expect the zip files to be created empty and cause problems.
- I would expect the files to be cleaned up or replaced when they are determined to be corrupt.
Current Behavior
As far as we can tell, once the corrupt file is created, there are some situations where it is uploaded to S3 (thus poisoning the cache), and other situations where the upload fails outright.
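One cheap guard (purely a sketch on my part, not an existing CDK hook) would be to scan cdk.out for suspiciously small asset zips right after synth and before any publishing step, so a corrupt file never reaches S3 in the first place:

```ts
import * as fs from 'fs';
import * as path from 'path';

// A zip with no entries is exactly 22 bytes.
const EMPTY_ZIP_SIZE = 22;

// Recursively walk a cloud assembly (e.g. cdk.out) and return any asset zip
// that is small enough to be an empty archive.
export function findEmptyAssetZips(dir: string): string[] {
  const empty: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      empty.push(...findEmptyAssetZips(full));
    } else if (entry.name.startsWith('asset.') && entry.name.endsWith('.zip')
        && fs.statSync(full).size <= EMPTY_ZIP_SIZE) {
      empty.push(full);
    }
  }
  return empty;
}

// Example: fail the build before `cdk deploy` or asset publishing ever runs.
// if (findEmptyAssetZips('cdk.out').length > 0) { process.exit(1); }
```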
Reproduction Steps
Working on this… we don’t yet know exactly how to reproduce it.
Possible Solution
No response
Additional Information/Context
No response
CDK CLI Version
2.95.0+
Framework Version
No response
Node.js Version
18
OS
Linux and OSX
Language
Typescript
Language Version
No response
Other information
No response
About this issue
- Original URL
- State: open
- Created 9 months ago
- Comments: 22 (10 by maintainers)
@mrgrain, So… first, thank you for taking the time to respond, I really do appreciate it. In reading your comments, you are definitely right that I missed the comment in the README, and it’s pretty explicit (though I have an edge case I’ve run into, and I’ll comment on it separately to see if you have any ideas). I think the integ-runner code is really critical in larger CDK environments for executing realistic tests, so we’ve worked really hard to adopt it as a default in most of our CDK projects. In fact, we’ve built a full GitHub Actions/PR-based workflow where integration tests are run by users when they submit new PRs, triggered via PR comments. With that type of setup, it’s really critical that the tests are reliable, so that failures are truly related to the user’s PR itself.
Improving the UX
Right off the bat… I think that if the integ-runner command was able to proactively verify that all the assets existed before beginning a test, it would have dramatically improved things and probably saved me tens of hours of debugging and troubleshooting. I imagine it could use the lookup role to verify that the resources (all the assets) exist, and if any are missing, it could throw a big red error telling the user that they can’t do an “update workflow test”. Thoughts?
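To make that concrete, here is a rough sketch of the kind of pre-flight check I mean. It is not an existing integ-runner feature, and for simplicity it only validates the asset sources on disk in the .snapshot directory (rather than using the lookup role against the account); the *.assets.json field names are my reading of the cloud-assembly asset manifest and should be treated as an assumption:

```ts
import * as fs from 'fs';
import * as path from 'path';

// Walk every *.assets.json manifest in a snapshot directory and report assets
// whose source is missing on disk or looks like an empty (<= 22 byte) zip.
export function verifySnapshotAssets(snapshotDir: string): string[] {
  const problems: string[] = [];
  const manifests = fs.readdirSync(snapshotDir).filter(f => f.endsWith('.assets.json'));
  for (const manifest of manifests) {
    const assets = JSON.parse(fs.readFileSync(path.join(snapshotDir, manifest), 'utf8'));
    for (const [hash, file] of Object.entries<any>(assets.files ?? {})) {
      const source = path.join(snapshotDir, file.source?.path ?? '');
      if (!fs.existsSync(source)) {
        problems.push(`${manifest}: asset ${hash} is missing (${source})`);
        continue;
      }
      const stat = fs.statSync(source);
      if (stat.isFile() && stat.size <= 22) {
        problems.push(`${manifest}: asset ${hash} looks like an empty zip (${source})`);
      }
    }
  }
  return problems;
}

// Usage: fail fast before the update-workflow run ever starts.
// const problems = verifySnapshotAssets('test/integ/constructs/xyz/integ.cluster.ts.snapshot');
// if (problems.length) throw new Error(problems.join('\n'));
```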
Snapshots in Git
I’m curious how you’ve seen this used in the past… when it comes to building small, dedicated CDK libraries that do one specific thing, I can imagine that storing the full snapshots isn’t really a big deal… but in our case, we’re launching integration tests that spin up real Kubernetes clusters along with a dozen different Lambda functions. These functions and their assets get pretty big:
Do you realistically see people storing these in Git and updating them? Virtually every AWS CDK release changes the Lambda handler code in some way, which generates new hashes, causing new functions to be built and new assets to be created.
I’m not complaining … just trying to figure out what the realistic pattern is here. Our NodeJS functions aren’t too big - but we have a couple of Python functions that get big. For example:
In this particular case, asset.aa10d0626ba6f3587e40157ecf0f5e0879088a68b2477bf0ef8eb74045a2439a is a 4.4MB NodeJS file, where the majority of that space must be used by imported libraries. Then the other one is asset.bdb2015ec68b53161d29e5910113dcb0b789ba26659fcfdcddddf8256bde19ef.zip, which is the Kubectl/Helm package.
General thoughts on the Integ Runner
I think it’s an amazing tool… I wish it got more love. I’ve opened a bunch of issues on it over the last year (https://github.com/aws/aws-cdk/issues/27437, https://github.com/aws/aws-cdk/issues/22804, https://github.com/aws/aws-cdk/issues/22329, https://github.com/aws/aws-cdk/issues/27445, https://github.com/aws/aws-cdk/issues/28549); they all fall into the same theme of better documentation, better examples, and improved errors/warnings that help developers actually understand the root cause of failures.