aws-cdk: cli: Socket timed out without establishing a connection when --asset-parallelism=true
Describe the bug
I have between 20 and 50 Node.js Lambda functions in a single stack. I update their dependencies and deploy with CDK.
Lately I have not been able to deploy updates. I get the following error when I deploy.
current credentials could not be used to assume 'arn:aws:iam::******:role/cdk-hnb659fds-lookup-role-******-us-east-1', but are for the right account. Proceeding anyway.
(To get rid of this warning, please upgrade to bootstrap version >= 8)
current credentials could not be used to assume 'arn:aws:iam::******:role/cdk-hnb659fds-file-publishing-role-******-us-east-1', but are for the right account. Proceeding anyway.
current credentials could not be used to assume 'arn:aws:iam::******:role/cdk-hnb659fds-file-publishing-role-******-us-east-1', but are for the right account. Proceeding anyway.
[9%] fail: Socket timed out without establishing a connection
[18%] fail: Socket timed out without establishing a connection
I keep retrying; sometimes it goes through, but most of the time it doesn't. Only stacks with a small number of Lambda functions sometimes get deployed. Stacks with a large number of Lambda functions fail 100% of the time.
Expected Behavior
I expected it to deploy regardless of the number of Lambda functions in the stack. It used to deploy without any problem.
Current Behavior
current credentials could not be used to assume 'arn:aws:iam::******:role/cdk-hnb659fds-lookup-role-******-us-east-1', but are for the right account. Proceeding anyway.
(To get rid of this warning, please upgrade to bootstrap version >= 8)
I don't know how to upgrade the bootstrap version. I ran cdk bootstrap multiple times and it reports no changes.
Reproduction Steps
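import { Architecture, Runtime } from 'aws-cdk-lib/aws-lambda';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

// Inside the stack's constructor; one of the 20-50 functions in the stack.
const testSignUpFn = new NodejsFunction(this, 'testSignUpNodeJS', {
  runtime: Runtime.NODEJS_14_X,
  entry: `${__dirname}/../lambda-fns/sign-up/index.ts`,
  handler: 'signUp',
  architecture: Architecture.ARM_64,
  memorySize: 1024,
});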
It was working before but suddenly stopped working.
Possible Solution
No response
Additional Information/Context
No response
CDK CLI Version
2.20.0 (build 738ef49)
Framework Version
No response
Node.js Version
v16.14.2
OS
Ubuntu 20.04 on WSL 2
Language
Typescript
Language Version
~3.9.7
Other information
No response
About this issue
- State: open
- Created 2 years ago
- Reactions: 3
- Comments: 18 (2 by maintainers)
This does appear to be related to the asset parallelism feature. Running a deployment with --asset-parallelism=false succeeded.
Without --asset-parallelism=false, the stack failed with the following error:
Call failed: listObjectsV2({"Bucket":"cdk-hnb659fds-assets-ACCOUNT_ID-eu-west-2","Prefix":"0936406e22fea26017ecca536fcbdc550936406e22fea26017ecca536fcbdc55.zip","MaxKeys":1}) => Socket timed out without establishing a connection (code=TimeoutError)
There are only four assets in the bucket and none of them are over 50KB.
System information:
OS: Ubuntu 20.04
NodeJS Version: v16.3.0
CDK version: 2.51.1
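For pipelines that need this workaround on every run, a small wrapper script can pin the flag in one place. This is only a sketch: `buildDeployArgs` and `deploy` are hypothetical helper names, and it assumes `npx cdk` resolves to the project's CDK CLI. The only CDK option used is the `--asset-parallelism` flag mentioned above.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: builds the argument list for `cdk deploy`,
// making the asset parallelism setting explicit.
function buildDeployArgs(stacks: string[], assetParallelism: boolean): string[] {
  return ["deploy", ...stacks, `--asset-parallelism=${assetParallelism}`];
}

// Hypothetical wrapper: runs the deploy with asset parallelism disabled
// and returns the CLI's exit code (1 if the process could not be started).
// Assumption: `npx cdk` resolves to the CDK CLI installed in the project.
function deploy(stacks: string[]): number {
  const result = spawnSync("npx", ["cdk", ...buildDeployArgs(stacks, false)], {
    stdio: "inherit",
  });
  return result.status ?? 1;
}
```

Calling `deploy(["MyStack"])` (with a placeholder stack name) would run `npx cdk deploy MyStack --asset-parallelism=false`.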
Turns out our issue was caused by setting NODE_OPTIONS=--enable-source-maps in our deployment pipeline.
CDK is compiled into a single 28 MB .js file, accompanied by a 58 MB source map. This causes excessive load, especially due to the high parallelism that CDK uses. I patched out all the unqueued IO processes and replaced all the hardcoded parallelization values with require("os").cpus().length. This resolved our timeouts and we were able to deploy again.
Soon after, we realized that deployment performance was dramatically improved by upgrading to Node@20. This is due to a change in Node@19.6 that introduces caching for parsed source maps, which resolves this whole problem entirely (for us). Previously, we ran Node@18 LTS, which was also the highest version supported by CDK at the time.
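The actual patch is not shown in this thread. As a rough illustration of the idea of queueing work and capping it at the number of CPU cores, a bounded-concurrency helper might look like the sketch below. `mapWithLimit` and `DEFAULT_LIMIT` are hypothetical names and are not part of CDK.

```typescript
import * as os from "node:os";

// Runs `worker` over `items` with at most `limit` tasks in flight at once.
// Results are returned in the same order as `items`.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  // Never start more runners than items, and always start at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Assumption: one in-flight task per CPU core, mirroring the
// require("os").cpus().length value mentioned above.
const DEFAULT_LIMIT = os.cpus().length;
```

A caller would pass its list of uploads or API calls as `items` and `DEFAULT_LIMIT` as `limit`, instead of starting every promise at once.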
I stand by my point that the way CDK handles IO is ridiculous. I also think bundling a NodeJS module into a single 28 MB file, with a 58 MB source map is ridiculous.
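If the pipeline needs --enable-source-maps for other steps, one option is to drop it only for the CDK invocation on older Node versions. The following is a sketch under assumptions: the helper names are hypothetical, NODE_OPTIONS is assumed to be a plain whitespace-separated list without quoted values, and the 19.6 cutoff is taken from the comment above rather than verified independently.

```typescript
// Returns NODE_OPTIONS with `--enable-source-maps` removed.
// Assumption: options are whitespace-separated and contain no quoted values.
function withoutSourceMaps(nodeOptions: string | undefined): string {
  return (nodeOptions ?? "")
    .split(/\s+/)
    .filter((opt) => opt !== "" && opt !== "--enable-source-maps")
    .join(" ");
}

// True when the Node version is older than 19.6, the release the comment
// above credits with caching parsed source maps.
function predatesSourceMapCache(version: string): boolean {
  const [major = 0, minor = 0] = version.replace(/^v/, "").split(".").map(Number);
  return major < 19 || (major === 19 && minor < 6);
}

// Hypothetical helper: builds the environment for the CDK child process,
// stripping the flag only on Node versions older than 19.6.
function envForCdk(
  env: Record<string, string | undefined>,
  nodeVersion: string,
): Record<string, string | undefined> {
  if (!predatesSourceMapCache(nodeVersion)) {
    return env;
  }
  return { ...env, NODE_OPTIONS: withoutSourceMaps(env["NODE_OPTIONS"]) };
}
```

The result of `envForCdk(process.env, process.version)` could then be passed as the `env` option when spawning the CDK CLI from a deploy script.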
As Node@18 is also the latest runtime supported by AWS Lambda, be cautious when using --enable-source-maps at runtime, because similar performance issues can be observed there, especially during exception handling.
P.S.: The reason it worked for us locally was that nobody set --enable-source-maps locally, or people were already on Node@20 locally.
I am now facing this issue regularly on 2.39.1. When I enable a VPN and deploy again, the error disappears, so it looks like this is somehow related to establishing the connection.
We conducted further research into this. It seems that what CDK calls “parallelism” is just waiting for multiple promises on the same single thread; there is no work happening in parallel at all. Combined with the extremely poor single-core performance of the GitHub Actions runner fleet, you end up with a fully saturated core for the entire runtime of your pipeline, regardless of how many cores you give it.
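To illustrate the single-thread point in general terms (this is plain Node.js behavior, not CDK code): awaiting several promises with Promise.all does not spread CPU-bound JavaScript across cores. Network I/O can still overlap while the thread is idle, but any synchronous work runs one task at a time. A minimal sketch, with hypothetical function names:

```typescript
// CPU-bound task with no `await` inside: once started, it runs to
// completion before any other JavaScript on the same thread can run.
async function cpuBoundTask(id: number, events: string[]): Promise<number> {
  events.push(`start ${id}`);
  let sum = 0;
  for (let i = 0; i < 1_000_000; i++) {
    sum += i % 7;
  }
  events.push(`end ${id}`);
  return sum;
}

// Starts three tasks "in parallel" and returns the order in which they
// actually started and finished.
async function runTasks(): Promise<string[]> {
  const events: string[] = [];
  await Promise.all([
    cpuBoundTask(0, events),
    cpuBoundTask(1, events),
    cpuBoundTask(2, events),
  ]);
  return events;
}
```

`runTasks()` resolves to `["start 0", "end 0", "start 1", "end 1", "start 2", "end 2"]`: each task finishes before the next one starts, so extra cores do not help with this kind of work.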
When I asked AWS reps about this, they told me that using the public runner fleet is a bad choice to begin with. You probably want to invest in some fat self-hosted runner with a single 5GHz core.
I’m pressing our client to move away from CDK ASAP, but we will likely solve this problem with money in the mid-term. This is not a good product.
We also have to use the --asset-parallelism=false workaround to be able to deploy at all. With 2.83, a new parallelism feature was introduced to improve performance. Now our deployments are entirely broken, regardless of --asset-parallelism.
In general, a real solution for the underlying issue would be appreciated.
In case it helps, we only see the problematic behavior when deploying from GitHub Actions. If we run the same deploy locally, it completes dramatically faster and without issues. So far, all our research into environment differences has been fruitless.