jsii: Imports of large Python library (aws-cdk-lib) extremely slow

šŸ› Bug Report

Affected Languages

  • TypeScript or JavaScript
  • Python
  • Java
  • .NET (C#, F#, …)
  • Go

General Information

  • JSII Version: N/A
  • Platform: MacOS

What is the problem?

Originally reported as https://github.com/aws/aws-cdk/issues/19000


I noticed that simple commands such as cdk ls would take very long (sometimes above 20 seconds, rarely much less) to complete. This is in a project created using cdk init sample-app --language python.

Reproduction Steps

# setup:
$ cdk init sample-app --language python
$ source .venv/bin/activate
$ pip install -r requirements.txt
# reproducing issue:
$ time python -c "import aws_cdk as cdk"
python -c "import aws_cdk as cdk"  5.74s user 4.20s system 58% cpu 17.113 total

What did you expect to happen?

A simple import should not take longer than 1 second (ideally even less, though I understand there are constraints that make that difficult here).

What actually happened?

As seen above, it takes ~17 seconds. This is quite consistent; however, the first or second time after initializing the project, it can take 20-25 seconds.

CDK CLI Version

2.12.0 (build c9786db)

Framework Version

Node.js Version

v16.14.0

OS

MacOS 11.6.2 (20G314) - Intel Macbook

Language

Python

Language Version

Python (3.8.9, 3.9.10)

Other information

To better understand where the time went, I debugged the issue with some print statements:

āÆ time python -c "import aws_cdk as cdk"
process.py start: 0.00
self._process.args=['node', '--max-old-space-size=4069', '/var/folders/9j/cszr_w557ns8q4bk2ctzygx40000gp/T/tmp9_s6gne5/bin/jsii-runtime.js']
>>> b'{"hello":"@jsii/runtime@1.54.0"}\n'
process.py before self._next_message(): 0.25
>>> b'{"ok":{"assembly":"constructs","types":10}}\n'
process.py after self._next_message(): 0.33
process.py before self._next_message(): 0.33
>>> b'{"ok":{"assembly":"aws-cdk-lib","types":7639}}\n'
process.py after self._next_message(): 12.95
python -c "import aws_cdk as cdk"  5.72s user 4.21s system 58% cpu 16.923 total

The offending method seems to be (from Python's point of view) jsii/_kernel/providers/process.py: _NodeProcess.send

I understand that this communicates with a node process, so I assume it to be something going on there that takes very long.
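A minimal sketch of the kind of print-statement instrumentation used to produce the timings above (the function name and labels are hypothetical; the real code patches methods inside jsii's process.py):

```python
import time

_T0 = time.perf_counter()  # stand-in for "process start"

def timed(label, fn):
    """Wrap a callable and log seconds elapsed since process start
    before and after each call, mimicking the debug output above."""
    def wrapper(*args, **kwargs):
        print(f"{label} before: {time.perf_counter() - _T0:.2f}")
        result = fn(*args, **kwargs)
        print(f"{label} after: {time.perf_counter() - _T0:.2f}")
        return result
    return wrapper

# Hypothetical usage against a jsii process object:
#   process._next_message = timed("process.py self._next_message()", process._next_message)
```

Because the wrapper only logs wall-clock time around the call, a large "before/after" gap points at time spent inside the call, which here means waiting on the node child process.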

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 10
  • Comments: 33 (8 by maintainers)

Most upvoted comments

We spent some time debugging the CDK load process. It is somewhat unwieldy due to two major issues/design decisions that contribute to the problem. I will outline it here for the general public, as I assume that AWS people decided on these tradeoffs knowingly.

The first issue is with the jsii distribution itself. Unlike Python modules, where the source code is already present in the Python library directory, a jsii package comes as a tarball containing the TypeScript files. On each run of the Python program, the tarball is extracted to a temporary directory and executed from there. This is further slowed down by the fact that the file is gzipped, so CPU time must be expended to decompress. Finally, the decompression and unpacking are performed in JavaScript, which is much slower than the native 'tar' command.

We see no user benefit in having the tarball decompressed on each program run. Instead, we worked around this by unpacking the files once and re-using the same path for each application run. Getting rid of the repeated unpacking already saves almost half of the load time; we were able to get to about 4 seconds instead of 7.
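The workaround described above can be sketched roughly like this (a hypothetical helper, not jsii's actual code): key the cache directory on the tarball's SHA-256, so a changed archive gets a fresh extraction while repeated runs reuse the existing one.

```python
import hashlib
import tarfile
from pathlib import Path

def extract_cached(tarball: Path, cache_root: Path) -> Path:
    """Unpack `tarball` once and reuse the result on later runs.

    Sketch of the "unpack once, reuse the path" workaround: the cache
    directory is keyed by the tarball's SHA-256, so a modified (or
    tampered-with) archive no longer matches a stale extraction.
    """
    digest = hashlib.sha256(tarball.read_bytes()).hexdigest()
    dest = cache_root / digest
    done_marker = dest / ".extraction-complete"
    if not done_marker.exists():  # first run, or a previous run was interrupted
        dest.mkdir(parents=True, exist_ok=True)
        with tarfile.open(tarball, "r:gz") as tf:
            tf.extractall(dest)
        done_marker.touch()
    return dest
```

The marker file guards against a half-finished extraction being reused after an interrupted first run.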

The other issue is the fact that aws_cdk is now packaged as a myriad of modules, one for each AWS service (EC2, SSM, …), shipped altogether. When jsii initializes, it loads and interprets all the code in one go, as opposed to lazily loading what is actually required. This would be advantageous if the majority of the modules were actually required, but in a realistic project only a handful of the 200+ modules are used.

This results in a terrible waste, as we are actually loading thousands of callable methods that are never called. Unfortunately, working around that isn't very easy, as it requires editing the huge (50 MB+) .jsii manifest, which we couldn't do by hand. However, we suspect that this factor is responsible for the majority of the remaining load time.
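For illustration, lazy submodule loading of the kind described here can be done in pure Python with a module-level __getattr__ (PEP 562). This is a sketch, not jsii's actual mechanism; stdlib modules stand in for the 200+ CDK service modules, and the package is built dynamically so the example is self-contained:

```python
import importlib
import sys
import types

# Map of attribute name -> module to import on first access.
# json/decimal are stand-ins for aws_ec2/aws_ssm submodules.
_SUBMODULES = {"aws_ec2": "json", "aws_ssm": "decimal"}

pkg = types.ModuleType("fake_cdk")

def _lazy_getattr(name):
    target = _SUBMODULES.get(name)
    if target is None:
        raise AttributeError(name)
    mod = importlib.import_module(target)  # imported on first access only
    setattr(pkg, name, mod)                # cache so later accesses skip __getattr__
    return mod

# PEP 562: Python consults a module's __getattr__ when normal lookup fails.
pkg.__getattr__ = _lazy_getattr
sys.modules["fake_cdk"] = pkg

import fake_cdk                           # cheap: no submodule loaded yet
print(fake_cdk.aws_ec2.dumps({"ok": True}))  # first touch triggers the import
```

In a real package this would live in __init__.py, so importing the package stays cheap and only the services actually touched pay their import cost.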

I'd like to stress that I find it not acceptable to add 7+ seconds to any program's load time without a very good justification. It strikes me as odd that this was not noticed or given weight during internal testing. I do like working with CDK, and I am open to discussing possible solutions, sooner rather than later.

I have dug deeper on this subject and found out that (sample size = 1):

  • ( 65%) 3.331s => tar.extract (while macOS's tar zxf takes 1.764s)
  • ( 13%) 0.695s => loadAssemblyFromPath (mostly from parsing a massive JSON document)
  • ( 21%) 1.106s => require (consistent with node -e 'require("aws-cdk-lib")')
  • … some milliseconds here and there of other stuff …
  • (100%) 5.160s Total

From this perspective:

  1. Un-tar is definitely a top contributor (gunzip, however, is not: I ran experiments, and decompressing the tarball ahead of time does not make much of a difference)
  2. The very large jsii assembly processing is heavy, but pivoting to an alternate format can easily become a breaking change; it can be improved, but it'll take time & care
  3. The library takes a solid second to load in "pure Node" due to the sheer amount of code that needs parsing.

Specifically on some of @du291's claims (by the way, thank you for the detailed writeup):

This is further slowed down by the fact that the file is gzipped, so CPU time must be expended to decompress.

My experiments demonstrated that the gz pass is immaterial when using macOS's tar command (1.764s vs 1.720s, so 44ms, less than 2%). I have a sample size of 1, however, and tested this on only one platform. If you have data from other platforms/environments where the gap is more substantial, please let me know so I can add them to my benchmarks.

Finally, the decompression and unpacking are performed in JavaScript, which is much slower than the native 'tar' command.

I initially did not believe the difference would be big enough to be material. When we chose this mechanism about 4 or 5 years ago, we had run some benchmarks, and the npm library performed similarly to the tar command. But it turns out the packages were much smaller back then, and the JS version is now about 90% slower.

We see no user benefit in having the tarball decompressed on each program run.

Unpacking on every run was deemed safer (it guarantees the files on disk have not been tampered with, removes some race condition risks, etc.). It also consumes less disk space at rest.

Instead, we worked around this by unpacking the files once and re-using the same path for each application run.

Given the research above that confirms your findings, I guess we can easily improve the situation a lot by doing something similar. I don't know how you went about it on your end, but if it makes sense and you're able and willing to file a PR with what you have, this might give us a head start.

This would be advantageous if the majority of the modules were actually required, but in a realistic project only a handful of the 200+ modules are used.

This is correct. I would note, though, that in this particular area, using TypeScript gives you no edge. The way aws-cdk-lib is architected means you can't quite avoid loading the entire library (since App, Stack, etc. are at the package root, and loading that causes all other submodules to be loaded as well).

I'd like to stress that I find it not acceptable to add 7+ seconds to any program's load time without a very good justification.

It's obviously hard to disagree with this statement. In an ideal world, the load performance would be in the same ballpark regardless of your language of choice (after all, the premise of jsii is that you should be free to choose your language independently of other considerations), and I will treat any excessive "jsii tax" as a defect.

Immediate next steps here are:

  • Short term: Remove tar.extract from the hot path
  • The CDK team is working on reducing the module size (RFC 39); we may consider stretching the scope of this work with tactical items that could improve the require time for aws-cdk-lib across all languages, if we can identify opportunities there

Longer term (these are non-trivial and risk incurring breaking changes):

  • Design a strategy that would allow avoiding eager loading of the entire library
  • Remove the need to load/parse the jsii assembly file at runtime

No, still slow. I'm currently running these commands in a Linux VM (hosted in Google Cloud; can't award AWS for this 😄).

Yes, I agree, most of the problems are heavily magnified by the library size. I think jsii has hit a scaling problem. The discussion should be: do we want to ship all these 200+ modules together? Then we need to massively optimize it. Or do we want people to hand-pick the 10 they use, like in CDK v1, so that none of this matters? I only speak for myself, but I didn't mind having 10 imports and pip requirements.

To be fair, part of the issue is that the CDK is a super-package at both the package and the code level. The CDK could still be a "super" package (e.g., only one pip/npm install to get all the things you need) without being as tightly linked together as it is today (import only what you need). The download size would still be large, but jsii compression and the proposed Lambda layer changes make that more tolerable as a one-time cost.

Focusing on Python: using cProfile and pstats (on my laptop), the profile shows this suspicious entry:

Tue Feb 22 15:50:17 2022    profile

         1908213 function calls (1862684 primitive calls) in 6.822 seconds

   Ordered by: internal time
   List reduced from 9880 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        6    4.863    0.811    4.863    0.811 {method 'readline' of '_io.BufferedReader' objects}

That is ~70% of the runtime accounted for by reading from the child process's output stream. That might be time spent waiting for the node process to return data, but I guess the Python-side profile doesn't quite tell the full picture here.
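This interpretation is easy to confirm: readline on a pipe blocks until the peer writes, so the child's work is billed to readline in the parent's profile. A small self-contained demonstration, with a sleeping Python child standing in for the jsii node runtime:

```python
import subprocess
import sys
import time

# Child "works" for 0.3s before answering, like the node kernel loading
# an assembly before replying to the Python side.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(0.3); print('done')"],
    stdout=subprocess.PIPE,
)

start = time.perf_counter()
line = child.stdout.readline()  # blocks on the pipe while the child works
elapsed = time.perf_counter() - start
child.wait()

# elapsed is at least ~0.3s even though the parent did no work itself:
# in a cProfile run, all of it would show up under 'readline'.
print(line.strip(), f"after {elapsed:.1f}s")
```

So the 4.863s of readline tottime above is consistent with the node process spending that time loading and interpreting aws-cdk-lib, not with the Python side doing expensive I/O.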