babel: babel-register cache grows infinitely and breaks v8

Choose one: is this a bug report or feature request? a bug

Expected Behavior

By default, babel-register writes a cache to .babel.json in the user’s home directory. Judging by ./babel-register/lib/cache.js, this cache is completely unmanaged: nothing ever expires entries or limits its size. The cache should manage itself to avoid growing to an extremely large size.

Current Behavior

I started experiencing v8 crashes when running mocha tests using --compilers js:babel-core/register as below:

<--- Last few GCs --->

   82518 ms: Mark-sweep 807.1 (1039.7) -> 802.3 (1038.7) MB, 149.2 / 0.0 ms [allocation failure] [GC in old space requested].
   82668 ms: Mark-sweep 802.3 (1038.7) -> 802.3 (1036.7) MB, 150.6 / 0.0 ms [allocation failure] [GC in old space requested].
   82838 ms: Mark-sweep 802.3 (1036.7) -> 802.2 (993.7) MB, 169.7 / 0.0 ms [last resort gc].
   82989 ms: Mark-sweep 802.2 (993.7) -> 802.2 (982.7) MB, 150.6 / 0.0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0000024EE58CFB61 <JS Object>
    1: SparseJoinWithSeparatorJS(aka SparseJoinWithSeparatorJS) [native array.js:~75] [pc=000002B8298FC057] (this=0000024EE5804381 <undefined>,w=0000011715C4D061 <JS Array[7440]>,F=000003681BBC8B19 <JS Array[7440]>,x=7440,I=0000024EE58B46F1 <JS Function ConvertToString (SharedFunctionInfo 0000024EE5852DC9)>,J=000003681BBC8AD9 <String[4]\: ,\n  >)
    2: DoJoin(aka DoJoin) [native array.js:137...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

I traced the crashes to babel-register/lib/cache.js calling JSON.stringify on the cache object:

  try {
    serialised = (0, _stringify2.default)(data, null, "  ");
  } catch (err) {
    // ...
  }
My .babel.json was over 200 megabytes. Deleting it immediately resolved the problem.

Possible Solution

  • cache should periodically expire old things and have a maximum size
  • cache could be implemented using some kind of simple database that’s more efficient than reading the entire cache into memory & rewriting it at the end of a session
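A minimal sketch of the first bullet, assuming a hypothetical size limit and per-entry lastUsed timestamps (neither exists in babel-register today):

```javascript
// Hypothetical limit; babel-register has no such option today.
const MAX_CACHE_BYTES = 50 * 1024 * 1024;

// Drop least-recently-used entries until the serialized cache fits
// under the limit. Assumes each entry records a lastUsed timestamp
// whenever it is read or written.
function pruneCache(cache, maxBytes = MAX_CACHE_BYTES) {
  if (Buffer.byteLength(JSON.stringify(cache)) <= maxBytes) return cache;

  // Most recently used first.
  const entries = Object.entries(cache)
    .sort(([, a], [, b]) => b.lastUsed - a.lastUsed);

  const pruned = {};
  let size = 2; // Two bytes for the surrounding "{}".
  for (const [key, value] of entries) {
    const entrySize = Buffer.byteLength(JSON.stringify({ [key]: value }));
    if (size + entrySize > maxBytes) break;
    pruned[key] = value;
    size += entrySize;
  }
  return pruned;
}
```

The byte accounting here is approximate (it ignores the commas between entries), but it bounds the file size without reading anything extra into memory.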

Context

It prevents inline transpilation from working properly, and performance suffers significantly as the cache grows, since every operation requires reading and rewriting a huge file.

Because it’s very difficult to trace the source of a v8 crash, this is a rather insidious bug. There is at least one bug report in an unrelated package that is almost certainly this issue:

https://github.com/caolan/async/issues/1311

This primarily affects people who run large test suites with babel-register in a single environment that is never purged (e.g. a dev workstation). Even if the crash itself doesn’t manifest often, never pruning the cache has real performance and stability implications for a large number of users.

Your Environment

Windows 10, Node 6.10.2, npm 4.2.0, Babel 6.18.2

About this issue

  • Original URL
  • State: open
  • Created 7 years ago
  • Reactions: 10
  • Comments: 31 (14 by maintainers)

Most upvoted comments

@liuxingbaoyu

I tried to run your sample code with the value a being a large object instead of a long string. It turns out JSON.stringify has better performance this time.

const v8 = require('v8');

console.time("init");
var l = Math.pow(2, 22) - 16 - 1;
var a = Array.from({ length: l }, () => ({ x: 0 }));
v8.deserialize(v8.serialize(a)); // Force the object to be initialized.
console.timeEnd("init");

global?.gc()
console.log("used_heap_size", v8.getHeapStatistics().used_heap_size / 1024 / 1024);

console.time("v8");
v8.deserialize(v8.serialize(a));
console.timeEnd("v8");

global?.gc()

console.time("JSON");
JSON.parse(JSON.stringify(a));
console.timeEnd("JSON");

node --expose_gc main.js

init: 4.638s
used_heap_size 466.04193115234375
v8: 3.916s
JSON: 1.736s

v8.serialize is implemented on top of Buffer, so the serialized data does not live in the v8 heap.

When serializing 255 MB of text with --max-old-space-size=300, v8.serialize works fine, while JSON.stringify OOMs.

A 255 MB cache is big enough, so it’s a good short-term solution.

Unless we are going to rewrite the caching system soon.

FYI in babel 7, the cache should go in node_modules/.cache now via findCacheDir (as mentioned above) so it would be per directory

@jamietre There are definitely a number of issues with the caching as it currently exists. I’m not sure if this is why you’re running into it, but with babel 6, all projects, run in all environments (NODE_ENV=mocha/development/production), share a single file. Splitting the files up by environment happened here: https://github.com/babel/babel/pull/5411, and using a location specific to each project was added here: https://github.com/babel/babel/pull/5669. These should “solve” the issue in practice (for example, at my job we have a ton of modules, some very large, and these fixes solved the immediate issue without having to delete .babel.json every so often). They are of course just stopgaps and won’t completely address the problem; you’ll still be able to recreate it if you really try.

I get the impression there probably won’t be much work on improving the cache in any major way until a decision about how to unify it with the babel-loader caching, and I think there is some desire to standardize around a caching strategy that can be used by other open source libs like ava. Here’s some background https://github.com/babel/babel/issues/5372.

In the short term, here’s what we’ve done at work to stop the bleeding… turns out deleting your .babel.json file every couple of days wasn’t a satisfactory suggestion for most folks 😉

my-babel-register.js

const findCacheDir = require('find-cache-dir');

const env = process.env.BABEL_ENV || process.env.NODE_ENV || 'development';
process.env.BABEL_CACHE_PATH = process.env.BABEL_CACHE_PATH || findCacheDir({ name: 'babel-register', thunk: true })(`${env}.json`);
require('babel-register')({
    ...
});

Then just call this file instead of babel-register.

Good luck!

const v8 = require('v8')

console.time("init");
var a = '"'.repeat(Math.pow(2, 28) - 16 - 1);
var b = 'a'.repeat(Math.pow(2, 28) - 16 - 1);
v8.deserialize(v8.serialize(a)); // Force the strings to be initialized.
v8.deserialize(v8.serialize(b));
console.timeEnd("init");

global?.gc()
console.log("used_heap_size", v8.getHeapStatistics().used_heap_size / 1024 / 1024);

console.time("v8");
v8.deserialize(v8.serialize(a));
console.timeEnd("v8");

global?.gc()

console.time("JSON");
JSON.parse(JSON.stringify(a));
console.timeEnd("JSON");

global?.gc()

console.time("v8");
v8.deserialize(v8.serialize(b));
console.timeEnd("v8");

global?.gc()

console.time("JSON");
JSON.parse(JSON.stringify(b));
console.timeEnd("JSON");

node --expose_gc main.js

init: 627.644ms
used_heap_size 515.4688186645508
v8: 257.494ms
JSON: 4.113s
v8: 267.619ms
JSON: 1.808s

It looks like the performance boost is amazing!

😃

Since the state of the @babel/register cache is not going to improve any time soon, and our project depends heavily on @babel/register performance, I took the time today to overhaul the caching system. It works well even with hundreds of transformed files; in my first experiments, runtime went from several minutes down to several seconds.

Code

WARNING: This is not a proper fork. I know. Bad. Please see notes below.

How to test it?

The quick-and-dirty approach is to:

  • Make sure you have the same version as the one in package.json.
  • Open the node_modules/@babel/register/lib folder.
  • Replace node.js and cache.js.
  • Run it, test it, and leave feedback here.

Of course this is just a hacky, temporary solution. If there is interest, I can put together a patch and even a small patch script.

Some notes

  1. It now saves each file’s transform output individually, in a format that is easier to read than plain JSON.
  2. The cache directory also encodes the env (via the undocumented babel.getEnv).
  3. The relative filepath is the same as the original file’s. This would also make it possible to move cached files between systems (addressing one of @pwmckenna 's concerns).
  4. Each file also contains a cacheKey for validation.
  5. I added some cache-miss debugging options. If a cache miss is due to different options, it even uses some naive heuristics to make it more obvious how they differ. If the team likes it, we can formalize it further and make it customizable via opts and/or env.

Big warning

I am in a hurry, so I just copy+pasted a mix of original (src) and compiled (lib) source files and went off of that. I did not want to work my way through the whole build process, and it was also important that it be available within our project ASAP; hence the bad copy-and-paste decision. This means:

  • no FORK
  • no tests
  • uglified code (due to mix of lib and src code; will need to touch it up a bit before the PR)

However, if someone helps with setting up a FORK and adds tests, and the team agrees with the approach, I am sure we can get the PR out within an hour.

Any feedback is welcome.

I would hope it’s immediately obvious to anyone that one big fat JSON file is not a viable long-term strategy for fast startup.

@hzoo My job has given me the rest of the week to upgrade to babel7 and try to improve the caching for our use cases. Outside of just creating benchmarks and actually implementing a faster cache, are there things that we need to keep in mind? I know there was some talk of using the same caching logic as another project (ava or jest or other?). Is there still a desire to share that logic or is it acceptable to have a babel specific implementation? Currently these are the things I’m going to try:

  1. File-by-file cache instead of one big file. I made a PR for this that I eventually closed. Going to start with benchmarks this time so we can see how it performs. It’s also not clear that the caching tradeoffs that make sense for our project make sense for everyone, so it might make sense to allow a custom cache (below).
  2. Custom cache implementations. I also had a PR that didn’t go anywhere to allow for custom caches. Might make sense if we could consume a cache that has the same API as Map? If we did this first, I could implement it.
  3. Use the file path as a hint, but the file hash as the source of truth. We build in a separate docker container than the one we deploy to, but they’re perfectly compatible. It would be nice to warm the cache, then copy it over. The only hitch so far is that the absolute file paths don’t match, but the project-relative paths do. (This might be outdated if something has changed since I last looked into babel’s caching.)

I’d love some feedback on these ideas. We’re hoping to get something done and submitted for review by the end of the week. Thanks!