meriyah: String literals are incorrectly parsed

subj

module.exports = '\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF';

here is the source https://raw.githubusercontent.com/zloirock/core-js/master/packages/core-js/internals/whitespaces.js

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 25 (20 by maintainers)

Most upvoted comments

I see multiple technical issues involved in this thread.

\u180e

‘\u180e’ is not a valid whitespace.

Correct. Historically, when MONGOLIAN VOWEL SEPARATOR was introduced it was categorized as Zs (space separator). In 2013 it was changed to Cf (format) (ref: https://www.unicode.org/L2/L2013/13004-vowel-sep-change.pdf), and the change was published in Unicode 6.3.

Unfortunately, such a change can take a very long time to reach every downstream project that depends on Unicode. So please file a bug on angular saying that \u180e should not be included in WS_CHARS.
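To see how a particular engine classifies \u180e, you can test it against \s and the Unicode property escapes. This is only a rough sketch: it assumes an engine with ES2018 property escapes (for example a recent Node.js), and the helper name classify is made up for illustration.

```javascript
// Hypothetical helper: report how the running engine treats one character.
function classify(ch) {
  return {
    // \s covers the ECMAScript WhiteSpace and LineTerminator productions
    matchesWhitespaceClass: /\s/.test(ch),
    // Unicode general category Zs (space separator)
    isSpaceSeparator: /\p{Zs}/u.test(ch),
    // Unicode general category Cf (format)
    isFormat: /\p{Cf}/u.test(ch),
  };
}

console.log(classify('\u180e')); // MONGOLIAN VOWEL SEPARATOR
console.log(classify('\u1680')); // OGHAM SPACE MARK, for comparison
```

On an engine whose Unicode data postdates the recategorization, \u180e should be reported as Cf and not as whitespace, while \u1680 should still be reported as Zs.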

weird looking texts on REPL

https://user-images.githubusercontent.com/1629088/75985511-bc242680-5eec-11ea-925b-35da4da05a12.png

There are three red dots in the parsed "value" key of the string literal. They represent \u2028, \u2029 and \ufeff respectively. The Meriyah REPL uses CodeMirror to pretty-print the AST, and CodeMirror renders a “special char” as \u2022 (Bullet), so a red dot is printed.

https://github.com/codemirror/CodeMirror/blob/01758b19565384414306816b43b5f35d81f039a3/src/line/line_data.js#L122

Note that when you copy from the AST, CodeMirror gives you the raw text, so you can compare it to the escaped version in your DevTools console (yes, Chrome DevTools also uses CodeMirror).


how it can break an app

I just want to build my angular app in ES5 as I have IE11-using customers. If I use meriyah, it breaks in this single and specific way.

I have no idea how a parser can break an app without generating the app code from the parsed AST. So I guess here is the process:

source-code => (meriyah) => (generator) => production-code

For example, astring is a generator that can print an ESTree AST (such as the one generated by meriyah) as JavaScript code. TypeScript has a built-in parser and generator. One may also have their own generator.

In this case it can break the app because the literal contains \u2028 and \u2029. Suppose a generator does something like

`var ${decl.id.name} = "${decl.init.value}"`

The generated code will break on legacy platforms because \u2028 and \u2029 must be escaped in string literals prior to ES2019 (https://ecma-international.org/ecma-262/#sec-intro). Since \u2028 and \u2029 are not stored in their escaped form in decl.init.value, the generator may print the unescaped characters into the output.
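The failure mode described above can be sketched without any parser at all. The decl object below is hand-written to look like an ESTree declarator, and printNaive / printEscaped are hypothetical names, not functions from meriyah or astring. The escaped variant is one possible approach and only handles string values.

```javascript
// Hypothetical ESTree-style declarator for:  var sep = '\u2028';
// `value` holds the real U+2028 character, not the six-character escape.
const decl = {
  id: { type: 'Identifier', name: 'sep' },
  init: { type: 'Literal', value: '\u2028' },
};

// Naive printing: interpolates the cooked value straight into the output.
function printNaive(d) {
  return `var ${d.id.name} = "${d.init.value}"`;
}

// Safer printing for string values: JSON.stringify takes care of quotes,
// backslashes and control characters, but leaves U+2028 / U+2029 as they
// are, so those two are escaped explicitly afterwards.
function printEscaped(d) {
  const quoted = JSON.stringify(d.init.value)
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
  return `var ${d.id.name} = ${quoted}`;
}

const naive = printNaive(decl);
const escaped = printEscaped(decl);

// The naive output carries a raw line separator inside the string literal.
console.log(naive.includes('\u2028'));
// The escaped output contains only the escape sequence.
console.log(escaped);
```

A pre-ES2019 engine treats the raw U+2028 in the naive output as a line terminator inside the string literal and rejects the script, whereas the escaped output parses on both old and new engines.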

To preserve the raw text of the string literal, you can pass raw: true in the meriyah options, which adds a "raw" property holding the original source text alongside the cooked "value":

"init": {
  "type": "Literal",
  "value": " \f\n\r\t\u000b\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff",
  "raw": "' \\f\\n\\r\\t\\v\\u1680\\u180e\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff'"
}

The generator may print the string literal using decl.init.raw. If you are using your own generator, please revise and use decl.init.raw.
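For a custom generator, preferring raw could look roughly like the sketch below. The node is written by hand in the shape shown above rather than produced by an actual parser call, and printLiteral is a hypothetical name. The fallback branch is an assumption about what a generator might do when raw is missing, and it only escapes string values.

```javascript
// Hypothetical Literal node, shaped like parser output with `raw` enabled.
const literal = {
  type: 'Literal',
  value: '\u2028\u2029\ufeff',
  raw: "'\\u2028\\u2029\\ufeff'",
};

// Prefer the original source text. Fall back to serializing the value when
// `raw` is absent, e.g. when the node was created by a transform.
function printLiteral(node) {
  if (typeof node.raw === 'string') return node.raw;
  if (typeof node.value !== 'string') return String(node.value);
  return JSON.stringify(node.value).replace(
    /[\u2028\u2029]/g,
    (ch) => '\\u' + ch.charCodeAt(0).toString(16)
  );
}

console.log(printLiteral(literal));
console.log(printLiteral({ type: 'Literal', value: '\u2028' }));
```

Printing raw keeps the author's original escapes exactly as written, so the output is as portable as the input was.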

I’ll just make it clear, as I found the original problem. All this stuff is borderline black magic, so I think we all need to take a step back and appreciate for a second how hard this shit is and how big-brained we all are. It’s basically computer science. Coming from a lowly Angular developer.

I just want to build my angular app in ES5 as I have IE11-using customers. If I use meriyah, it breaks in this single and specific way. If I use ts, it builds fine but much slower. Can we focus on just solving this and moving forward pls