highlight.js: (parser) Improper tokenization with variable names in languages that use Unicode / UTF-8

Taking over this issue to track the overall process on this. Original issue is below.

Prework:

#2759 Update all our regexe to be UTF8 compliant (10.4)
Add Grammar#unicodeRegex to allow per-language UTF8 regex (11.x)
Performance testing - /u seems MUCH slower in my initial testing. (10.4)

Individual languages may now add support if they don’t trigger performance regressions:

JavaScript
TypeScript (I presume?)
All languages depending on ecmascript.js will need to be reviewed (that’s where the Javascript-like IDENT_RE is now)
Java
C
XML #3256
C++
R
Ruby
Python
Go
Rust*
Kotlin
C#
Swift

Perhaps one day:

Flip the parser to always using u by default (10.4) [needs resolution of performance slowdowns]
Ensure this change results in no overall regressions (10.4)
Consider MODES IDENT_RE and UNDERSCORE_IDENT_RE, although I think they should not change.

Before any of this is possible we’ll need to start building regexs with u mode. After that it should simply be a matter of altering the appropriate IDENT_REs for the language in question. Actually upon second though we do have IDENT_RE but I’m not confident that it should change (vs this being a per language opt-in)… perhaps an IDENT_RE_UTF8?

A few markup tests should be added to 2 or 3 of the languages (or perhaps unit tests if we add a new mode) to make sure the behavior is as expected.

Original issue

Describe the issue Most modern languages use the Unicode spec to decide what constitutes a proper variable name. Thus, hangul filler characters, even though they seem to be whitespace, are valid variable names. The variable name thisﾠisﾠaﾠvariableﾠname is a valid variable, although hl.js tokenizes the first word or so as separate from the rest. In highlighting systems that are aware of this, like Chrome’s dev tools, it is parsed correctly. The first snippet uses hangul fillers while the second one uses whitespace (which is why it throws an error):

Which language seems to have the issue? Every language that follows the Unicode spec for variable names theoretically. I’ve only tested the ones mentioned in the title, but it might make more sense to just tokenize every language as if it were one of those that followed the unicode spec since this should be the default.

Are you using highlight or highlightAuto? N/A

Sample Code to Reproduce N/A

Expected behavior hl.js should be using ID_Start and ID_Continue for the languages that use it, or at least use it as a default. The regex to match this is rather large, but it can be shortened significantly in regex engines supporting the u flag (modern JavaScript)

const matchIDStart = char => /\p{ID_Start}/u.test(char);
const matchIDContinue = char => /\p{ID_Continue}/u.test(char);

\p{ID_Start} matches any valid starting character in a variabe name \p{ID_Continue} matches any valid non-starting character in a variable name (such as numbers and some variation selectors)

About this issue

Original URL
State: open
Created 4 years ago
Comments: 21 (16 by maintainers)

Commits related to this issue

(chore) Clean up all regexs to be UTF-8 compliant/ready (#2759) Work toward #2756. Cleans up a lot of incorrect (unnecessary escaped) regex that would not compile with the `u` flag. After that mak... — committed to highlightjs/highlight.js by joshgoebel 4 years ago

Most upvoted comments

languageDetectRe

Yep, I’d say that would fix your issue.

joshgoebel on Feb 22, 2022

You’ll likely need to edit keywords/$pattern also. (See docs). The default pattern only matches ASCII keywords.

joshgoebel on Feb 21, 2022

As an aside, if there were some convenient way to switch all regex to use ID_Start and ID_Continue, I’m not sure it would break much, even in languages which require only ASCII-idents, ID_Start still doesn’t match for spaces or equals signs, so it’s rare that it would match too much.

Mostly it would all have to be done by hand… so if no one is having issues with a given language then I don’t see any harm in languages keeping their older [a-z] style rules… and then when (or if) it DOES provide to be a problem and we have someone on the horn who knows what the rules are for language xyz, then we can deal with them on a one off basis.

I think the first step here would be to slowly clean up our regexs to be UTF8 compatible if possible (avoiding the need for per language flags)… and then at that point it could be decide which languages needed new regex rules to take advantage of what the u flag has to offer.

joshgoebel on Oct 16, 2020