runtime: Resolve inconsistency between \w and \b determination for what is a word character

Currently our definition of \b explicitly adds \u200c and \u200d beyond what \w represents. The TR18 report calls out that these should be part of word boundaries, but there’s ambiguity as to whether they should also be part of \w. We appear to have made an explicit choice at some point to not include these two additional chars in the definition of \w, but that seems like a strange inconsistency, one which we bend over backwards to maintain in a variety of places. We should reconsider whether this discrepancy continues to make sense, or whether it would make sense for these to also be considered a match for \w.
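To make the discrepancy concrete, here is a minimal Python sketch of the two predicates as described above. It is not the actual runtime code: the function names are hypothetical, and the category set is taken from the list given later in this thread.

```python
import unicodedata

# General categories that make up \w, per the list given in this thread.
WORD_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Nd", "Pc"}
# ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
JOINERS = {"\u200c", "\u200d"}


def is_word_char(ch: str) -> bool:
    """Hypothetical stand-in for the \\w test."""
    return unicodedata.category(ch) in WORD_CATEGORIES


def is_boundary_word_char(ch: str) -> bool:
    """Hypothetical stand-in for the test \\b uses: \\w plus the two joiners."""
    return is_word_char(ch) or ch in JOINERS


def is_boundary(text: str, index: int) -> bool:
    """True if a \\b-style boundary exists immediately before text[index]."""
    before = index > 0 and is_boundary_word_char(text[index - 1])
    after = index < len(text) and is_boundary_word_char(text[index])
    return before != after


sample = "a\u200db"
print(is_word_char(sample[1]))  # False: the joiner is not a \w match
print(is_boundary(sample, 1))   # False: no boundary between 'a' and the joiner
print(is_boundary(sample, 2))   # False: no boundary between the joiner and 'b'
```

Under these assumptions, the joiner sits inside a "word" as far as \b is concerned, even though \w itself would not match it.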

cc: @veanes

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

And I suppose there is no Unicode Category that is already imported into \w that could just take these extra two characters so as to not require ranges to define it?

Right. \w is exactly the combination of:

  • UppercaseLetter
  • LowercaseLetter
  • TitlecaseLetter
  • ModifierLetter
  • OtherLetter
  • NonSpacingMark
  • DecimalDigitNumber
  • ConnectorPunctuation

\u200c and \u200d are instead in the Format category, which today contains 38 other characters as well, so we can’t just add Format to the list.
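The category claim can be checked with Python's `unicodedata` module, as a rough sketch. The exact number of other Format characters printed here depends on the Unicode version bundled with the interpreter and on whether supplementary planes are counted, so it will not necessarily match the figure quoted above.

```python
import sys
import unicodedata

JOINERS = ("\u200c", "\u200d")

# Both joiners report the Format (Cf) general category.
for ch in JOINERS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")

# Every other Format character known to this interpreter's Unicode tables.
other_format_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cf" and chr(cp) not in JOINERS
]

# The count varies by Unicode version; it is printed rather than assumed.
print(len(other_format_chars))

# Adding the whole category to \w would also pull in characters such as
# SOFT HYPHEN (U+00AD) and ZERO WIDTH NO-BREAK SPACE (U+FEFF).
print("\u00ad" in other_format_chars, "\ufeff" in other_format_chars)
```

This illustrates why the two characters would have to be added individually rather than by importing the whole category.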

What about the alternative: removing the \u200c and \u200d special-casing from \b so that it considers the identical set as \w?

I’ve gone back and forth, but I think if we’re going to make a change, we should do it “right”, and I think “right” would be making both of these word characters. I don’t see a good path to doing so for the foreseeable future, though. Given that, I think the least bad option is to just stick with what we currently have.

As such, I’ll close this. If we ever get negative feedback about it, and/or if we ever change our set representation to better enable this, we can revisit.

ICU includes \u200c and \u200d in the definition of \w: https://unicode-org.github.io/icu/userguide/strings/regexp.html

Word characters are [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d].

and TR18 states: http://www.unicode.org/reports/tr18/

To meet this requirement, an implementation shall extend the word boundary mechanism so that:
The class of <word_character> includes all the Alphabetic values from the Unicode character database, from [UnicodeData.txt](https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt), plus the decimals (General_Category=Decimal_Number, or equivalently Numeric_Type=Decimal), and the U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER (Join_Control=True). See also [Annex C: Compatibility Properties](http://www.unicode.org/reports/tr18/#Compatibility_Properties).

Java also includes it: https://regex101.com/r/OBF3hj/1

as does Rust: https://rustexp.lpil.uk/

https://github.com/rust-lang/regex/blob/9ca3099037dcb2faf1b49e6493f4c758532f2da1/regex-syntax/src/unicode.rs#L322-L338

https://github.com/rust-lang/regex/blob/c31428ad59c1f9b13a1cc6dac0ac48e50eff512c/regex-syntax/src/unicode_tables/perl_word.rs#L307

Note as well this from Rust’s Unicode.md:

In particular, this differs slightly from the [prescription given in RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries) but is permissible according to [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties). Namely, it is convenient and simpler to have \w and \b be in sync with one another.

I’ll put up a PR.

TR18 is probably less important than what PCRE, Perl, etc. do. I can check.