runtime: Resolve inconsistency between \w and \b determination for what is a word character
Currently our definition of \b explicitly adds \u200c and \u200d beyond what \w represents. The TR18 report calls out that these should be part of word boundaries, but there’s ambiguity as to whether they should also be part of \w. We appear to have made an explicit choice at some point to not include these two additional chars in the definition of \w, but that seems like a strange inconsistency, one which we bend over backwards to maintain in a variety of places. We should reconsider whether this discrepancy continues to make sense, or whether it would make sense for these to also be considered a match for \w.
cc: @veanes
About this issue
- State: closed
- Created 3 years ago
- Comments: 20 (20 by maintainers)
Right. \w is exactly the combination of:
- UppercaseLetter
- LowercaseLetter
- TitlecaseLetter
- ModifierLetter
- OtherLetter
- NonSpacingMark
- DecimalDigitNumber
- ConnectorPunctuation

\u200c and \u200d are instead in the Format category, which today contains 38 other characters as well, so we can’t just add Format to the list.

I’ve gone back and forth, but I think if we’re going to make a change, we should do it “right”, and I think “right” would be making both of these word characters. I don’t see a good path to doing so for the foreseeable future, though. Given that, I think the least bad option is to just stick with what we currently have.
As such, I’ll close this. If we ever get negative feedback about it, and/or if we ever change our set representation to better enable this, we can revisit.
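To make the discrepancy concrete, here is a rough Python model of the two definitions discussed in this thread. It is only a sketch, not the .NET implementation: the helper names (`is_w`, `is_boundary_word_char`, `is_word_boundary`) are made up for illustration, the category set is the list given in the comment above written as two-letter Unicode abbreviations, and the boundary rule assumes \b simply treats \u200c and \u200d as extra word characters, as the issue description says.

```python
import unicodedata

# General categories that make up \w according to the comment above:
# Lu/Ll/Lt/Lm/Lo = letters, Mn = NonSpacingMark,
# Nd = DecimalDigitNumber, Pc = ConnectorPunctuation.
WORD_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Nd", "Pc"}

# ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER; both are in category Cf (Format).
JOINERS = {"\u200c", "\u200d"}


def is_w(ch: str) -> bool:
    """Hypothetical model of \\w: membership in the eight categories only."""
    return unicodedata.category(ch) in WORD_CATEGORIES


def is_boundary_word_char(ch: str) -> bool:
    """Hypothetical model of what \\b counts as a word character: \\w plus the two joiners."""
    return is_w(ch) or ch in JOINERS


def is_word_boundary(text: str, i: int) -> bool:
    """True if position i sits between a word character and a non-word character."""
    before = i > 0 and is_boundary_word_char(text[i - 1])
    after = i < len(text) and is_boundary_word_char(text[i])
    return before != after


if __name__ == "__main__":
    s = "a\u200db"
    # The joiner is not matched by the modeled \w ...
    print(is_w(s[1]))
    # ... yet the modeled \b does not see a boundary on either side of it.
    print(is_word_boundary(s, 1), is_word_boundary(s, 2))
```

Under this model, the position between a letter and \u200d is not a word boundary even though \u200d itself fails the \w test, which is the inconsistency the issue is about. Adding the whole Cf category to `WORD_CATEGORIES` would remove the mismatch, but it would also pull in every other Format character, which is the problem noted above.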
ICU includes \u200c and \u200d in the definition of \w: https://unicode-org.github.io/icu/userguide/strings/regexp.html
and TR18 states: http://www.unicode.org/reports/tr18/
Java also includes it: https://regex101.com/r/OBF3hj/1
as does Rust: https://rustexp.lpil.uk/
https://github.com/rust-lang/regex/blob/9ca3099037dcb2faf1b49e6493f4c758532f2da1/regex-syntax/src/unicode.rs#L322-L338
https://github.com/rust-lang/regex/blob/c31428ad59c1f9b13a1cc6dac0ac48e50eff512c/regex-syntax/src/unicode_tables/perl_word.rs#L307
Note as well this from Rust’s Unicode.md:
I’ll put up a PR.
TR18 is probably less important than what PCRE, Perl, etc. do. I can check.