runtime: Regex incorrectly identifies surrogate pair Unicode category
This defect makes Regex unusable when you need to match a string using Unicode categories and the string may contain surrogate pairs:
// Assumes NUnit for the [Test] attribute and Assert calls.
using System.Globalization;
using System.Text.RegularExpressions;
using NUnit.Framework;

[Test]
public void ClassifySurrogates_Test()
{
    var value = "𦕒"; // single code point U+26552, encoded as a surrogate pair
    var regexLetters = new Regex(@"^\p{Lo}$");
    var regexSurrogates = new Regex(@"^\p{Cs}{2}$");

    // The globalization APIs classify the pair as a letter...
    Assert.AreEqual(UnicodeCategory.OtherLetter, char.GetUnicodeCategory(value, 0));
    Assert.AreEqual(UnicodeCategory.OtherLetter, CharUnicodeInfo.GetUnicodeCategory(value, 0));

    // ...but Regex sees two surrogate code units rather than one letter.
    Assert.True(regexSurrogates.IsMatch(value));

    // Fails here
    Assert.True(regexLetters.IsMatch(value));
}
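To spell out why the last assertion fails: U+26552 is a single code point stored as two UTF-16 code units, and today's engine classifies each unit on its own. A small standalone sketch (the class and method names are just for illustration) showing what the engine sees:

using System;
using System.Globalization;

class SurrogateBreakdown
{
    static void Main()
    {
        var value = "𦕒";
        Console.WriteLine(value.Length);                                 // 2 (UTF-16 code units)
        Console.WriteLine(char.ConvertToUtf32(value, 0).ToString("X"));  // 26552 (one code point)
        Console.WriteLine(char.GetUnicodeCategory(value[0]));            // Surrogate
        Console.WriteLine(char.GetUnicodeCategory(value[1]));            // Surrogate
        // The regex engine walks value[0] and value[1] individually,
        // so \p{Cs}{2} matches and \p{Lo} does not.
    }
}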
About this issue
- State: open
- Created 8 years ago
- Comments: 36 (16 by maintainers)
This absolutely is a huge weakness in dotnet regex, and an issue that I don’t have in JS, Python, or Java, for example.
Cool. If you end up doing that, it would be interesting to share back here how it went.
@danmoseley
I’m the author of the Rust regex engine and I’d be happy to answer questions. I’m not sure my experience will be terribly relevant here because the Rust regex engine doesn’t consider UTF-16 at all. It works on arbitrary byte sequences that are conventionally valid UTF-8. When Unicode mode is enabled (the u flag, the default), then it treats the codepoint, not the code unit, as the fundamental atom of matching. Now internally, it still does its search one byte at a time, and achieves the “match one codepoint at a time” property by constructing UTF-8 automata. (Which I imagine are far more complex than what you’d need for UTF-16.) But you might not need automata at all. It might “just” require moving your internal engines from working one code unit at a time to one codepoint at a time. Of course, that is likely very much easier said than done.

Anyway, I’ll stop there, but I’d be happy to brainstorm.
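For the UTF-16 side, "one codepoint at a time" roughly means iterating Runes instead of chars. A minimal standalone sketch of codepoint-level classification over a ReadOnlySpan<char>; the CodepointScan helper is made up for illustration and has nothing to do with the engine's internals, and it assumes System.Text.Rune (available since .NET Core 3.0):

using System;
using System.Buffers;
using System.Globalization;
using System.Text;

static class CodepointScan
{
    // Returns true if every code point in the input (decoding surrogate
    // pairs as single code points) belongs to the given category,
    // roughly what a codepoint-level \p{Lo}+ check would do.
    public static bool AllInCategory(ReadOnlySpan<char> input, UnicodeCategory category)
    {
        while (!input.IsEmpty)
        {
            if (Rune.DecodeFromUtf16(input, out Rune rune, out int consumed) != OperationStatus.Done)
                return false; // lone/unpaired surrogate: not a valid code point

            if (Rune.GetUnicodeCategory(rune) != category)
                return false;

            input = input.Slice(consumed); // advance by 1 or 2 chars
        }
        return true;
    }
}

With this, CodepointScan.AllInCategory("𦕒", UnicodeCategory.OtherLetter) returns true, which is the codepoint-level answer the \p{Lo} pattern in the original test is asking for.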
The representation is a UTF-16 ReadOnlySpan<char>. The implementation today looks at individual char (16-bit) values.

Backtracking requires the ability to randomly access data in the input, but only to locations previously seen. There are also optimizations that utilize random access to jump forwards, such as jumping to some offset from the end of the input in some cases involving end anchors. Etc.
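To illustrate what surrogate awareness would add to that kind of random access: stepping backwards by one code point must not land in the middle of a pair. A hypothetical helper (not engine code) could look like:

using System;

static class SurrogateAwareIndexing
{
    // Given an index just past a code point, returns the index where that
    // code point starts: one char back normally, two chars back when the
    // preceding char is the low half of a surrogate pair.
    public static int StepBack(ReadOnlySpan<char> input, int index)
    {
        if (index <= 0)
            throw new ArgumentOutOfRangeException(nameof(index));

        if (index >= 2 &&
            char.IsLowSurrogate(input[index - 1]) &&
            char.IsHighSurrogate(input[index - 2]))
        {
            return index - 2;
        }
        return index - 1;
    }
}

Every place that currently moves back a single char would need a check like this (or an equivalent precomputed structure), which hints at why the change would be invasive.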
If we were to do anything here, we’d want to compute as much as possible as part of the construction/codegen phase, e.g. as part of the Regex ctor / parsing / tree optimization and/or as part of the source generator. It would also very likely need to be opt-in, such as with a new RegexOptions flag, both for compatibility and for performance. We already use RegexOptions to impact how parsing / tree optimization is performed, e.g. RegexOptions.IgnoreCase causes individual characters in the pattern to be translated at construction/parsing time into the corresponding case-insensitive sets. We’d likely do some amount of similar transformation as part of constructing the tree. We’d also probably need a new representation for known categories like \w, which would then allow matching and code generation to keep their current optimizations for such sets while making the more expensive Unicode matching pay-for-play.

Makes sense. Thanks for the feedback. Once we get to that point on our project, we can discuss option (3), contributing surrogate pair support to dotnet. I’m not sure if this is likely, but if we think it seems like a good idea, I’ll get back in touch here. Feel free to do with this issue as you see fit in the meantime.
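As a rough illustration of the construction-time expansion idea mentioned above, a pattern fragment for astral code points in a given category can be precomputed and spliced into a pattern today, entirely in user code. This is only a sketch of that kind of transform, not a proposal for the engine's internal representation; the helper name is made up, and enumerating the full astral range this way is slow enough that the result should be cached:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class SurrogateAwarePattern
{
    // Builds a regex fragment matching any single code point in
    // U+10000..U+10FFFF that belongs to the given category, spelled out as
    // explicit high-surrogate/low-surrogate pairs so that today's
    // char-based engine can match it.
    public static string AstralCategoryClass(UnicodeCategory category)
    {
        // Group matching code points by high surrogate; each group becomes
        // one alternative with a character class of low surrogates.
        var lowsByHigh = new SortedDictionary<char, List<char>>();
        for (int cp = 0x10000; cp <= 0x10FFFF; cp++)
        {
            if (Rune.GetUnicodeCategory(new Rune(cp)) != category)
                continue;
            string pair = char.ConvertFromUtf32(cp);
            if (!lowsByHigh.TryGetValue(pair[0], out var lows))
                lowsByHigh[pair[0]] = lows = new List<char>();
            lows.Add(pair[1]);
        }

        var sb = new StringBuilder("(?:");
        bool first = true;
        foreach (var kvp in lowsByHigh)
        {
            if (!first) sb.Append('|');
            first = false;
            sb.AppendFormat(@"\u{0:X4}[", (int)kvp.Key);
            var lows = kvp.Value;
            // Collapse consecutive low surrogates into ranges.
            for (int i = 0; i < lows.Count; )
            {
                int j = i;
                while (j + 1 < lows.Count && lows[j + 1] == lows[j] + 1) j++;
                sb.AppendFormat(@"\u{0:X4}", (int)lows[i]);
                if (j > i) sb.AppendFormat(@"-\u{0:X4}", (int)lows[j]);
                i = j + 1;
            }
            sb.Append(']');
        }
        return sb.Append(')').ToString();
    }
}

// Usage (illustrative): make the letter pattern also cover astral letters.
var letters = new Regex(@"^(?:\p{Lo}|" + SurrogateAwarePattern.AstralCategoryClass(UnicodeCategory.OtherLetter) + @")$");
// letters.IsMatch("𦕒") is now true.

This comes at the cost of a much larger pattern; the comment above is about doing a similar translation inside the engine, where it can be represented and optimized properly.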
I see, thanks for sharing. Our implementation of Regex does assume in a lot of places that we are operating over single characters, and no one has really researched what it would take to add support for surrogates (but I do expect it would likely be a lot of work). In order to move forward, we would need someone to design it and to ensure that we can do it in a way that won’t regress any existing case (as we really only expect surrogates to show up in less than 1% of cases). For anyone interested in coming up with this proposal, I would be happy to review it, but this is not one of our current priorities.