runtime: CompareInfo.IsSuffix sometimes incorrectly splits collation elements and returns 'true'
Repro (.NET 5 on ICU):
var ci = CultureInfo.InvariantCulture.CompareInfo;
int idx = ci.IndexOf("\u0600x", "x", CompareOptions.None);
bool isSuffix = ci.IsSuffix("\u0600x", "x", CompareOptions.None);
Console.WriteLine($"idx = {idx}, isSuffix = {isSuffix}");
This sample prints idx = -1, isSuffix = True
. These return values are inconsistent with respect to each other. If IsSuffix returns true, then IndexOf must return a non-negative value. It’s the contrapositive of the code comment at:
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 17 (17 by maintainers)
I have looked deeply at this issue, and I am seeing the underlying problem is coming from ICU and not from our code.
When using ICU and comparing the two strings
x
and\u0600x
, it returns the result indicating the two strings are equal which is correct. The reason is the character\u0600
is ignorable character according to Unicode Standard. But when try to search forx
in the string\u0600x
using ICU APIs likeusearch_search
orusearch_last
, it returns-1
which is wrong.I have reported the issue to ICU through the ticket. I mentioned the details in the ticket and the possible root cause of this problem.
Note:
In .NET, the problem shows up when doing:
but will not show up if we do
The reason is, in the first case we search forward which cause the processing encounter
\u0600
first causing it to fallback to call ICU. In the second case we find the match (in our optimized code path) before hitting\u0600
and we’ll not call ICU at all in such case.I have wrongly assumed that at some point Windows is going to change the internals of NLS API to be implemented by using ICU inside (and hit the point they need to implement startswith|endswith).
One more idea is to use BreakIterators
note that I expect this will happen with other codepoints (e.g.
\u0601
too). I have some theory around these but let’s wait if anyone officially reply the bug.