runtime: CompareInfo.IsSuffix sometimes incorrectly splits collation elements and returns 'true'

Repro (.NET 5 on ICU):

var ci = CultureInfo.InvariantCulture.CompareInfo;
int idx = ci.IndexOf("\u0600x", "x", CompareOptions.None);
bool isSuffix = ci.IsSuffix("\u0600x", "x", CompareOptions.None);
Console.WriteLine($"idx = {idx}, isSuffix = {isSuffix}");

This sample prints idx = -1, isSuffix = True. These return values are inconsistent with respect to each other. If IsSuffix returns true, then IndexOf must return a non-negative value. It’s the contrapositive of the code comment at:

https://github.com/dotnet/runtime/blob/eb187d1250cf65c16442c207ff97da65690afae5/src/libraries/System.Private.CoreLib/src/System/String.Searching.cs#L270-L273

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I have looked deeply at this issue, and I am seeing the underlying problem is coming from ICU and not from our code.

When using ICU and comparing the two strings x and \u0600x, it returns the result indicating the two strings are equal which is correct. The reason is the character \u0600 is ignorable character according to Unicode Standard. But when try to search for x in the string \u0600x using ICU APIs like usearch_search or usearch_last, it returns -1 which is wrong.

I have reported the issue to ICU through the ticket. I mentioned the details in the ticket and the possible root cause of this problem.

Note:

In .NET, the problem shows up when doing:

int idx = ci.IndexOf("\u0600x", "x", CompareOptions.None);

but will not show up if we do

int idx = ci.LastIndexOf("\u0600x", "x", CompareOptions.None);

The reason is, in the first case we search forward which cause the processing encounter \u0600 first causing it to fallback to call ICU. In the second case we find the match (in our optimized code path) before hitting \u0600 and we’ll not call ICU at all in such case.

I have wrongly assumed that at some point Windows is going to change the internals of NLS API to be implemented by using ICU inside (and hit the point they need to implement startswith|endswith).

One more idea is to use BreakIterators

note that I expect this will happen with other codepoints (e.g. \u0601 too). I have some theory around these but let’s wait if anyone officially reply the bug.