runtime: Breaking change proposal: OrdinalIgnoreCase string comparison, ToUpperInvariant, and ToLowerInvariant to use ICU on all platforms

Proposal

.NET Core provides APIs to compare strings for ordinal case-insensitive equality (such as via StringComparer.OrdinalIgnoreCase). The current implementation of this API is to call ToUpperInvariant on each string, then compare the resulting uppercase strings for bitwise equality.
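The behavior described above can be modeled with a minimal sketch. `OrdinalIgnoreCaseSketch` and `EqualsViaUpperInvariant` are hypothetical names used only for illustration; the real runtime implementation is more optimized and is not reproduced here.

```csharp
using System;

public static class OrdinalIgnoreCaseSketch
{
    // Conceptual model only: upper-case both inputs with the invariant
    // casing rules, then compare the results code unit by code unit.
    public static bool EqualsViaUpperInvariant(string a, string b)
    {
        if (a is null || b is null)
        {
            return ReferenceEquals(a, b);
        }

        return string.Equals(
            a.ToUpperInvariant(),
            b.ToUpperInvariant(),
            StringComparison.Ordinal);
    }
}
```

Under this model, any change to the invariant upper-casing table is directly observable as a change in OrdinalIgnoreCase equality.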

.NET Core also provides methods to convert chars, Runes, and strings to uppercase or lowercase using the “invariant” culture (ToUpperInvariant / ToLowerInvariant). The current implementation of this API is to p/invoke NLS on Windows or ICU on non-Windows.

I propose changing the logic so that .NET Core carries its own copy of the ICU “simple” casing tables, and we consult our copies of those tables on all operating systems. This would affect string comparison only when using an OrdinalIgnoreCase comparison, and it would affect string casing only when using CultureInfo.InvariantCulture.

Justification

Today, when processing UTF-8 data in a case-insensitive manner (such as via Equals(..., OrdinalIgnoreCase)), we must first transcode the data to UTF-16 so that it can go through the normal p/invoke routines. This transcoding and p/invoke adds unnecessary overhead. With this proposal, we’d be able to consult our own local copies of the casing table, which eliminates much of this overhead and streamlines the comparison process. This performance boost should also be applicable to existing UTF-16 APIs such as string.ToUpperInvariant and string.ToLowerInvariant since we’d be able to optimize those calls.
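To make the overhead concrete, here is a rough sketch of the transcode-then-compare pattern a caller with UTF-8 data is pushed toward today. `Utf8CompareSketch` and its method are hypothetical names, and this is not the runtime's internal code path; it only illustrates the two extra string allocations and the UTF-8 to UTF-16 conversion that occur before any casing work starts.

```csharp
using System;
using System.Text;

public static class Utf8CompareSketch
{
    // Transcodes both UTF-8 inputs to UTF-16 strings, then defers to the
    // existing UTF-16 OrdinalIgnoreCase comparison.
    public static bool EqualsIgnoreCaseViaTranscoding(
        ReadOnlySpan<byte> utf8A,
        ReadOnlySpan<byte> utf8B)
    {
        string a = Encoding.UTF8.GetString(utf8A); // allocation + transcoding
        string b = Encoding.UTF8.GetString(utf8B); // allocation + transcoding
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }
}
```

With a casing table carried inside the runtime, a comparison could in principle walk the UTF-8 bytes directly and skip both allocations.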

As mentioned earlier, today’s casing tables involve p/invoking NLS or ICU, depending on platform. This means that comparison / invariant casing APIs could provide different results on different operating systems. Even within the same operating system family, the casing tables can change based on OS version. (Windows 10 1703 has different casing tables than Windows 10 1903, for instance.)

Here are some concrete examples demonstrating the problems:

// 'ß' is U+00DF LATIN SMALL LETTER SHARP S
// 'ẞ' is U+1E9E LATIN CAPITAL LETTER SHARP S

string toUpper = "ß".ToUpperInvariant(); // returns "ß" on all OSes
string toLower = "ẞ".ToLowerInvariant(); // returns "ẞ" on Windows, otherwise "ß"
bool areEqual = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase); // returns "False" on Windows, otherwise "True"

With this proposal, the code above will behave the same across all OSes: it would follow today’s non-Windows behavior. The results would be locked to whatever version of the Unicode data we include in the product as part of the CharUnicodeInfo class. This data changes each release to reflect recent modifications to the Unicode Standard. As of this writing, the data contained within the CharUnicodeInfo class follows the Unicode Standard 11.0.0.

Breaking change discussion

Affected APIs:

string / char / Rune equality methods or hash code generation routines which take StringComparison.OrdinalIgnoreCase as a parameter. All other comparisons are unchanged.
string / char / Rune case changing methods when CultureInfo.InvariantCulture is provided. All other cultures are unchanged.
Extension methods on ReadOnlySpan<char> which provide equivalent functionality to the above.
StringComparer.OrdinalIgnoreCase. All other StringComparer instances are unchanged.
Case changing methods on CultureInfo.InvariantCulture.TextInfo.

If GlobalizationMode.Invariant is specified, the behavior will be the same as it is today, where non-ASCII characters remain unchanged.

Applications which depend on OrdinalIgnoreCase equality being stable may be affected by this proposed change. That is, if an application relies on "ß" and "ẞ" being not equal under an OrdinalIgnoreCase comparer, that application is likely to experience odd behavior in the future.

In general, applications cannot rely on such behavior anyway, because as previously mentioned the operating system historically has updated casing tables under the covers without the application getting a say. For example, after installing a new Windows version, a comparison which previously returned false might start returning true:

string a = "ꝍ"; // U+A74D
string b = "Ꝍ"; // U+A74C

// today, may be "True" or "False" depending on which Windows version the app is running on.
// with this proposal, always returns "True"
bool areEqual = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

Furthermore, the string equality and case mapping information might be different between a web frontend application and the database it’s using for backend storage. So performing such checks at the application level was never 100% reliable to begin with.

There is a potential oddity with this proposal: depending on operating system, two strings which compare as equal using OrdinalIgnoreCase might compare as not equal using InvariantCultureIgnoreCase. For example:

// with this proposal, returns "True" across all OSes
bool equalsOIC = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase);

// with this proposal, returns "False" on Windows, "True" otherwise
bool equalsICIC = "ß".Equals("ẞ", StringComparison.InvariantCultureIgnoreCase);

I don’t expect this to trip up most applications because I don’t believe it to be common for an application to compare a string pair using two different comparers, but it is worth pointing out as a curious edge case.
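An application that wants to know whether it is exposed to this edge case could run both comparisons side by side. The helper below is a hypothetical diagnostic sketch, not part of the proposal; it makes no claim about what either comparison returns for a given pair on a given OS.

```csharp
using System;

public static class ComparerDivergenceCheck
{
    // Returns the results of both comparisons so callers can log or
    // assert on pairs where the two comparers disagree.
    public static (bool Ordinal, bool InvariantCulture) CompareBothWays(string a, string b)
    {
        bool ordinal = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
        bool invariant = string.Equals(a, b, StringComparison.InvariantCultureIgnoreCase);
        return (ordinal, invariant);
    }

    // True when the two comparers give different answers for this pair.
    public static bool Diverges(string a, string b)
    {
        var (ordinal, invariant) = CompareBothWays(a, b);
        return ordinal != invariant;
    }
}
```

Calling `ComparerDivergenceCheck.Diverges("ß", "ẞ")` on each target platform would show whether the pair from the example above is affected there.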

This may also lead to a discrepancy between managed code which uses StringComparison.OrdinalIgnoreCase and unmanaged code (including within the runtime) which uses CompareStringOrdinal on Windows. I cannot think offhand of any components which do this, but we need to be mindful that such a discrepancy might occur.

About this issue

State: open
Created 5 years ago
Reactions: 10
Comments: 47 (41 by maintainers)

Most upvoted comments

What are we going to do about the Windows versions that do not include ICU? I doubt that we are going to drop support for those in the near future.

We are talking to the Windows team, who are going to publish NuGet packages for ICU. These packages can be used on down-level Windows versions. Also, we are going to fall back to NLS if we can’t find ICU on the system. I am going to write a doc with more details and will share it as soon as it is ready.

tarekgh on Oct 6, 2019

@StachuDotNet the gist you pointed at was created to mimic the behavior of the casing when running on the .NET Framework and not to be used for .NET 5.0 or 6.0. The result given by the gist code is exactly the result when using ToUpperInvariant on the .NET Framework. So, what you are getting is the expected result.

Also, the Unicode Standard data file UnicodeData.txt does not show these characters as having cased forms.

FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;

Please note that casing operations in .NET are one-to-one mappings.

tarekgh on Mar 22, 2022
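One observable consequence of the one-to-one mapping noted in the comment above is that invariant casing never changes the number of UTF-16 code units in a string, so a ligature such as U+FB01 cannot expand to "FI". The check below is a small illustrative sketch; `CasingLengthCheck` is a hypothetical name.

```csharp
using System;

public static class CasingLengthCheck
{
    // With one-to-one (simple) case mapping, the upper- and lower-cased
    // strings always have the same length as the input.
    public static bool PreservesLength(string s)
    {
        return s.ToUpperInvariant().Length == s.Length
            && s.ToLowerInvariant().Length == s.Length;
    }
}
```

For example, `CasingLengthCheck.PreservesLength("\uFB01")` can be used to confirm that the ligature is not expanded into two letters.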

I just ran string.ToUpperInvariant over the whole Unicode character range to collect all cased characters, and from that list I created the 8-4-4 table (to optimize the size of the table), which can be used for the lookup. If you want to see how to generate such a table, you may look at https://gist.github.com/tarekgh/55dfaf0f44689738c3a6ca67941ccdc2#file-casefolding-cs-L87.

tarekgh on May 25, 2021
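For readers unfamiliar with the term, an 8-4-4 table splits a 16-bit code unit into its top 8 bits, middle 4 bits, and low 4 bits, and uses each part to index one level of a three-level table. Blocks that contain no cased characters are shared, which keeps the table small. The following is a minimal sketch of that idea, built from whatever `char.ToUpperInvariant` returns on the machine that runs it. The class name and layout are assumptions for illustration and are not taken from the linked gist; the sketch covers only the Basic Multilingual Plane.

```csharp
using System;
using System.Collections.Generic;

public static class CaseTable844
{
    // Level 1: top 8 bits -> index of a level-2 block.
    private static readonly ushort[] Level1 = new ushort[256];
    // Level 2: middle 4 bits -> index of a level-3 block.
    private static readonly List<ushort[]> Level2 = new List<ushort[]>();
    // Level 3: low 4 bits -> upper-case char, or '\0' for "no mapping".
    private static readonly List<char[]> Level3 = new List<char[]>();

    static CaseTable844()
    {
        // Block 0 at levels 2 and 3 is the shared "nothing mapped" block.
        Level2.Add(new ushort[16]);
        Level3.Add(new char[16]);

        for (int hi = 0; hi < 256; hi++)
        {
            var l2 = new ushort[16];
            bool l2Used = false;

            for (int mid = 0; mid < 16; mid++)
            {
                var l3 = new char[16];
                bool l3Used = false;

                for (int lo = 0; lo < 16; lo++)
                {
                    char c = (char)((hi << 8) | (mid << 4) | lo);
                    char upper = char.ToUpperInvariant(c);
                    if (upper != c)
                    {
                        l3[lo] = upper;
                        l3Used = true;
                    }
                }

                if (!l3Used) continue; // keep pointing at shared block 0
                Level3.Add(l3);
                l2[mid] = (ushort)(Level3.Count - 1);
                l2Used = true;
            }

            if (!l2Used) continue; // keep pointing at shared block 0
            Level2.Add(l2);
            Level1[hi] = (ushort)(Level2.Count - 1);
        }
    }

    // Three array reads per lookup: 8 bits, then 4 bits, then 4 bits.
    public static char ToUpper(char c)
    {
        ushort[] l2 = Level2[Level1[c >> 8]];
        char[] l3 = Level3[l2[(c >> 4) & 0xF]];
        char mapped = l3[c & 0xF];
        return mapped == '\0' ? c : mapped;
    }
}
```

Because the table is generated from the current platform's results, `CaseTable844.ToUpper` reproduces that platform's behavior; to freeze the behavior of a different platform, the table contents would need to be generated there and embedded as constants.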

@talktovishal I’ll try to provide you the code in the next couple of days.

tarekgh on May 23, 2021

@talktovishal I don’t think there is any way to guarantee the NETFX casing behavior when running on Linux, except by generating your own casing table on Windows and then performing the casing operation manually using that table. If you want, I can try to provide some code that does this, if that is a feasible option for you. How urgently do you need it?

tarekgh on May 23, 2021
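The approach suggested in the comment above (capture a casing table on one platform, then apply it manually on another) could look roughly like the sketch below. This is not the code that was later provided in the thread. The class name, the `XXXX;YYYY` line format, and the BMP-only scope are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class CasingTableFile
{
    // Run on the reference platform: writes one "source;upper" line (hex)
    // for every BMP char whose invariant upper-case form differs from it.
    public static void Write(TextWriter writer)
    {
        for (int i = 0; i <= char.MaxValue; i++)
        {
            char upper = char.ToUpperInvariant((char)i);
            if (upper != (char)i)
            {
                writer.WriteLine($"{i:X4};{(int)upper:X4}");
            }
        }
    }

    // Run on the target platform: loads the captured table.
    public static Dictionary<char, char> Read(TextReader reader)
    {
        var map = new Dictionary<char, char>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split(';');
            char from = (char)ushort.Parse(parts[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            char to = (char)ushort.Parse(parts[1], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            map[from] = to;
        }
        return map;
    }

    // Applies the captured table one char at a time. Surrogate pairs are
    // left untouched because the table only covers the BMP.
    public static string ToUpper(string s, Dictionary<char, char> map)
    {
        char[] buffer = s.ToCharArray();
        for (int i = 0; i < buffer.Length; i++)
        {
            if (map.TryGetValue(buffer[i], out char upper))
            {
                buffer[i] = upper;
            }
        }
        return new string(buffer);
    }
}
```

A dictionary is used here for clarity; a compact multi-level lookup table would be the more space-efficient choice for production use.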

GetSortKey doesn’t allow passing CompareOptions.OrdinalIgnoreCase. It throws.

tarekgh on Sep 25, 2019