runtime: Breaking change proposal: OrdinalIgnoreCase string comparison, ToUpperInvariant, and ToLowerInvariant to use ICU on all platforms
Proposal
.NET Core provides APIs to compare strings for ordinal case-insensitive equality (such as via StringComparer.OrdinalIgnoreCase). The current implementation of this API is to call ToUpperInvariant on each string, then compare the resulting uppercase strings for bitwise equality.
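As a rough mental model of the behavior just described (not the runtime's actual code), the comparison can be sketched as follows. The class and method names are hypothetical and exist only for illustration.

```csharp
using System;

static class OrdinalIgnoreCaseSketch
{
    // Conceptual model only: uppercase both inputs with the invariant culture,
    // then compare the uppercase results code unit by code unit.
    public static bool EqualsIgnoreCase(string a, string b)
    {
        if (a is null || b is null)
        {
            // Two nulls are equal; null never equals a non-null string.
            return ReferenceEquals(a, b);
        }

        return string.Equals(a.ToUpperInvariant(), b.ToUpperInvariant(), StringComparison.Ordinal);
    }
}
```

The real implementation avoids allocating the intermediate uppercase strings; the sketch only shows which casing operation decides the result.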
.NET Core also provides methods to convert chars, Runes, and strings to uppercase or lowercase using the “invariant” culture (ToUpperInvariant / ToLowerInvariant). The current implementation of this API is to p/invoke NLS on Windows or ICU on non-Windows.
I propose changing the logic so that .NET Core carries its own copy of the ICU “simple” casing tables, and we consult our copies of those tables on all operating systems. This would affect string comparison only when using an OrdinalIgnoreCase comparison, and it would affect string casing only when using CultureInfo.InvariantCulture.
Justification
Today, when processing UTF-8 data in a case-insensitive manner (such as via Equals(..., OrdinalIgnoreCase)), we must first transcode the data to UTF-16 so that it can go through the normal p/invoke routines. This transcoding and p/invoke adds unnecessary overhead. With this proposal, we’d be able to consult our own local copies of the casing table, which eliminates much of this overhead and streamlines the comparison process. This performance boost should also be applicable to existing UTF-16 APIs such as string.ToUpperInvariant and string.ToLowerInvariant since we’d be able to optimize those calls.
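To illustrate what comparing UTF-8 data without first transcoding whole buffers to UTF-16 could look like, here is a minimal sketch that decodes one scalar value at a time and compares the uppercase forms. It uses the public Rune.DecodeFromUtf8 and Rune.ToUpperInvariant APIs, so it still relies on whatever casing data the running runtime uses. The class name, method name, and the handling of invalid UTF-8 are assumptions made for this example, not part of the proposal.

```csharp
using System;
using System.Buffers;
using System.Text;

static class Utf8IgnoreCaseSketch
{
    // Compares two UTF-8 buffers scalar by scalar using simple invariant uppercasing.
    public static bool EqualsIgnoreCase(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        while (!a.IsEmpty && !b.IsEmpty)
        {
            OperationStatus statusA = Rune.DecodeFromUtf8(a, out Rune runeA, out int consumedA);
            OperationStatus statusB = Rune.DecodeFromUtf8(b, out Rune runeB, out int consumedB);

            // Assumption for this sketch: when either side holds invalid or truncated
            // UTF-8, the remainders are equal only if they are byte-for-byte identical.
            if (statusA != OperationStatus.Done || statusB != OperationStatus.Done)
            {
                return a.SequenceEqual(b);
            }

            if (Rune.ToUpperInvariant(runeA) != Rune.ToUpperInvariant(runeB))
            {
                return false;
            }

            a = a.Slice(consumedA);
            b = b.Slice(consumedB);
        }

        // Equal only if both inputs were fully consumed.
        return a.IsEmpty && b.IsEmpty;
    }
}
```

With runtime-owned casing tables, the per-scalar uppercase step becomes a plain table lookup rather than a call into the operating system.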
As mentioned earlier, today’s casing tables involve p/invoking NLS or ICU, depending on platform. This means that comparison / invariant casing APIs could provide different results on different operating systems. Even within the same operating system family, the casing tables can change based on OS version. (Windows 10 1703 has different casing tables than Windows 10 1903, for instance.)
Here are some concrete examples demonstrating the problems:
// 'ß' is U+00DF LATIN SMALL LETTER SHARP S
// 'ẞ' is U+1E9E LATIN CAPITAL LETTER SHARP S
string toUpper = "ß".ToUpperInvariant(); // returns "ß" on all OSes
string toLower = "ẞ".ToLowerInvariant(); // returns "ẞ" on Windows, otherwise "ß"
bool areEqual = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase); // returns "False" on Windows, otherwise "True"
With this proposal, the code above will behave the same across all OSes. It would follow what is today’s non-Windows behavior, and the results would be locked to whatever version of the Unicode data we include in the product as part of the CharUnicodeInfo class. This data changes each release to reflect recent modifications to the Unicode Standard. As of this writing, the data contained within the CharUnicodeInfo class follows the Unicode Standard 11.0.0.
Breaking change discussion
Affected APIs:
- string / char / Rune equality methods or hash code generation routines which take StringComparison.OrdinalIgnoreCase as a parameter. All other comparisons are unchanged.
- string / char / Rune case changing methods when CultureInfo.InvariantCulture is provided. All other cultures are unchanged.
- Extension methods on ReadOnlySpan<char> which provide equivalent functionality to the above.
- StringComparer.OrdinalIgnoreCase. All other StringComparer instances are unchanged.
- Case changing methods on CultureInfo.InvariantCulture.TextInfo.
If GlobalizationMode.Invariant is specified, the behavior will be the same as it is today, where non-ASCII characters remain unchanged.
Applications which depend on OrdinalIgnoreCase equality being stable may be affected by this proposed change. That is, if an application relies on "ß" and "ẞ" being not equal under an OrdinalIgnoreCase comparer, that application is likely to see its behavior change once this proposal takes effect.
In general, applications cannot rely on such behavior anyway, because as previously mentioned the operating system historically has updated casing tables under the covers without the application getting a say. For example, after installing a new Windows version, a comparison which previously returned false might start returning true:
string a = "ꝍ"; // U+A74D
string b = "Ꝍ"; // U+A74C
// today, may be "True" or "False" depending on which Windows version the app is running on.
// with this proposal, always returns "True"
bool areEqual = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
Furthermore, the string equality and case mapping information might be different between a web frontend application and the database it’s using for backend storage. So performing such checks at the application level was never 100% reliable to begin with.
There is a potential oddity with this proposal: depending on operating system, two strings which compare as equal using OrdinalIgnoreCase might compare as not equal using InvariantCultureIgnoreCase. For example:
// with this proposal, returns "True" across all OSes
bool equalsOIC = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase);
// with this proposal, returns "False" on Windows, "True" otherwise
bool equalsICIC = "ß".Equals("ẞ", StringComparison.InvariantCultureIgnoreCase);
I don’t expect this to trip up most applications because I don’t believe it to be common for an application to compare a string pair using two different comparers, but it is worth pointing out as a curious edge case.
This may also lead to a discrepancy between managed code which uses StringComparison.OrdinalIgnoreCase and unmanaged code (including within the runtime) which uses CompareStringOrdinal on Windows. I cannot think offhand of any components which do this, but we need to be mindful that such a discrepancy might occur.
About this issue
- State: open
- Created 5 years ago
- Reactions: 10
- Comments: 47 (41 by maintainers)
We are talking to the Windows team, who are going to publish NuGet packages for ICU. These packages can be used on down-level Windows versions. Also, we are going to fall back to NLS if we can’t find ICU on the system. I am going to write a doc with more details and I’ll share it as soon as I have it ready.
@StachuDotNet the gist you pointed at was created to mimic the behavior of the casing when running on the .NET Framework and not to be used for .NET 5.0 or 6.0. The result given by the gist code is exactly the result when using ToUpperInvariant on the .NET Framework. So, what you are getting is the expected result.
Also, the Unicode Standard data file UnicodeData.txt does not show these characters as having cased forms.
Please note that casing operations in .NET are one-to-one mappings.
I just ran string.ToUpperInvariant over the whole Unicode character range to collect all cased characters, and then from this list I created the 8-4-4 table (to keep the table small) which can be used for the lookup. If you want to see how to generate such a table, you may look at https://gist.github.com/tarekgh/55dfaf0f44689738c3a6ca67941ccdc2#file-casefolding-cs-L87.
@talktovishal I’ll try to provide you the code in the next couple of days.
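For readers unfamiliar with the 8-4-4 table mentioned above, here is a minimal sketch of one way such a table could be built and queried. It assumes that "8-4-4" means splitting a 16-bit UTF-16 code unit into its top 8 bits, middle 4 bits, and low 4 bits, with identical blocks shared to save space. The class name, the delta encoding, and the constructor input are hypothetical; the linked gist may differ in its details.

```csharp
using System;
using System.Collections.Generic;

sealed class CasingTable844
{
    // Level 1: top 8 bits of the char -> level-2 block number.
    private readonly ushort[] _level1 = new ushort[256];
    // Level 2: 16 entries per block (middle 4 bits) -> level-3 block number.
    private readonly List<char> _level2 = new List<char>();
    // Level 3: 16 entries per block (low 4 bits) -> (upper - original) mod 65536.
    private readonly List<char> _level3 = new List<char>();

    // upperMap holds only the chars whose uppercase form differs from themselves.
    public CasingTable844(IReadOnlyDictionary<char, char> upperMap)
    {
        var l2Seen = new Dictionary<string, ushort>();
        var l3Seen = new Dictionary<string, ushort>();

        for (int hi = 0; hi < 256; hi++)
        {
            var l2 = new char[16];
            for (int mid = 0; mid < 16; mid++)
            {
                var l3 = new char[16];
                for (int lo = 0; lo < 16; lo++)
                {
                    char c = (char)((hi << 8) | (mid << 4) | lo);
                    char u = upperMap.TryGetValue(c, out char mapped) ? mapped : c;
                    // Store a delta so that unmapped ranges become all-zero blocks.
                    l3[lo] = unchecked((char)(u - c));
                }
                l2[mid] = (char)Intern(l3Seen, _level3, l3);
            }
            _level1[hi] = Intern(l2Seen, _level2, l2);
        }
    }

    // Appends the block to storage unless an identical block already exists,
    // and returns the block number either way.
    private static ushort Intern(Dictionary<string, ushort> seen, List<char> storage, char[] block)
    {
        string key = new string(block);
        if (!seen.TryGetValue(key, out ushort index))
        {
            index = (ushort)(storage.Count / 16);
            storage.AddRange(block);
            seen.Add(key, index);
        }
        return index;
    }

    public char ToUpper(char c)
    {
        int l2Block = _level1[c >> 8];
        int l3Block = _level2[l2Block * 16 + ((c >> 4) & 0xF)];
        char delta = _level3[l3Block * 16 + (c & 0xF)];
        return unchecked((char)(c + delta));
    }
}
```

The input dictionary could be filled by calling ToUpperInvariant on every char on the platform whose behavior you want to capture and keeping only the entries that change. Because most of the range has no case mapping, the all-zero blocks collapse into a single shared block, which is where the size saving comes from.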
@talktovishal I don’t think there is any way to guarantee the NETFX casing behavior when running on Linux, except by generating your own casing table on Windows and then doing the casing operation manually using that generated table. If you want, I can try to provide some code doing that, if this is a feasible option for you. How urgently do you need it?
GetSortKey doesn’t allow passing CompareOptions.OrdinalIgnoreCase. It throws.