runtime: List of performance regressions caused by switching to ICU
Before 5.0, we were using ICU only on Unix systems. In 5.0 we have decided to use it on Windows by default as well.
This is something that we have done in order to have the same behavior of string-related globalization APIs on every OS.
However, this particular change has affected the performance characteristics of many frequently used methods. Some of them have regressed, some have improved.
Recently we have reported a lot of 5.0 regressions related to that. Since we have done this on purpose and we are most probably not going to revert the switch, I am opening this issue to track the list of all known regressions. When the list becomes complete, we are most probably going to update the 5.0 release docs and close the issue and label it as wont fix.
Please feel free to edit the list.
Known changes:
- System.Memory.ReadOnlySpan.IndexOfString 5 regressions, 5 improvements
- System.Globalization.Tests.StringHash 2 regressions, 4 improvements
- System.Globalization.Tests.StringEquality: 8 regressions, 14 improvements
- System.Globalization.Tests.StringSearch: 30 regressions, 32 improvements
- System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse: 1 regression, 5 improvements
cc @danmosemsft @tarekgh @billwert @DrewScoggins @GrabYourPitchforks @jkotas @safern
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 8
- Comments: 69 (65 by maintainers)
After the most recent check in (https://github.com/dotnet/runtime/pull/43065) that Tarek made, we are seeing some good improvements in the lab.
Run Information
Regressions in System.Globalization.Tests.StringSearch
Repro
Histogram
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, None, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (, None, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (pl-PL, None, False))
System.Globalization.Tests.StringSearch.IsSuffix_SecondHalf(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IsPrefix_FirstHalf(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IsPrefix_DifferentFirstChar(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (pl-PL, None, False))
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, IgnoreCase, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, IgnoreCase, True))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, None, True))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, IgnoreCase, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, None, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (, IgnoreCase, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, IgnoreSymbols, False))
Docs
Profiling workflow for dotnet/runtime repository Benchmarking workflow for dotnet/runtime repository
To tell, @jefgen thankfully is in doing some optimization work from the ICU side
https://github.com/unicode-org/icu/pull/1471 https://github.com/unicode-org/icu/pull/1473
You can see the perf numbers in the ticket https://unicode-org.atlassian.net/browse/ICU-21388
Thanks @jefgen for the details.
@L2 you still have the option to use the ICU app-local feature to use the latest ICU version if you want.
@tarekgh I’ve run the benchmark that you have provided and got similar results that confirm that 5.0 using ICU is faster in this particular case.
To compare apples to apples I’ve set the invocation count (the number of benchmark invocations per iteration) to the same number and filtered the ETW trace file to the last benchmark iteration (description of the mentioned filtering):
For 3.1 a single iteration (2097152 invocations) takes 791ms:
For 5.0 a single iteration (2097152 invocations) takes 147ms:
I see. Many thanks!
I think it makes no sense because because more and more users work in a heterogeneous environment and with ICU there is less chance of getting different results.
Questions like why the performance/behavior is different on Windows and on Unix is very inconvenient. Questions like this sometimes appear in PowerShell repo (not related to the topic). With moving to ICU the likelihood of such questions decreases and this is great. PowerShell is highly dependent on OrdinalIgnoreCase and any improvements here have a positive impact on it. This is my only concern.
Thanks again for your great work!
Also, I want to be clear about the expectation here. I don’t think we can fix all perf here as we are limited by calling ICU. we can look at how we can improve it but I am not expecting to get the perf to the point where we used to call NLS. So, it will be good to decide which items in this list is a blocker. The only one was the Ordinal cases which I am addressing in the attached PR. I am not aware of any other blocking scenario. We’ll look more of course on other scenarios anyway but I am not sure how much we can do before 5.0 release.
Just wanted to update this thread with the auto-filed results showing big wins (70%) in the System.Memory.ReadOnlySpan benchmarks that regressed.
https://github.com/DrewScoggins/performance-2/issues/1392
@GrabYourPitchforks @GSPP just to let you know, I am looking at the linguistic IndexOf scenario and experimenting some changes that may help in the perf (around internally caching some ICU objects too). no promise yet as I am still in the middle of looking at that.
Also, we are following up with Windows team as they are trying to do some perf enhancement on ICU too. It is another win situation.
Last, as ICU is open source project, we have contributed some changes before which means it is possible we’ll contribute more if it is really required to enhance .NET scenarios. but in general this something we may look at for 6.0 version and beyond.
As a side note, ICU is open source and we can contribute to make hot paths faster
The story is unfortunately a bit complicated… The version of ICU that is part of the Windows OneCore base was updated to a newer version, version 68.2. However, Windows 10 is still based on the older OneCore release, so this means that Windows 10 didn’t get the updated version of ICU. The Windows 11 release was built on the newer OneCore bits, so it got the updated version of ICU. (You can see that the Windows 10 build number is still ~1904x, so the last few releases are all based on the same underlying OneCore bits).
What this means is that Windows 10 is still on ICU version 64.2, while Windows 11 is on ICU version 68.2.
I don’t think there are currently any plans to push an updated version of ICU via Windows update at this time.
AFAIK, the OS binaries aren’t built using the public compiler. However, I’m not sure off-hand what exact version of the
cl.execompiler is used though.@jefgen very nice improvements! did Windows Team consider using PGO (Profile Guided Optimization) to improve ICU perf?
For who don’t know what is PGO, it is Profile-guided optimization.
@DrewScoggins I have merged today the other optimization work which targeting string search operations (IndexOf/LastIndexOf/IsPrefix/IsSuffix/StartsWith/EndsWith). could you please watch the perf results after running my changes and update this issue? Thanks a lot.
@iSazonov I tried your scenario (using .NET without PS) and I am seeing the similar results as you have reported it. This is very interesting. I am going to look more on the details to understand what is going on.
@GSPP I don’t think we have extensively studied what contributions we can make here. Regardless, ICU and NLS have different operational philosophies when it comes to this, and ICU’s consumers are absolutely used to caching the search object. The canonical scenario for linguistic searching in ICU is for a UI-based application. You open a browser window or word document, enter your search term, then see all matches highlighted in the document and use Next / Previous to iterate through them.
In our case, we case about non-linguistic (ordinal) searching. If this were to be contributed back to ICU it would be in the form of a brand new API. Furthermore, the type of ordinal comparison we’re using here (conversion to uppercase) is different than Unicode’s own recommendations (conversion to case-fold). All of this is to say that I’m not hopeful of a specialty API like the one .NET is using making its way through.
Would it be in the cards to contribute to ICU so that this operation is not by-design slow? I imagine that many projects relying on ICU would like a fast
IndexOfthat does require caching a searcher object.@tarekgh your change is good. I was just curious why Linux and Windows perf were still significantly different but I realized I compared the wrong rows.
A day before PowerShell MSFT team tried to move to .Net 5.0 Preview8 but without success. I guess I can do some measurements only after we move to RC1.
/cc @SteveL-MSFT @daxian-dbw for information.
Also note that this is already what is used on Linux and Mac, and increasingly used within Windows OS itself. So we are aligning with the industry here. If and where it is slow - we all benefit from making it faster. The .NET team have contributed bug reports and performance improvements to libicu in the past, and I expect we will do so again.
@tarekgh I’ve gone through all
System.MemoryandSystem.Globalizationissues withperformancetag and updated the list. It should be complete now.