runtime: Perf: InvariantCultureIgnoreCase 2x to 6x slower on Linux
While investigating why the Data Updates TechEmpower benchmark was much slower on Linux than on Windows, we identified that the usage of InvariantCultureIgnoreCase in the Npgsql driver was responsible.
The slowdown seems worse on ASP.NET (6 times slower) than in BenchmarkDotNet (BDN, 2 times slower), where there are no concurrent calls.
Here is the code that was used in BDN and ASP.NET to repro the numbers:
[Benchmark]
public object TryAddInvariantCultureIgnoreCase()
{
    var data = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
    for (var i = 0; i < N; i++)
    {
        data.TryAdd("Id_", i);
    }
    return data;
}
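The result tables also list a TryAddOrdinalIgnoreCase method whose code is not shown. A rough way to compare the two comparers without BenchmarkDotNet is sketched below. It is only a sketch: the class name ComparerTiming, the Stopwatch-based timing loop, and the distinct keys ("Id_0", "Id_1", ...) are assumptions made for illustration, and the timings it prints are far less reliable than BDN results.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class ComparerTiming
{
    // Fills a dictionary with n keys using the given comparer.
    // Distinct keys are an assumption of this sketch, not taken from the issue.
    public static Dictionary<string, int> Fill(StringComparer comparer, int n)
    {
        var data = new Dictionary<string, int>(comparer);
        for (var i = 0; i < n; i++)
        {
            data.TryAdd("Id_" + i, i);
        }
        return data;
    }

    // Returns the wall-clock time taken by `iterations` fills of n keys each.
    public static TimeSpan Measure(StringComparer comparer, int n, int iterations)
    {
        Fill(comparer, n); // warm-up so JIT compilation is not measured
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            Fill(comparer, n);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        const int n = 100, iterations = 10_000;
        Console.WriteLine("OrdinalIgnoreCase:          " +
            Measure(StringComparer.OrdinalIgnoreCase, n, iterations));
        Console.WriteLine("InvariantCultureIgnoreCase: " +
            Measure(StringComparer.InvariantCultureIgnoreCase, n, iterations));
    }
}
```

Running the same program on Linux and on Windows should show whether the culture-aware comparer is the part that differs between the two.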
On ASP.NET, with 100 inserts, we have 80K RPS on Windows and 15K RPS on Linux.
The trace doesn’t go further than this
Linux
BenchmarkDotNet=v0.11.5, OS=debian 9
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
[Host] : .NET Core ? (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26308), 64bit RyuJIT
Toolchain=InProcessEmitToolchain IterationCount=3 LaunchCount=1
WarmupCount=3
Method | N | Mean | Error | StdDev |
---|---|---|---|---|
TryAddOrdinalIgnoreCase | 10 | 275.4 ns | 440.5 ns | 24.15 ns |
TryAddInvariantCultureIgnoreCase | 10 | 3,442.1 ns | 484.8 ns | 26.57 ns |
TryAddOrdinalIgnoreCase | 50 | 1,238.5 ns | 301.0 ns | 16.50 ns |
TryAddInvariantCultureIgnoreCase | 50 | 16,930.4 ns | 2,142.9 ns | 117.46 ns |
TryAddOrdinalIgnoreCase | 100 | 2,401.3 ns | 426.6 ns | 23.38 ns |
TryAddInvariantCultureIgnoreCase | 100 | 33,979.7 ns | 6,911.2 ns | 378.83 ns |
Windows
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.107 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
[Host] : .NET Core 3.0.0-preview6-27713-13 (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26308), 64bit RyuJIT
Toolchain=InProcessEmitToolchain IterationCount=3 LaunchCount=1
WarmupCount=3
Method | N | Mean | Error | StdDev |
---|---|---|---|---|
TryAddOrdinalIgnoreCase | 10 | 239.3 ns | 2.354 ns | 0.1290 ns |
TryAddInvariantCultureIgnoreCase | 10 | 1,694.5 ns | 23.130 ns | 1.2679 ns |
TryAddOrdinalIgnoreCase | 50 | 1,145.2 ns | 7.520 ns | 0.4122 ns |
TryAddInvariantCultureIgnoreCase | 50 | 8,332.2 ns | 166.891 ns | 9.1479 ns |
TryAddOrdinalIgnoreCase | 100 | 2,241.1 ns | 379.392 ns | 20.7958 ns |
TryAddInvariantCultureIgnoreCase | 100 | 16,191.5 ns | 12.769 ns | 0.6999 ns |
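To make the "2 times slower" figure concrete, the Linux-to-Windows ratio can be computed from the TryAddInvariantCultureIgnoreCase means in the two tables. The snippet below is a small worked calculation; the class name Slowdown is made up for the example, and the numbers are copied from the tables.

```csharp
using System;
using System.Globalization;

public static class Slowdown
{
    // Ratio of the Linux mean to the Windows mean; values above 1 mean Linux is slower.
    public static double Ratio(double linuxNs, double windowsNs) => linuxNs / windowsNs;

    public static void Main()
    {
        // Mean times in ns for TryAddInvariantCultureIgnoreCase, from the tables.
        var rows = new (int N, double Linux, double Windows)[]
        {
            (10, 3442.1, 1694.5),
            (50, 16930.4, 8332.2),
            (100, 33979.7, 16191.5),
        };

        foreach (var (n, linux, windows) in rows)
        {
            string ratio = Ratio(linux, windows).ToString("F2", CultureInfo.InvariantCulture);
            Console.WriteLine("N=" + n + ": Linux/Windows = " + ratio);
        }
    }
}
```

The three ratios come out at roughly 2.03, 2.03 and 2.10, so the single-threaded gap is close to 2x at every N.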
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 49 (48 by maintainers)
Ok, I was able to use the same machine as @sebastienros and confirm that the problem is gone.
Changes
- https://github.com/dotnet/coreclr/pull/24889 - I have reduced the number of calls to an expensive native API by half, which made StringComparison.InvariantCulture.GetHashCode() two times faster for strings shorter than 262 144 characters (the biggest buffer rentable from ArrayPool.Shared without allocation / 4)
- https://github.com/dotnet/coreclr/pull/24973 - I have removed the lock that was part of native implementation of CompareString, IndexOf, LastIndexOf, StartsWith, EndsWith, GetHashCode
- https://github.com/dotnet/coreclr/pull/25117 - I have removed the expensive ref counting of SafeSortHandle usages and replaced SafeSortHandle with a cached IntPtr (the number of cultures and sort options is always low so it’s ok), the affected methods are CompareString, IndexOf, LastIndexOf, StartsWith, EndsWith, GetHashCode
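The first change can be illustrated with a sketch of the general pattern: instead of calling a native API once to learn the required size and again to fill the buffer, rent a generously sized buffer up front and call it once. This is not the code from the pull request. FakeGetSortKey is a hypothetical stand-in for the native call, its key format is made up, and the 4-bytes-per-character guess is only inferred from the "/ 4" in the description above (262 144 × 4 = 1 048 576 bytes).

```csharp
using System;
using System.Buffers;

public static class SortKeySketch
{
    // Counts calls to the stand-in "native" function so the saving is observable.
    public static int NativeCalls;

    // Hypothetical stand-in for an expensive native sort-key API. It always returns
    // the number of bytes the full key needs and writes the key only if it fits.
    static int FakeGetSortKey(string source, Span<byte> destination)
    {
        NativeCalls++;
        int needed = source.Length * 2; // made-up key size, for illustration only
        if (needed <= destination.Length)
        {
            for (int i = 0; i < source.Length; i++)
            {
                char c = char.ToUpperInvariant(source[i]); // crude case folding
                destination[2 * i] = (byte)(c >> 8);
                destination[2 * i + 1] = (byte)c;
            }
        }
        return needed;
    }

    // Two native calls: one to ask for the size, one to fill an exactly sized buffer.
    public static int HashTwoCalls(string source)
    {
        int needed = FakeGetSortKey(source, Span<byte>.Empty);
        byte[] key = new byte[needed];
        FakeGetSortKey(source, key);
        return Hash(key);
    }

    // One native call in the common case: rent a buffer sized at 4 bytes per char.
    public static int HashOneCall(string source)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(Math.Max(1, source.Length * 4));
        try
        {
            int needed = FakeGetSortKey(source, rented);
            if (needed > rented.Length)
            {
                // The guess was too small: fall back to an exactly sized buffer.
                byte[] exact = new byte[needed];
                FakeGetSortKey(source, exact);
                return Hash(exact);
            }
            return Hash(rented.AsSpan(0, needed));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // FNV-1a over the key bytes; any stable hash would do for this sketch.
    static int Hash(ReadOnlySpan<byte> bytes)
    {
        unchecked
        {
            uint h = 2166136261;
            foreach (byte b in bytes)
            {
                h = (h ^ b) * 16777619;
            }
            return (int)h;
        }
    }

    public static void Main()
    {
        NativeCalls = 0;
        int a = HashTwoCalls("Id_1");
        Console.WriteLine("two-call path: " + NativeCalls + " native calls");

        NativeCalls = 0;
        int b = HashOneCall("Id_1");
        Console.WriteLine("one-call path: " + NativeCalls + " native calls, same hash: " + (a == b));
    }
}
```

Both paths produce the same hash, but the rented-buffer path reaches the stand-in native function once instead of twice, which is the kind of halving the first bullet describes.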
25117
The most recent change had the biggest impact on the beefy Citrine Physical Machines (14 Core(s), 28 Logical Processor(s))
before
after
Linux vs Windows after all changes
Linux vs Windows after all fixes for the ASP.NET perf lab machines
Local Physical servers (E5-1650 v3 @ 3.50 GHz w/ 15MB Cache 6 Cores / 12 Threads 32GB of RAM.)
110k RPS for Linux and 90k for Windows
Citrine Physical Machines (Intel® Xeon® Gold 5120 CPU @ 2.20GHz, 2195 MHz, 14 Core(s), 28 Logical Processor(s) 32GB of RAM.)
240k RPS for Linux and 265k RPS for Windows
Azure VMs (D3_v2 4 virtual cores. 4GB of RAM)
28k RPS for both Linux and Windows
Summary
For the Azure VMs scenario, both OSes are even; on the Local Physical servers, Linux is now about 20% faster; on the Citrine Physical Machines, Windows is still about 10% faster.
I have run out of ideas for how to close the 10% gap on the Citrine Physical Machines; the trace does not leave much room for optimizations.
@sebastienros please close the issue after you verify that the problem is gone
big thanks to @jkotas and @tarekgh for all the help!
FYI for everyone, I added support for Event Counters to the benchmarking service (--collect-counters), and it is now tracked with every ASP.NET scenario. We’ll get continuous charts for these numbers too.
Sneak peek:
I have used the recommendations provided in the ICU User Guide and made it two times faster https://github.com/dotnet/coreclr/pull/24889
Before and after the changes from @adamsitnik
Yes, we can.
@sebastienros I verified that the fix works with the latest coreclr packages that were pushed to core-setup in the last 24h.
@sebastienros could you please take a look at the commands I run? The results we get for Windows are very different: 271k vs 90k RPS.
In the meantime, I am going to take a look at the two hottest methods from the stack trace:
When I wrote “From a quick look at the profiles” I meant that I have already profiled that.
Anyway, thanks for the hint. The more people repeat “measure, measure, measure” the better!
Tagging @adamsitnik as Dan assigned him to take a look