runtime: Perf: InvariantCultureIgnoreCase 2x to 6x slower on Linux

While investigating why the Data Updates TechEmpower benchmark was much slower on Linux than on Windows, we identified that the usage of InvariantCultureIgnoreCase in the Npgsql driver was responsible.

The slowdown seems worse in ASP.NET (6 times slower) than in BenchmarkDotNet (2 times slower), where there are no concurrent calls.

Here is the code that was used in BDN and ASP.NET to repro the numbers:

[Benchmark]
public object TryAddInvariantCultureIgnoreCase()
{
    var data = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
    for (var i = 0; i < N; i++)
    {
        data.TryAdd("Id_", i);
    }

    return data;
}
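
The result tables below also list a TryAddOrdinalIgnoreCase baseline whose code is not shown. The following is a minimal, self-contained sketch for reproducing the comparison without BenchmarkDotNet, assuming the baseline differs only in the comparer passed to the dictionary. The class name `ComparerRepro`, the `Fill` and `MeasureNs` helpers, and the warm-up and iteration counts are illustrative choices, not part of the original benchmark, and a Stopwatch loop is far less rigorous than BDN.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class ComparerRepro
{
    // Same body as the benchmark above; only the comparer varies.
    public static Dictionary<string, int> Fill(StringComparer comparer, int n)
    {
        var data = new Dictionary<string, int>(comparer);
        for (var i = 0; i < n; i++)
        {
            // Every call hashes the key through the comparer,
            // even though only the first insert succeeds.
            data.TryAdd("Id_", i);
        }

        return data;
    }

    // Rough average time per Fill call in nanoseconds (no outlier handling).
    public static double MeasureNs(StringComparer comparer, int n, int iterations)
    {
        for (var i = 0; i < 1000; i++) Fill(comparer, n); // warm up the JIT

        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++) Fill(comparer, n);
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds * 1_000_000 / iterations;
    }

    public static void Main()
    {
        foreach (var n in new[] { 10, 50, 100 })
        {
            var ordinal = MeasureNs(StringComparer.OrdinalIgnoreCase, n, 100_000);
            var invariant = MeasureNs(StringComparer.InvariantCultureIgnoreCase, n, 100_000);
            Console.WriteLine($"N={n}: OrdinalIgnoreCase {ordinal:F1} ns, InvariantCultureIgnoreCase {invariant:F1} ns");
        }
    }
}
```

Running the same program on both operating systems gives a quick, if noisy, check of whether the gap is still present on a given runtime build.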

On ASP.NET, with 100 inserts, we have 80K RPS on Windows and 15K RPS on Linux.

The trace doesn’t go further than this

image

Linux


BenchmarkDotNet=v0.11.5, OS=debian 9
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host] : .NET Core ? (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26308), 64bit RyuJIT

Toolchain=InProcessEmitToolchain  IterationCount=3  LaunchCount=1
WarmupCount=3

|                           Method |   N |        Mean |      Error |    StdDev |
|--------------------------------- |---- |------------:|-----------:|----------:|
|          TryAddOrdinalIgnoreCase |  10 |    275.4 ns |   440.5 ns |  24.15 ns |
| TryAddInvariantCultureIgnoreCase |  10 |  3,442.1 ns |   484.8 ns |  26.57 ns |
|          TryAddOrdinalIgnoreCase |  50 |  1,238.5 ns |   301.0 ns |  16.50 ns |
| TryAddInvariantCultureIgnoreCase |  50 | 16,930.4 ns | 2,142.9 ns | 117.46 ns |
|          TryAddOrdinalIgnoreCase | 100 |  2,401.3 ns |   426.6 ns |  23.38 ns |
| TryAddInvariantCultureIgnoreCase | 100 | 33,979.7 ns | 6,911.2 ns | 378.83 ns |

Windows


BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.107 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host] : .NET Core 3.0.0-preview6-27713-13 (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26308), 64bit RyuJIT

Toolchain=InProcessEmitToolchain  IterationCount=3  LaunchCount=1
WarmupCount=3

|                           Method |   N |        Mean |      Error |     StdDev |
|--------------------------------- |---- |------------:|-----------:|-----------:|
|          TryAddOrdinalIgnoreCase |  10 |    239.3 ns |   2.354 ns |  0.1290 ns |
| TryAddInvariantCultureIgnoreCase |  10 |  1,694.5 ns |  23.130 ns |  1.2679 ns |
|          TryAddOrdinalIgnoreCase |  50 |  1,145.2 ns |   7.520 ns |  0.4122 ns |
| TryAddInvariantCultureIgnoreCase |  50 |  8,332.2 ns | 166.891 ns |  9.1479 ns |
|          TryAddOrdinalIgnoreCase | 100 |  2,241.1 ns | 379.392 ns | 20.7958 ns |
| TryAddInvariantCultureIgnoreCase | 100 | 16,191.5 ns |  12.769 ns |  0.6999 ns |
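
To relate the two result tables to the "2 times slower" figure, the mean times can be divided directly. The sketch below copies the means from the tables above (the class and field names are illustrative). For InvariantCultureIgnoreCase the Linux/Windows ratio comes out at roughly 2.0x to 2.1x, while for OrdinalIgnoreCase it stays at roughly 1.07x to 1.15x, which suggests the gap is specific to the culture-aware comparer.

```csharp
using System;

public static class BenchmarkRatios
{
    // Mean times in nanoseconds, copied from the tables above for N = 10, 50, 100.
    public static readonly double[] LinuxInvariant = { 3442.1, 16930.4, 33979.7 };
    public static readonly double[] WindowsInvariant = { 1694.5, 8332.2, 16191.5 };
    public static readonly double[] LinuxOrdinal = { 275.4, 1238.5, 2401.3 };
    public static readonly double[] WindowsOrdinal = { 239.3, 1145.2, 2241.1 };

    // How many times slower the Linux run is than the Windows run.
    public static double Ratio(double linuxNs, double windowsNs) => linuxNs / windowsNs;

    public static void Main()
    {
        int[] n = { 10, 50, 100 };
        for (var i = 0; i < n.Length; i++)
        {
            var invariant = Ratio(LinuxInvariant[i], WindowsInvariant[i]);
            var ordinal = Ratio(LinuxOrdinal[i], WindowsOrdinal[i]);
            Console.WriteLine($"N={n[i]}: InvariantCultureIgnoreCase {invariant:F2}x, OrdinalIgnoreCase {ordinal:F2}x");
        }
    }
}
```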

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 49 (48 by maintainers)

Most upvoted comments

Ok, I was able to use the same machine as @sebastienros and confirm that the problem is gone.

Changes

https://github.com/dotnet/coreclr/pull/24889 - I have reduced the number of calls to an expensive native API by half, which made StringComparer.InvariantCulture.GetHashCode() two times faster for strings shorter than 262 144 characters (the size of the biggest buffer rentable from ArrayPool.Shared without allocation, divided by 4).
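
As an illustration only, here is one way the number of native calls can be halved. The sketch assumes the original pattern was a length query followed by a second call that fills the buffer, and that the fix rents a buffer of 4 bytes per character up front so one call usually suffices; the PR itself is the authority on what was actually changed. `FakeGetSortKey` is a hypothetical managed stand-in for the native function, and its output is not a real sort key.

```csharp
using System;
using System.Buffers;

public static class SortKeySketch
{
    public static int NativeCalls;

    // Hypothetical stand-in for the expensive native call. It always returns
    // the required length and fills the buffer only when it is large enough.
    public static int FakeGetSortKey(string source, Span<byte> destination)
    {
        NativeCalls++;
        int required = source.Length * 2;
        if (destination.Length >= required)
        {
            for (int i = 0; i < source.Length; i++)
            {
                destination[2 * i] = (byte)(source[i] >> 8);
                destination[2 * i + 1] = (byte)source[i];
            }
        }

        return required;
    }

    // Simple FNV-1a hash over the key bytes.
    private static int Hash(ReadOnlySpan<byte> key)
    {
        uint hash = 2166136261;
        foreach (byte b in key) hash = (hash ^ b) * 16777619;
        return (int)hash;
    }

    // Assumed "before" shape: query the length first, then fill. Always two calls.
    public static int HashWithPreflight(string source)
    {
        int length = FakeGetSortKey(source, Span<byte>.Empty);
        byte[] rented = ArrayPool<byte>.Shared.Rent(length);
        FakeGetSortKey(source, rented);
        int hash = Hash(rented.AsSpan(0, length));
        ArrayPool<byte>.Shared.Return(rented);
        return hash;
    }

    // Assumed "after" shape: guess 4 bytes per char, retry only if the guess is too small.
    public static int HashWithGuess(string source)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(source.Length * 4);
        int length = FakeGetSortKey(source, rented);
        if (length > rented.Length)
        {
            ArrayPool<byte>.Shared.Return(rented);
            rented = ArrayPool<byte>.Shared.Rent(length);
            FakeGetSortKey(source, rented);
        }

        int hash = Hash(rented.AsSpan(0, length));
        ArrayPool<byte>.Shared.Return(rented);
        return hash;
    }

    public static void Main()
    {
        NativeCalls = 0;
        HashWithPreflight("Id_");
        Console.WriteLine($"preflight: {NativeCalls} calls");

        NativeCalls = 0;
        HashWithGuess("Id_");
        Console.WriteLine($"guess: {NativeCalls} calls");
    }
}
```

Under this assumption the 262 144 limit follows from the arithmetic: a 4-bytes-per-character buffer for a longer string would exceed the 1 048 576-byte arrays that the shared pool hands out without allocating.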

https://github.com/dotnet/coreclr/pull/24973 - I have removed the lock that was part of the native implementation of CompareString, IndexOf, LastIndexOf, StartsWith, EndsWith, and GetHashCode.

https://github.com/dotnet/coreclr/pull/25117 - I have removed the expensive ref counting of SafeSortHandle usages and replaced SafeSortHandle with a cached IntPtr (the number of cultures and sort options is always low, so it’s OK). The affected methods are CompareString, IndexOf, LastIndexOf, StartsWith, EndsWith, and GetHashCode.
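
The idea behind this change can be sketched as follows. Passing a SafeHandle to a P/Invoke makes the runtime increment and decrement its reference count around every call, and those interlocked operations on one shared handle become contended on many cores. Caching a raw IntPtr that is never released avoids that cost. This is a simplified illustration, not the code from the PR: `FakeOpenCollator` is a hypothetical stand-in for the native open call, and the cache key is assumed to be just the culture name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class SortHandleCacheSketch
{
    public static int NativeOpens;

    // One raw handle per culture name, kept for the lifetime of the process.
    // This is acceptable only because the number of distinct keys stays small.
    private static readonly ConcurrentDictionary<string, IntPtr> s_handles =
        new ConcurrentDictionary<string, IntPtr>(StringComparer.Ordinal);

    // Hypothetical stand-in for opening a native collator; returns a fake handle.
    private static IntPtr FakeOpenCollator(string cultureName)
    {
        int id = Interlocked.Increment(ref NativeOpens);
        return new IntPtr(id);
    }

    // Lookups return the cached IntPtr with no per-call ref counting.
    // Under a race GetOrAdd may run the factory more than once, but only
    // one result is stored; a real implementation would close the loser.
    public static IntPtr GetSortHandle(string cultureName) =>
        s_handles.GetOrAdd(cultureName, FakeOpenCollator);

    public static void Main()
    {
        IntPtr first = GetSortHandle("en-US");
        IntPtr second = GetSortHandle("en-US");
        Console.WriteLine($"same handle: {first == second}, native opens: {NativeOpens}");
    }
}
```

The trade-off is that the handles are intentionally never freed, which is why the bounded number of cultures and sort options matters.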

25117

The most recent change had the biggest impact on the beefy Citrine Physical Machines (14 Core(s), 28 Logical Processor(s))

before

RequestsPerSecond:           40 811
Max CPU (%):                 100
WorkingSet (MB):             496
Avg. Latency (ms):           6,63
Startup (ms):                298
First Request (ms):          102,54
Latency (ms):                0,86
Total Requests:              208 136
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012412
Runtime:                     3.0.0-preview7-27815-04
ASP.NET Core:                3.0.0-preview7.19315.2

after

RequestsPerSecond:           240 711
Max CPU (%):                 94
WorkingSet (MB):             523
Avg. Latency (ms):           1,77
Startup (ms):                338
First Request (ms):          121,31
Latency (ms):                1,04
Total Requests:              1 227 445
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0

Linux vs Windows after all changes

Linux vs Windows after all fixes for the ASP.NET perf lab machines

Local Physical servers (E5-1650 v3 @ 3.50 GHz w/ 15MB Cache 6 Cores / 12 Threads 32GB of RAM.)

110k RPS for Linux and 90k for Windows

Citrine Physical Machines (Intel® Xeon® Gold 5120 CPU @ 2.20GHz, 2195 Mhz, 14 Core(s), 28 Logical Processor(s) 32GB of RAM.)

240k RPS for Linux and 265k RPS for Windows

Azure VMs (D3_v2 4 virtual cores. 4GB of RAM)

28k RPS for both Linux and Windows

Summary

For the Azure VMs scenario both OSes are even; on the Local Physical servers Linux is now about 20% faster; on the Citrine Physical Machines Windows is still about 10% faster.

I have currently run out of ideas for how to close the 10% gap on the Citrine Physical Machines; the trace does not leave much room for optimization.

image

@sebastienros please close the issue after you verify that the problem is gone

big thanks to @jkotas and @tarekgh for all the help!

FYI for everyone: I added support for Event Counters to the benchmarking service (--collect-counters), and it is now tracked with every ASP.NET scenario. We’ll get continuous charts for these numbers too.

Sneak peek:

CPU Usage (%):               99
Working Set (MB):            526
GC Heap Size (MB):           281
Gen 1 GC (#/s):              1
Gen 2 GC (#/s):              0
Time in GC (%):              0
Gen 0 Size (B):              135,464
Gen 1 Size (B):              3,033,544
Gen 2 Size (B):              2,872,120
LOH Size (B):                1,278,288
Allocation Rate (B/sec):     227,329,432
# of Assemblies Loaded:      110
Exceptions (#/s):            985
ThreadPool Threads Count:    48
Lock Contention (#/s):       77
ThreadPool Queue Length:     222
ThreadPool Items (#/s):      89,701

I have used the recommendations provided in the ICU User Guide and made it two times faster: https://github.com/dotnet/coreclr/pull/24889

Before and after the changes from @adamsitnik

image

Yes, we can.

@sebastienros I verified that the fix works with the latest coreclr packages that were pushed to core-setup in the last 24h.

@sebastienros could you please take a look at the commands I ran? The results we get for Windows are very different: 271k vs 90k RPS.

In the meantime, I am going to take a look at the two hottest methods from the stack trace:

System.Private.CoreLib!System.Runtime.InteropServices.SafeHandle::InternalRelease(bool)[OptimizedTier1] 
System.Private.CoreLib!System.Runtime.InteropServices.SafeHandle::DangerousAddRef(bool&)[OptimizedTier1]

I would suggest measuring this scenario in isolation first, before jumping into trying to optimize it.

When I wrote “From a quick look at the profiles” I meant that I have already profiled that.

Anyway, thanks for the hint. The more people repeat “measure, measure, measure” the better!

Tagging @adamsitnik as Dan assigned him to take a look