runtime: [ARM64] Performance regression: Sorting arrays of primitive types

After running benchmarks for 3.1 vs 5.0 using “Ubuntu arm64 Qualcomm Machines” owned by the JIT Team, I’ve found a few regressions related to sorting.

It may be related to the fact that we moved the sorting implementation from native to managed code (cc @stephentoub).

@DrewScoggins is there any way to see the full historical data for ARM64?

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'System.Collections.Sort<Int32>.Array' 'System.Collections.Sort<Int32>.List'

System.Collections.Sort<Int32>.Array(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
|------- |-----:|-----:|------:|------------:|--------- |----------------- |---- |--------------- |------- |------- |
| Same | 4500.17 | 4250.27 | 1.06 | +0 | | Windows 10.0.19041.388 | X64 | AMD Ryzen 9 3900X | 3.1.6 | 5.0.20.41714 |
| Faster | 4129.79 | 3573.10 | 1.16 | +0 | several? | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 4197.36 | 3672.08 | 1.14 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 5247.31 | 4572.02 | 1.15 | +0 | multimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 6560.47 | 6483.01 | 1.01 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 3.1.6 | 5.0.20.40416 |
| Same | 3315.74 | 3404.51 | 0.97 | +0 | | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 4821.34 | 8602.70 | 0.56 | +0 | several? | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 3.1.6 | 5.0.20.41714 |
| Faster | 6214.65 | 3992.76 | 1.56 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6441.58 | 4187.34 | 1.54 | +0 | | manjaro | X64 | Intel Core i7-4771 CPU 3.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Same | 5831.01 | 5518.00 | 1.06 | +0 | bimodal | pop 20.04 | X64 | Intel Core i7-6600U CPU 2.60GHz (Skylake) | 3.1.6 | 5.0.20.41714 |
| Same | 5602.64 | 5486.66 | 1.02 | +0 | bimodal | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.41714 |
| Same | 6178.62 | 5918.08 | 1.04 | +0 | bimodal | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 16288.81 | 27067.03 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 16150.43 | 27030.75 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.7 | 5.0.20.41714 |
| Slower | 16139.02 | 26145.59 | 0.62 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 14011.07 | 17551.48 | 0.80 | +0 | | ubuntu 18.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Faster | 7375.30 | 4817.14 | 1.53 | +0 | several? | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7244.05 | 7548.09 | 0.96 | +0 | several? | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 7932.56 | 5120.47 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7060.74 | 4554.61 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7145.76 | 4761.73 | 1.50 | +0 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40203 |

System.Collections.Sort<Int32>.List(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
|------- |-----:|-----:|------:|------------:|--------- |----------------- |---- |--------------- |------- |------- |
| Faster | 4520.64 | 4071.51 | 1.11 | +0 | | Windows 10.0.19041.388 | X64 | AMD Ryzen 9 3900X | 3.1.6 | 5.0.20.41714 |
| Faster | 4332.89 | 3567.66 | 1.21 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6967.61 | 3518.82 | 1.98 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 5015.13 | 4448.40 | 1.13 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7293.88 | 7863.56 | 0.93 | +0 | several? | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 3.1.6 | 5.0.20.40416 |
| Same | 3339.08 | 3267.09 | 1.02 | +0 | | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 4806.28 | 8450.61 | 0.57 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 3.1.6 | 5.0.20.41714 |
| Faster | 6222.44 | 3938.96 | 1.58 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6146.14 | 4254.57 | 1.44 | +0 | | manjaro | X64 | Intel Core i7-4771 CPU 3.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Same | 5834.00 | 5493.04 | 1.06 | +0 | | pop 20.04 | X64 | Intel Core i7-6600U CPU 2.60GHz (Skylake) | 3.1.6 | 5.0.20.41714 |
| Same | 5571.07 | 6275.42 | 0.89 | +0 | several? | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.41714 |
| Same | 6217.53 | 6206.80 | 1.00 | +0 | multimodal | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 16281.08 | 27217.40 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 16329.85 | 27111.99 | 0.60 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.7 | 5.0.20.41714 |
| Slower | 16395.71 | 26771.27 | 0.61 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 7791.41 | 10397.92 | 0.75 | +0 | several? | ubuntu 18.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Faster | 7371.80 | 4824.34 | 1.53 | +0 | | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7002.96 | 7065.54 | 0.99 | +0 | several? | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 7976.12 | 5098.72 | 1.56 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7114.58 | 4576.65 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7157.62 | 4735.35 | 1.51 | +0 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40203 |
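
A note on reading the Ratio column in the two comparisons above: judging by the numbers, it appears to be Base divided by Diff (for example 16288.81 / 27067.03 ≈ 0.60), so values below 1 mean 5.0 took longer than 3.1. Below is a minimal Python sketch that recomputes it for a few rows copied from the Array comparison. The helper names `ratio` and `extra_time_percent` and the row labels are illustrative only, and this is not the comparison tool's own code.

```python
def ratio(base: float, diff: float) -> float:
    """Base / Diff rounded to two decimals (assumed convention of the Ratio column)."""
    return round(base / diff, 2)


def extra_time_percent(base: float, diff: float) -> float:
    """How much longer (in percent) the Diff run took compared to the Base run."""
    return (diff / base - 1.0) * 100.0


# (Base, Diff) pairs copied from the Array(Size: 512) rows; labels are informal.
rows = {
    "x64 AMD Ryzen 9 3900X, Windows": (4500.17, 4250.27),
    "arm64, ubuntu 16.04": (16288.81, 27067.03),
    "arm64, ubuntu 18.04": (14011.07, 17551.48),
}

for label, (base, diff) in rows.items():
    print(f"{label}: ratio={ratio(base, diff):.2f}, "
          f"extra time={extra_time_percent(base, diff):+.1f}%")
```

Read this way, a ratio of 0.60 on the ubuntu 16.04 arm64 machines means the 5.0 run took roughly two thirds longer than the 3.1 run.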

/cc @JulieLeeMSFT

category:cq theme:needs-triage skill-level:expert cost:large

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 28 (28 by maintainers)

Most upvoted comments

It looks like we regained the missing perf with #43870.

6.0 data now shows about 16 us, which is roughly where we were in the 3.1 timeframe.

The List case shows similar improvements.

Going to close this one as fixed.

Linux arm64. Probably Windows arm64 too, but we don’t track that.

I can repro this on an Rpi4:

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
Unknown processor
.NET Core SDK=5.0.100-rc.1.20422.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.42118, CoreFX 5.0.20.42118), Arm64 RyuJIT
  Job-HPEYTJ : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), Arm64 RyuJIT
  Job-LNVBZK : .NET Core 5.0.0 (CoreCLR 5.0.20.42118, CoreFX 5.0.20.42118), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  InvocationCount=5000  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
UnrollFactor=1  WarmupCount=1  

| Method |        Job |       Runtime |     Toolchain | Size |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |----------- |-------------- |-------------- |----- |---------:|---------:|---------:|---------:|---------:|---------:|------:|--------:|------:|------:|------:|----------:|
|  Array | Job-HPEYTJ | .NET Core 3.1 | netcoreapp3.1 |  512 | 26.97 us | 0.352 us | 0.329 us | 26.86 us | 26.43 us | 27.56 us |  1.00 |    0.00 |     - |     - |     - |         - |
|  Array | Job-LNVBZK | .NET Core 5.0 | netcoreapp5.0 |  512 | 35.61 us | 0.196 us | 0.184 us | 35.62 us | 35.29 us | 35.90 us |  1.32 |    0.02 |     - |     - |     - |         - |
|        |            |               |               |      |          |          |          |          |          |          |       |         |       |       |       |           |
|   List | Job-HPEYTJ | .NET Core 3.1 | netcoreapp3.1 |  512 | 28.54 us | 1.530 us | 1.762 us | 27.54 us | 27.12 us | 31.89 us |  1.00 |    0.00 |     - |     - |     - |         - |
|   List | Job-LNVBZK | .NET Core 5.0 | netcoreapp5.0 |  512 | 36.15 us | 0.417 us | 0.390 us | 36.20 us | 34.91 us | 36.51 us |  1.25 |    0.08 |     - |     - |     - |       1 B |

What’s the difference between the two tables? Is it just two different sets of runs?

Never mind – List vs Array.