runtime: [ARM64] Performance regression: Utf8Encoding

After running benchmarks for 3.1 vs 5.0 using “Ubuntu arm64 Qualcomm Machines” owned by the JIT Team, I’ve found a few regressions related to Utf8Encoding. They are all reproducible, and I’ve verified that it’s not a matter of loop alignment (by running them with --envVars COMPlus_JitAlignLoops:1).

It looks like an ARM64-specific regression; I was not able to reproduce it on ARM (the 32-bit variant).

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter Perf_Utf8Encoding

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-VTSQOV : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-RAMSQZ : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

| Method | Runtime | Input | Mean | Ratio | Allocated |
|---|---|---|---|---|---|
| GetByteCount | .NET Core 3.1 | EnglishAllAscii | 38.00 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | EnglishAllAscii | 40.66 us | 1.07 | - |
| GetBytes | .NET Core 3.1 | EnglishAllAscii | 101.09 us | 1.00 | 163840 B |
| GetBytes | .NET Core 5.0 | EnglishAllAscii | 104.96 us | 1.04 | 163855 B |
| GetString | .NET Core 3.1 | EnglishAllAscii | 103.47 us | 1.00 | 327648 B |
| GetString | .NET Core 5.0 | EnglishAllAscii | 95.76 us | 0.93 | 327677 B |
| GetByteCount | .NET Core 3.1 | EnglishMostlyAscii | 117.50 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | EnglishMostlyAscii | 221.40 us | 1.88 | - |
| GetBytes | .NET Core 3.1 | EnglishMostlyAscii | 273.49 us | 1.00 | 169880 B |
| GetBytes | .NET Core 5.0 | EnglishMostlyAscii | 377.67 us | 1.38 | 169895 B |
| GetString | .NET Core 3.1 | EnglishMostlyAscii | 262.55 us | 1.00 | 327656 B |
| GetString | .NET Core 5.0 | EnglishMostlyAscii | 250.18 us | 0.95 | 327685 B |
| GetByteCount | .NET Core 3.1 | Chinese | 53.34 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | Chinese | 90.21 us | 1.69 | - |
| GetBytes | .NET Core 3.1 | Chinese | 245.94 us | 1.00 | 177752 B |
| GetBytes | .NET Core 5.0 | Chinese | 279.62 us | 1.14 | 177768 B |
| GetString | .NET Core 3.1 | Chinese | 373.80 us | 1.00 | 150112 B |
| GetString | .NET Core 5.0 | Chinese | 358.11 us | 0.96 | 150126 B |
| GetByteCount | .NET Core 3.1 | Cyrillic | 45.35 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | Cyrillic | 76.01 us | 1.68 | - |
| GetBytes | .NET Core 3.1 | Cyrillic | 193.34 us | 1.00 | 100880 B |
| GetBytes | .NET Core 5.0 | Cyrillic | 222.10 us | 1.15 | 100889 B |
| GetString | .NET Core 3.1 | Cyrillic | 262.69 us | 1.00 | 130856 B |
| GetString | .NET Core 5.0 | Cyrillic | 259.83 us | 0.99 | 130868 B |
| GetByteCount | .NET Core 3.1 | Greek | 58.36 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | Greek | 97.41 us | 1.67 | - |
| GetBytes | .NET Core 3.1 | Greek | 275.88 us | 1.00 | 129248 B |
| GetBytes | .NET Core 5.0 | Greek | 314.00 us | 1.14 | 129260 B |
| GetString | .NET Core 3.1 | Greek | 394.55 us | 1.00 | 164264 B |
| GetString | .NET Core 5.0 | Greek | 394.35 us | 1.00 | 164278 B |

Docs

  • Profiling workflow for dotnet/runtime repository
  • Benchmarking workflow for dotnet/runtime repository

cc @kunalspathak @carlossanlop @pgovind @tannergooding

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 37 (37 by maintainers)

Most upvoted comments

If we really need to do something in 5.0 and we’re running out of runway then the absolute safest thing to do would be to change the one line:

https://github.com/dotnet/runtime/blob/7d0d37001c5a0ac1f343cd35b27fea0ebf7e8101/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs#L82

To:

if (Sse2.IsSupported)

This will cause the UTF8Encoding.GetByteCount(string) method to fall back to the existing Vector<T>-based code paths as they existed in 3.1 instead of using the new intrinsics that were introduced in 5.0.

I assume that if we want to do this then we’d schedule a “proper” fix to come in during 5.0.1.

@echesakovMSFT I’ll need to take a closer look, but as an initial assessment, temp += (temp >> 32); should be slightly better if you are generating an ADD with a shifted register (that is, a single instruction rather than a separate add + shift).

That said, looking at the algorithm, do you really need the reduction inside the loop? The value seems to be only a counter. So can’t you instead keep the value as a Vector128<uint> during the loop, then perform the final addp, move to the general-purpose register side after the loop, and add the result to tempUtf8CodeUnitCountAdjustment?

I think we should look at the function as a whole instead of piecewise.

For instance, since the only operations performed on the popcnt result are add and sub, there’s no need to transfer between register files in the loop.

+        private static Vector64<uint> CountNonAsciiBytes(Vector128<byte> vec)

and using AddScalar instead during the loop avoids the transfer as we can do scalar arithmetic on the SIMD side.
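
The idea of keeping the running count in a vector and reducing it once after the loop can be modelled in scalar Python. This is only a sketch under stated assumptions: the four-element lists stand in for the lanes of a Vector128<uint>, the names extra_bytes_per_lane, count_reduce_each_iteration, and count_reduce_once are hypothetical and not runtime APIs, and the input is restricted to BMP text without surrogates. It shows only that deferring the horizontal add leaves the result unchanged, not how the runtime implements it.

```python
LANES = 4  # stands in for the four 32-bit lanes of a 128-bit vector


def blocks(text, size=8):
    """Yield the code points of `text` in fixed-size blocks (8 per 'vector')."""
    cps = [ord(ch) for ch in text]
    for start in range(0, len(cps), size):
        yield cps[start:start + size]


def extra_bytes_per_lane(block):
    """Per-lane count of extra UTF-8 bytes needed by one block.

    A BMP code point needs one extra byte if it is >= 0x80 and a second
    extra byte if it is >= 0x800 (surrogates are out of scope here).
    """
    lanes = [0] * LANES
    for i, cp in enumerate(block):
        lanes[i % LANES] += int(cp >= 0x80) + int(cp >= 0x800)
    return lanes


def count_reduce_each_iteration(text):
    """Horizontal add inside the loop: one reduction per block."""
    total = len(text)
    for block in blocks(text):
        total += sum(extra_bytes_per_lane(block))
    return total


def count_reduce_once(text):
    """Lane-wise adds inside the loop, single horizontal add afterwards."""
    acc = [0] * LANES
    for block in blocks(text):
        lanes = extra_bytes_per_lane(block)
        acc = [a + b for a, b in zip(acc, lanes)]
    return len(text) + sum(acc)


sample = "Hello, Привет, Γειά, 你好"
expected = len(sample.encode("utf-8"))
assert count_reduce_each_iteration(sample) == expected
assert count_reduce_once(sample) == expected
```

Both variants agree with Python's own UTF-8 encoder on the sample. The second performs only element-wise additions per iteration, which is the property the comment argues would avoid moving data between register files inside the loop.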

@jeffhandley I am assigning this to you now.

The measurements below were taken on:

processor       : 0
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 1
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 2
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 3
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

.NET Core 3.1.6


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT
  Job-VJGWPE : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 3.1  Arguments=/p:DebugType=portable
Toolchain=netcoreapp3.1  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 118.9 μs | 0.24 μs | 0.21 μs | 118.9 μs | 118.6 μs | 119.4 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 268.4 μs | 2.14 μs | 1.78 μs | 267.6 μs | 266.4 μs | 272.4 μs | 49.7881 | 49.7881 | 49.7881 | 163840 B |
| GetString | EnglishAllAscii | 322.4 μs | 1.78 μs | 1.49 μs | 321.7 μs | 321.5 μs | 326.5 μs | 99.4898 | 99.4898 | 99.4898 | 327648 B |
| GetByteCount | EnglishMostlyAscii | 328.3 μs | 0.43 μs | 0.39 μs | 328.1 μs | 328.0 μs | 329.3 μs | - | - | - | - |
| GetBytes | EnglishMostlyAscii | 680.3 μs | 0.80 μs | 0.62 μs | 680.3 μs | 679.3 μs | 681.5 μs | 51.6304 | 51.6304 | 51.6304 | 169880 B |
| GetString | EnglishMostlyAscii | 591.1 μs | 2.17 μs | 1.92 μs | 590.2 μs | 588.7 μs | 594.6 μs | 99.5370 | 99.5370 | 99.5370 | 327656 B |
| GetByteCount | Chinese | 149.9 μs | 0.29 μs | 0.26 μs | 149.9 μs | 149.7 μs | 150.5 μs | - | - | - | - |
| GetBytes | Chinese | 646.8 μs | 1.21 μs | 0.95 μs | 647.0 μs | 645.1 μs | 647.9 μs | 55.0000 | 55.0000 | 55.0000 | 177752 B |
| GetString | Chinese | 943.7 μs | 2.97 μs | 2.63 μs | 943.3 μs | 940.9 μs | 950.1 μs | 44.1176 | 44.1176 | 44.1176 | 150112 B |
| GetByteCount | Cyrillic | 130.3 μs | 0.21 μs | 0.19 μs | 130.4 μs | 130.0 μs | 130.7 μs | - | - | - | - |
| GetBytes | Cyrillic | 487.0 μs | 1.30 μs | 1.08 μs | 486.8 μs | 485.2 μs | 489.2 μs | 29.2969 | 29.2969 | 29.2969 | 100880 B |
| GetString | Cyrillic | 648.8 μs | 1.74 μs | 1.45 μs | 649.5 μs | 646.6 μs | 650.9 μs | 39.0625 | 39.0625 | 39.0625 | 130856 B |
| GetByteCount | Greek | 163.7 μs | 0.08 μs | 0.07 μs | 163.7 μs | 163.6 μs | 163.8 μs | - | - | - | - |
| GetBytes | Greek | 723.1 μs | 3.83 μs | 3.58 μs | 721.8 μs | 718.7 μs | 728.8 μs | 39.7727 | 39.7727 | 39.7727 | 129248 B |
| GetString | Greek | 968.3 μs | 8.42 μs | 7.47 μs | 965.3 μs | 960.8 μs | 980.7 μs | 47.7941 | 47.7941 | 47.7941 | 164264 B |

.NET Core 5.0.0


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-OUMKUS : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 109.6 μs | 0.51 μs | 0.47 μs | 109.3 μs | 109.3 μs | 110.6 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 261.1 μs | 2.71 μs | 2.53 μs | 259.9 μs | 258.7 μs | 266.3 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
| GetString | EnglishAllAscii | 205.6 μs | 0.58 μs | 0.46 μs | 205.6 μs | 204.8 μs | 206.3 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 565.9 μs | 0.93 μs | 0.78 μs | 565.9 μs | 564.2 μs | 567.2 μs | - | - | - | 1 B |
| GetBytes | EnglishMostlyAscii | 912.2 μs | 1.65 μs | 1.29 μs | 912.0 μs | 910.3 μs | 914.8 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
| GetString | EnglishMostlyAscii | 574.2 μs | 1.69 μs | 1.32 μs | 573.6 μs | 572.5 μs | 576.9 μs | 99.5370 | 99.5370 | 99.5370 | 327685 B |
| GetByteCount | Chinese | 258.3 μs | 0.83 μs | 0.77 μs | 257.9 μs | 257.5 μs | 259.8 μs | - | - | - | - |
| GetBytes | Chinese | 749.3 μs | 3.55 μs | 3.14 μs | 747.9 μs | 746.7 μs | 756.1 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
| GetString | Chinese | 896.2 μs | 9.65 μs | 9.03 μs | 891.6 μs | 889.5 μs | 914.0 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
| GetByteCount | Cyrillic | 223.8 μs | 0.46 μs | 0.39 μs | 223.6 μs | 223.4 μs | 224.8 μs | - | - | - | - |
| GetBytes | Cyrillic | 592.1 μs | 4.66 μs | 4.13 μs | 590.1 μs | 588.8 μs | 602.4 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
| GetString | Cyrillic | 630.9 μs | 1.32 μs | 1.17 μs | 630.6 μs | 629.6 μs | 633.7 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
| GetByteCount | Greek | 281.7 μs | 0.35 μs | 0.31 μs | 281.8 μs | 281.2 μs | 282.3 μs | - | - | - | - |
| GetBytes | Greek | 844.0 μs | 2.50 μs | 2.09 μs | 843.5 μs | 841.8 μs | 848.9 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
| GetString | Greek | 951.6 μs | 10.59 μs | 9.91 μs | 949.3 μs | 941.3 μs | 971.0 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |

.NET Core 5.0.0 (with the suggested change to Utf16Utility.GetPointerToFirstInvalidChar)


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-XEOYJB : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 109.2 μs | 0.46 μs | 0.41 μs | 109.0 μs | 108.8 μs | 110.0 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 260.3 μs | 2.08 μs | 1.74 μs | 259.7 μs | 259.0 μs | 265.2 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
| GetString | EnglishAllAscii | 207.3 μs | 1.23 μs | 1.03 μs | 206.9 μs | 206.3 μs | 209.7 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 414.5 μs | 0.49 μs | 0.41 μs | 414.3 μs | 414.3 μs | 415.6 μs | - | - | - | - |
| GetBytes | EnglishMostlyAscii | 753.2 μs | 3.51 μs | 2.93 μs | 752.4 μs | 750.2 μs | 759.9 μs | 50.5952 | 50.5952 | 50.5952 | 169895 B |
| GetString | EnglishMostlyAscii | 574.0 μs | 8.49 μs | 7.94 μs | 570.8 μs | 566.9 μs | 590.0 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
| GetByteCount | Chinese | 189.1 μs | 0.16 μs | 0.14 μs | 189.1 μs | 189.0 μs | 189.5 μs | - | - | - | - |
| GetBytes | Chinese | 675.3 μs | 1.07 μs | 0.89 μs | 675.1 μs | 673.9 μs | 677.0 μs | 54.3478 | 54.3478 | 54.3478 | 177768 B |
| GetString | Chinese | 895.7 μs | 6.66 μs | 6.23 μs | 892.0 μs | 889.4 μs | 904.7 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
| GetByteCount | Cyrillic | 164.1 μs | 0.11 μs | 0.09 μs | 164.1 μs | 164.0 μs | 164.3 μs | - | - | - | - |
| GetBytes | Cyrillic | 527.7 μs | 2.71 μs | 2.40 μs | 527.0 μs | 525.4 μs | 533.3 μs | 29.1667 | 29.1667 | 29.1667 | 100889 B |
| GetString | Cyrillic | 624.7 μs | 2.56 μs | 2.00 μs | 624.6 μs | 622.2 μs | 630.4 μs | 38.4615 | 38.4615 | 38.4615 | 130868 B |
| GetByteCount | Greek | 206.9 μs | 0.44 μs | 0.39 μs | 206.7 μs | 206.6 μs | 208.0 μs | - | - | - | - |
| GetBytes | Greek | 764.2 μs | 3.49 μs | 3.27 μs | 762.9 μs | 760.4 μs | 769.6 μs | 38.6905 | 38.6905 | 38.6905 | 129260 B |
| GetString | Greek | 962.2 μs | 4.29 μs | 3.59 μs | 961.0 μs | 958.4 μs | 970.9 μs | 46.8750 | 46.8750 | 46.8750 | 164279 B |

The GetByteCount data make it clear that the stalled cycles due to PopCount are only one of potentially many causes of the regression: with the suggested change, EnglishMostlyAscii improves from 565.9 μs to 414.5 μs but is still well above the 328.3 μs measured on 3.1. We need a thorough analysis to discover them all.
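
To make the size of the remaining gap concrete, the GetByteCount means from the three runs above can be turned into ratios against the 3.1 baseline. This is a small sketch that only copies the Mean column from those tables; the dictionary names and the slowdown helper are hypothetical and exist only for this illustration.

```python
# Mean GetByteCount times in microseconds, copied from the three runs above.
baseline_31 = {
    "EnglishAllAscii": 118.9,
    "EnglishMostlyAscii": 328.3,
    "Chinese": 149.9,
    "Cyrillic": 130.3,
    "Greek": 163.7,
}
net50 = {
    "EnglishAllAscii": 109.6,
    "EnglishMostlyAscii": 565.9,
    "Chinese": 258.3,
    "Cyrillic": 223.8,
    "Greek": 281.7,
}
net50_patched = {
    "EnglishAllAscii": 109.2,
    "EnglishMostlyAscii": 414.5,
    "Chinese": 189.1,
    "Cyrillic": 164.1,
    "Greek": 206.9,
}


def slowdown(candidate, baseline):
    """Return candidate/baseline per input; values above 1.0 are regressions."""
    return {name: candidate[name] / baseline[name] for name in baseline}


before = slowdown(net50, baseline_31)
after = slowdown(net50_patched, baseline_31)

for name in baseline_31:
    print(f"{name:20s} 5.0: {before[name]:.2f}x   5.0 patched: {after[name]:.2f}x")
```

On these numbers the four inputs containing non-ASCII text drop from roughly 1.72x the 3.1 time to roughly 1.26x with the change, while EnglishAllAscii is slightly faster than 3.1 in both 5.0 runs. Part of the regression is therefore still unexplained.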

I am moving this to .NET 6.0. I don’t believe this is a JIT issue, so I am relabeling this back to area-System.Text.Encoding.

cc @JulieLeeMSFT @jeffhandley