runtime: [ARM64] Performance regression: Utf8Encoding

After running benchmarks for 3.1 vs 5.0 using “Ubuntu arm64 Qualcomm Machines” owned by the JIT Team, I’ve found a few regressions related to Utf8Encoding. They are all reproducible, and I’ve verified that they are not caused by loop alignment (by running them with --envVars COMPlus_JitAlignLoops:1).

This looks like an ARM64-specific regression; I was not able to reproduce it on ARM (the 32-bit variant).

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter Perf_Utf8Encoding

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-VTSQOV : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-RAMSQZ : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

Method	Runtime	Input	Mean	Ratio	Allocated
GetByteCount	.NET Core 3.1	EnglishAllAscii	38.00 us	1.00	-
GetByteCount	.NET Core 5.0	EnglishAllAscii	40.66 us	1.07	-

GetBytes	.NET Core 3.1	EnglishAllAscii	101.09 us	1.00	163840 B
GetBytes	.NET Core 5.0	EnglishAllAscii	104.96 us	1.04	163855 B

GetString	.NET Core 3.1	EnglishAllAscii	103.47 us	1.00	327648 B
GetString	.NET Core 5.0	EnglishAllAscii	95.76 us	0.93	327677 B

GetByteCount	.NET Core 3.1	EnglishMostlyAscii	117.50 us	1.00	-
GetByteCount	.NET Core 5.0	EnglishMostlyAscii	221.40 us	1.88	-

GetBytes	.NET Core 3.1	EnglishMostlyAscii	273.49 us	1.00	169880 B
GetBytes	.NET Core 5.0	EnglishMostlyAscii	377.67 us	1.38	169895 B

GetString	.NET Core 3.1	EnglishMostlyAscii	262.55 us	1.00	327656 B
GetString	.NET Core 5.0	EnglishMostlyAscii	250.18 us	0.95	327685 B

GetByteCount	.NET Core 3.1	Chinese	53.34 us	1.00	-
GetByteCount	.NET Core 5.0	Chinese	90.21 us	1.69	-

GetBytes	.NET Core 3.1	Chinese	245.94 us	1.00	177752 B
GetBytes	.NET Core 5.0	Chinese	279.62 us	1.14	177768 B

GetString	.NET Core 3.1	Chinese	373.80 us	1.00	150112 B
GetString	.NET Core 5.0	Chinese	358.11 us	0.96	150126 B

GetByteCount	.NET Core 3.1	Cyrillic	45.35 us	1.00	-
GetByteCount	.NET Core 5.0	Cyrillic	76.01 us	1.68	-

GetBytes	.NET Core 3.1	Cyrillic	193.34 us	1.00	100880 B
GetBytes	.NET Core 5.0	Cyrillic	222.10 us	1.15	100889 B

GetString	.NET Core 3.1	Cyrillic	262.69 us	1.00	130856 B
GetString	.NET Core 5.0	Cyrillic	259.83 us	0.99	130868 B

GetByteCount	.NET Core 3.1	Greek	58.36 us	1.00	-
GetByteCount	.NET Core 5.0	Greek	97.41 us	1.67	-

GetBytes	.NET Core 3.1	Greek	275.88 us	1.00	129248 B
GetBytes	.NET Core 5.0	Greek	314.00 us	1.14	129260 B

GetString	.NET Core 3.1	Greek	394.55 us	1.00	164264 B
GetString	.NET Core 5.0	Greek	394.35 us	1.00	164278 B

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

cc @kunalspathak @carlossanlop @pgovind @tannergooding

About this issue

State: closed
Created 4 years ago
Comments: 37 (37 by maintainers)

Most upvoted comments

If we really need to do something in 5.0 and we’re running out of runway then the absolute safest thing to do would be to change the one line:

https://github.com/dotnet/runtime/blob/7d0d37001c5a0ac1f343cd35b27fea0ebf7e8101/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs#L82

To:

if (Sse2.IsSupported)

This will cause the UTF8Encoding.GetByteCount(string) method to fall back to the existing Vector<T>-based code paths as they existed in 3.1 instead of using the new intrinsics that were introduced in 5.0.

I assume that if we want to do this then we’d schedule a “proper” fix to come in during 5.0.1.

GrabYourPitchforks on Sep 9, 2020

@echesakovMSFT I’ll need to take a closer look, but as an initial assessment, temp += (temp >> 32); should be slightly better if you are generating an ADD with a shifted register (that is, a single instruction rather than a separate shift + add).
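To make the arithmetic of temp += (temp >> 32) concrete, here is a minimal Python sketch. It assumes temp is a 64-bit value holding two independent 32-bit counters (the surrounding C# code is not shown in this thread, so that layout is an assumption), and fold_halves is a hypothetical helper name used only for illustration.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit general-purpose register
MASK32 = (1 << 32) - 1

def fold_halves(temp):
    """Model `temp += (temp >> 32)` and return the low 32 bits of the result."""
    temp = (temp + (temp >> 32)) & MASK64  # wraps like a 64-bit add
    return temp & MASK32                   # low half now holds hi + lo (mod 2**32)

# Two hypothetical counters packed as (hi << 32) | lo.
packed = (7 << 32) | 5
print(fold_halves(packed))
```

The shift and the add are a single expression here; the comment above is about whether the JIT also emits them as a single ADD with a shifted register operand.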

That said, looking at the algorithm, do you really need the reduction inside the loop? The value seems to really only be a counter. So can’t you instead keep the value as a Vector128<uint> during the loop, perform the final addp and the move to the general-register side after the loop, and then add it to tempUtf8CodeUnitCountAdjustment?

I think we should look at the function as a whole instead of piecewise.

For instance since the only things done on popcnt are add and sub there’s no need to transfer between register files in the loop.

+        private static Vector64<uint> CountNonAsciiBytes(Vector128<byte> vec)

and using AddScalar instead during the loop avoids the transfer as we can do scalar arithmetic on the SIMD side.
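The idea in this comment, accumulating lane-wise inside the loop and reducing once afterwards, can be modeled in plain Python. This is only a conceptual sketch and not the C# code from dotnet/runtime: Python lists stand in for SIMD lanes, LANES and all function names are hypothetical, and the point is just that both strategies produce the same count while the second performs a single horizontal reduction.

```python
LANES = 4  # hypothetical lane count standing in for a SIMD register

def is_non_ascii(code_unit):
    """Return 1 if the UTF-16 code unit is outside the ASCII range, else 0."""
    return 1 if code_unit > 0x7F else 0

def count_reduce_in_loop(code_units):
    """Reduce each chunk to a scalar inside the loop (one reduction per chunk)."""
    total = 0
    for i in range(0, len(code_units), LANES):
        lane_counts = [is_non_ascii(c) for c in code_units[i:i + LANES]]
        total += sum(lane_counts)  # horizontal add on every iteration
    return total

def count_reduce_after_loop(code_units):
    """Accumulate per lane inside the loop and reduce once at the end."""
    acc = [0] * LANES
    for i in range(0, len(code_units), LANES):
        for lane, c in enumerate(code_units[i:i + LANES]):
            acc[lane] += is_non_ascii(c)  # vertical add, stays "in the vector"
    return sum(acc)  # single horizontal add after the loop

# All characters are in the BMP, so ord() equals the UTF-16 code unit.
text = "na\u00efve \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac text"
units = [ord(ch) for ch in text]
assert count_reduce_in_loop(units) == count_reduce_after_loop(units)
```

In the real code the per-lane accumulator would live in a SIMD register, so the loop body would not need to move data between register files.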

TamarChristinaArm on Sep 8, 2020

@jeffhandley I am assigning this to you now.

JulieLeeMSFT on Sep 4, 2020

The measurements below were taken on:

processor       : 0
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 1
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 2
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 3
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

.NET Core 3.1.6


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT
  Job-VJGWPE : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 3.1  Arguments=/p:DebugType=portable
Toolchain=netcoreapp3.1  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

Method	Input	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
GetByteCount	EnglishAllAscii	118.9 μs	0.24 μs	0.21 μs	118.9 μs	118.6 μs	119.4 μs	-	-	-	-
GetBytes	EnglishAllAscii	268.4 μs	2.14 μs	1.78 μs	267.6 μs	266.4 μs	272.4 μs	49.7881	49.7881	49.7881	163840 B
GetString	EnglishAllAscii	322.4 μs	1.78 μs	1.49 μs	321.7 μs	321.5 μs	326.5 μs	99.4898	99.4898	99.4898	327648 B
GetByteCount	EnglishMostlyAscii	328.3 μs	0.43 μs	0.39 μs	328.1 μs	328.0 μs	329.3 μs	-	-	-	-
GetBytes	EnglishMostlyAscii	680.3 μs	0.80 μs	0.62 μs	680.3 μs	679.3 μs	681.5 μs	51.6304	51.6304	51.6304	169880 B
GetString	EnglishMostlyAscii	591.1 μs	2.17 μs	1.92 μs	590.2 μs	588.7 μs	594.6 μs	99.5370	99.5370	99.5370	327656 B
GetByteCount	Chinese	149.9 μs	0.29 μs	0.26 μs	149.9 μs	149.7 μs	150.5 μs	-	-	-	-
GetBytes	Chinese	646.8 μs	1.21 μs	0.95 μs	647.0 μs	645.1 μs	647.9 μs	55.0000	55.0000	55.0000	177752 B
GetString	Chinese	943.7 μs	2.97 μs	2.63 μs	943.3 μs	940.9 μs	950.1 μs	44.1176	44.1176	44.1176	150112 B
GetByteCount	Cyrillic	130.3 μs	0.21 μs	0.19 μs	130.4 μs	130.0 μs	130.7 μs	-	-	-	-
GetBytes	Cyrillic	487.0 μs	1.30 μs	1.08 μs	486.8 μs	485.2 μs	489.2 μs	29.2969	29.2969	29.2969	100880 B
GetString	Cyrillic	648.8 μs	1.74 μs	1.45 μs	649.5 μs	646.6 μs	650.9 μs	39.0625	39.0625	39.0625	130856 B
GetByteCount	Greek	163.7 μs	0.08 μs	0.07 μs	163.7 μs	163.6 μs	163.8 μs	-	-	-	-
GetBytes	Greek	723.1 μs	3.83 μs	3.58 μs	721.8 μs	718.7 μs	728.8 μs	39.7727	39.7727	39.7727	129248 B
GetString	Greek	968.3 μs	8.42 μs	7.47 μs	965.3 μs	960.8 μs	980.7 μs	47.7941	47.7941	47.7941	164264 B

.NET Core 5.0.0


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-OUMKUS : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

Method	Input	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
GetByteCount	EnglishAllAscii	109.6 μs	0.51 μs	0.47 μs	109.3 μs	109.3 μs	110.6 μs	-	-	-	-
GetBytes	EnglishAllAscii	261.1 μs	2.71 μs	2.53 μs	259.9 μs	258.7 μs	266.3 μs	48.9583	48.9583	48.9583	163854 B
GetString	EnglishAllAscii	205.6 μs	0.58 μs	0.46 μs	205.6 μs	204.8 μs	206.3 μs	99.5066	99.5066	99.5066	327677 B
GetByteCount	EnglishMostlyAscii	565.9 μs	0.93 μs	0.78 μs	565.9 μs	564.2 μs	567.2 μs	-	-	-	1 B
GetBytes	EnglishMostlyAscii	912.2 μs	1.65 μs	1.29 μs	912.0 μs	910.3 μs	914.8 μs	52.0833	52.0833	52.0833	169896 B
GetString	EnglishMostlyAscii	574.2 μs	1.69 μs	1.32 μs	573.6 μs	572.5 μs	576.9 μs	99.5370	99.5370	99.5370	327685 B
GetByteCount	Chinese	258.3 μs	0.83 μs	0.77 μs	257.9 μs	257.5 μs	259.8 μs	-	-	-	-
GetBytes	Chinese	749.3 μs	3.55 μs	3.14 μs	747.9 μs	746.7 μs	756.1 μs	53.5714	53.5714	53.5714	177768 B
GetString	Chinese	896.2 μs	9.65 μs	9.03 μs	891.6 μs	889.5 μs	914.0 μs	45.1389	45.1389	45.1389	150126 B
GetByteCount	Cyrillic	223.8 μs	0.46 μs	0.39 μs	223.6 μs	223.4 μs	224.8 μs	-	-	-	-
GetBytes	Cyrillic	592.1 μs	4.66 μs	4.13 μs	590.1 μs	588.8 μs	602.4 μs	30.0926	30.0926	30.0926	100889 B
GetString	Cyrillic	630.9 μs	1.32 μs	1.17 μs	630.6 μs	629.6 μs	633.7 μs	37.5000	37.5000	37.5000	130868 B
GetByteCount	Greek	281.7 μs	0.35 μs	0.31 μs	281.8 μs	281.2 μs	282.3 μs	-	-	-	-
GetBytes	Greek	844.0 μs	2.50 μs	2.09 μs	843.5 μs	841.8 μs	848.9 μs	39.4737	39.4737	39.4737	129260 B
GetString	Greek	951.6 μs	10.59 μs	9.91 μs	949.3 μs	941.3 μs	971.0 μs	47.7941	47.7941	47.7941	164279 B

.NET Core 5.0.0 (with the suggested change to Utf16Utility.GetPointerToFirstInvalidChar)


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-XEOYJB : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

Method	Input	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
GetByteCount	EnglishAllAscii	109.2 μs	0.46 μs	0.41 μs	109.0 μs	108.8 μs	110.0 μs	-	-	-	-
GetBytes	EnglishAllAscii	260.3 μs	2.08 μs	1.74 μs	259.7 μs	259.0 μs	265.2 μs	48.9583	48.9583	48.9583	163854 B
GetString	EnglishAllAscii	207.3 μs	1.23 μs	1.03 μs	206.9 μs	206.3 μs	209.7 μs	99.5066	99.5066	99.5066	327677 B
GetByteCount	EnglishMostlyAscii	414.5 μs	0.49 μs	0.41 μs	414.3 μs	414.3 μs	415.6 μs	-	-	-	-
GetBytes	EnglishMostlyAscii	753.2 μs	3.51 μs	2.93 μs	752.4 μs	750.2 μs	759.9 μs	50.5952	50.5952	50.5952	169895 B
GetString	EnglishMostlyAscii	574.0 μs	8.49 μs	7.94 μs	570.8 μs	566.9 μs	590.0 μs	98.2143	98.2143	98.2143	327685 B
GetByteCount	Chinese	189.1 μs	0.16 μs	0.14 μs	189.1 μs	189.0 μs	189.5 μs	-	-	-	-
GetBytes	Chinese	675.3 μs	1.07 μs	0.89 μs	675.1 μs	673.9 μs	677.0 μs	54.3478	54.3478	54.3478	177768 B
GetString	Chinese	895.7 μs	6.66 μs	6.23 μs	892.0 μs	889.4 μs	904.7 μs	45.1389	45.1389	45.1389	150126 B
GetByteCount	Cyrillic	164.1 μs	0.11 μs	0.09 μs	164.1 μs	164.0 μs	164.3 μs	-	-	-	-
GetBytes	Cyrillic	527.7 μs	2.71 μs	2.40 μs	527.0 μs	525.4 μs	533.3 μs	29.1667	29.1667	29.1667	100889 B
GetString	Cyrillic	624.7 μs	2.56 μs	2.00 μs	624.6 μs	622.2 μs	630.4 μs	38.4615	38.4615	38.4615	130868 B
GetByteCount	Greek	206.9 μs	0.44 μs	0.39 μs	206.7 μs	206.6 μs	208.0 μs	-	-	-	-
GetBytes	Greek	764.2 μs	3.49 μs	3.27 μs	762.9 μs	760.4 μs	769.6 μs	38.6905	38.6905	38.6905	129260 B
GetString	Greek	962.2 μs	4.29 μs	3.59 μs	961.0 μs	958.4 μs	970.9 μs	46.8750	46.8750	46.8750	164279 B

It’s clear from the GetByteCount data that stalled cycles due to PopCount are only one of potentially many causes of the performance regression here. We need to do a thorough analysis to discover them all.
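As a quick check of that statement, the GetByteCount means from the three tables above can be turned into slowdown factors relative to 3.1.6. The Python sketch below only copies the Mean column values from this comment; the dictionary and function names are hypothetical.

```python
# Mean GetByteCount times in microseconds, copied from the three tables above.
# Tuple order: (.NET Core 3.1.6, .NET Core 5.0.0, 5.0.0 with the suggested change)
get_byte_count_us = {
    "EnglishAllAscii":    (118.9, 109.6, 109.2),
    "EnglishMostlyAscii": (328.3, 565.9, 414.5),
    "Chinese":            (149.9, 258.3, 189.1),
    "Cyrillic":           (130.3, 223.8, 164.1),
    "Greek":              (163.7, 281.7, 206.9),
}

def ratios(baseline, regressed, patched):
    """Return (5.0 / 3.1, patched 5.0 / 3.1) as slowdown factors."""
    return regressed / baseline, patched / baseline

for name, (base, v50, fixed) in get_byte_count_us.items():
    r50, rfixed = ratios(base, v50, fixed)
    print(f"{name:<20} 5.0/3.1 = {r50:.2f}   patched/3.1 = {rfixed:.2f}")
```

For the four inputs that contain non-ASCII text, the slowdown drops from roughly 1.72x to roughly 1.26x with the change, so a sizeable gap against 3.1.6 remains.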

I am moving this to .NET 6.0. I don’t believe this is a JIT issue, so I am relabeling this back to area-System.Text.Encoding.

cc @JulieLeeMSFT @jeffhandley

echesakov on Sep 4, 2020