runtime: [ARM64] Performance regression: Utf8Encoding
After running benchmarks for 3.1 vs 5.0 using the “Ubuntu arm64 Qualcomm Machines” owned by the JIT Team, I’ve found a few regressions related to `Utf8Encoding`. They are all reproducible, and I’ve verified that it’s not a matter of loop alignment (by running them with `--envVars COMPlus_JitAlignLoops:1`).
It looks like an ARM64-specific regression; I was not able to reproduce it on ARM (the 32-bit variant).
Repro
git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter Perf_Utf8Encoding
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-VTSQOV : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-RAMSQZ : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

Method | Runtime | Input | Mean | Ratio | Allocated |
---|---|---|---|---|---|
GetByteCount | .NET Core 3.1 | EnglishAllAscii | 38.00 us | 1.00 | - |
GetByteCount | .NET Core 5.0 | EnglishAllAscii | 40.66 us | 1.07 | - |
GetBytes | .NET Core 3.1 | EnglishAllAscii | 101.09 us | 1.00 | 163840 B |
GetBytes | .NET Core 5.0 | EnglishAllAscii | 104.96 us | 1.04 | 163855 B |
GetString | .NET Core 3.1 | EnglishAllAscii | 103.47 us | 1.00 | 327648 B |
GetString | .NET Core 5.0 | EnglishAllAscii | 95.76 us | 0.93 | 327677 B |
GetByteCount | .NET Core 3.1 | EnglishMostlyAscii | 117.50 us | 1.00 | - |
GetByteCount | .NET Core 5.0 | EnglishMostlyAscii | 221.40 us | 1.88 | - |
GetBytes | .NET Core 3.1 | EnglishMostlyAscii | 273.49 us | 1.00 | 169880 B |
GetBytes | .NET Core 5.0 | EnglishMostlyAscii | 377.67 us | 1.38 | 169895 B |
GetString | .NET Core 3.1 | EnglishMostlyAscii | 262.55 us | 1.00 | 327656 B |
GetString | .NET Core 5.0 | EnglishMostlyAscii | 250.18 us | 0.95 | 327685 B |
GetByteCount | .NET Core 3.1 | Chinese | 53.34 us | 1.00 | - |
GetByteCount | .NET Core 5.0 | Chinese | 90.21 us | 1.69 | - |
GetBytes | .NET Core 3.1 | Chinese | 245.94 us | 1.00 | 177752 B |
GetBytes | .NET Core 5.0 | Chinese | 279.62 us | 1.14 | 177768 B |
GetString | .NET Core 3.1 | Chinese | 373.80 us | 1.00 | 150112 B |
GetString | .NET Core 5.0 | Chinese | 358.11 us | 0.96 | 150126 B |
GetByteCount | .NET Core 3.1 | Cyrillic | 45.35 us | 1.00 | - |
GetByteCount | .NET Core 5.0 | Cyrillic | 76.01 us | 1.68 | - |
GetBytes | .NET Core 3.1 | Cyrillic | 193.34 us | 1.00 | 100880 B |
GetBytes | .NET Core 5.0 | Cyrillic | 222.10 us | 1.15 | 100889 B |
GetString | .NET Core 3.1 | Cyrillic | 262.69 us | 1.00 | 130856 B |
GetString | .NET Core 5.0 | Cyrillic | 259.83 us | 0.99 | 130868 B |
GetByteCount | .NET Core 3.1 | Greek | 58.36 us | 1.00 | - |
GetByteCount | .NET Core 5.0 | Greek | 97.41 us | 1.67 | - |
GetBytes | .NET Core 3.1 | Greek | 275.88 us | 1.00 | 129248 B |
GetBytes | .NET Core 5.0 | Greek | 314.00 us | 1.14 | 129260 B |
GetString | .NET Core 3.1 | Greek | 394.55 us | 1.00 | 164264 B |
GetString | .NET Core 5.0 | Greek | 394.35 us | 1.00 | 164278 B |
Docs
- Profiling workflow for dotnet/runtime repository
- Benchmarking workflow for dotnet/runtime repository
About this issue
- State: closed
- Created 4 years ago
- Comments: 37 (37 by maintainers)
If we really need to do something in 5.0 and we’re running out of runway then the absolute safest thing to do would be to change the one line:
https://github.com/dotnet/runtime/blob/7d0d37001c5a0ac1f343cd35b27fea0ebf7e8101/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs#L82
To:
This will cause the `UTF8Encoding.GetByteCount(string)` method to fall back to the existing `Vector<T>`-based code paths as they existed in 3.1 instead of using the new intrinsics that were introduced in 5.0.

I assume that if we want to do this then we’d schedule a “proper” fix to come in during 5.0.1.
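For readers unfamiliar with what this hot path computes, here is a minimal Python sketch of the counting rule, not the runtime's implementation. It assumes well-formed input (no lone surrogates); the function names `utf16_code_units` and `utf8_byte_count` are made up for illustration. The UTF-8 length is the UTF-16 length plus an adjustment: +1 for every code unit at or above 0x80, another +1 for every code unit at or above 0x800, and -2 for every surrogate pair, since a pair encodes to 4 bytes rather than 6.

```python
def utf16_code_units(text: str) -> list:
    """Split a string into UTF-16 code units, the representation .NET strings use."""
    raw = text.encode("utf-16-le")  # raises on lone surrogates (assumed absent)
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def utf8_byte_count(text: str) -> int:
    """Hypothetical scalar model of a UTF-8 byte count over UTF-16 input."""
    units = utf16_code_units(text)
    count = len(units)  # every code unit needs at least one byte
    for unit in units:
        if unit >= 0x80:
            count += 1  # two-byte sequences and above
        if unit >= 0x800:
            count += 1  # three-byte sequences (including both surrogate halves)
        if 0xD800 <= unit <= 0xDBFF:
            count -= 2  # a high surrogate starts a pair: 4 bytes total, not 6
    return count
```

The vectorized code paths discussed in this thread compute the same adjustment several code units at a time, which is why the cost of counting set bits per iteration matters.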
@echesakovMSFT I’ll need to take a closer look, but as an initial assessment the `temp += (temp >> 32);` should be slightly better if you are generating an `ADD` with a shifted register (as in, a single instruction rather than a separate `add` + `shift`).

That said, looking at the algorithm, do you really need the reduction inside the loop? The value seems to really only be a counter. So instead, can’t you keep the value as a `vector128<uint>` during the loop, perform the final `addp` and the move to the genreg side after the loop, and add it to `tempUtf8CodeUnitCountAdjustment`?

I think we should look at the function as a whole instead of piecewise.
For instance, since the only things done on `popcnt` are `add` and `sub`, there’s no need to transfer between register files in the loop.

Using `AddScalar` instead during the loop avoids the transfer, as we can do scalar arithmetic on the SIMD side.

@jeffhandley I am assigning this to you now.
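The suggestion above, to keep the counter in vector form during the loop and reduce it once afterwards, can be modelled in plain Python. This is only a sketch under stated assumptions: a Python list stands in for the four 32-bit lanes of a 128-bit vector, `sum()` stands in for the horizontal add plus the move to a general-purpose register, and the names `lane_flags`, `count_reduce_every_iteration` and `count_reduce_once` are hypothetical. It is not the code in `Utf16Utility`, and it ignores surrogate pairs.

```python
LANES = 4  # models the four 32-bit lanes of a 128-bit vector


def lane_flags(block):
    """Per-lane number of extra UTF-8 bytes for one block of UTF-16 code units."""
    return [int(unit >= 0x80) + int(unit >= 0x800) for unit in block]


def count_reduce_every_iteration(units):
    """Reduce the lanes to a scalar inside the loop (one transfer per iteration)."""
    total = 0
    for start in range(0, len(units), LANES):
        lanes = lane_flags(units[start:start + LANES])
        total += sum(lanes)  # horizontal add + move to scalar, every iteration
    return total


def count_reduce_once(units):
    """Accumulate lane-wise inside the loop and reduce a single time afterwards."""
    acc = [0] * LANES
    for start in range(0, len(units), LANES):
        lanes = lane_flags(units[start:start + LANES])
        for i, value in enumerate(lanes):
            acc[i] += value  # stays on the "vector side": no transfer in the loop
    return sum(acc)  # single horizontal add after the loop
```

Both variants return the same adjustment; the second performs the reduction once rather than once per block. A real implementation would also have to make sure the lane accumulators cannot overflow for long inputs, which this model does not address.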
The measurements below were done on:
- .NET Core 3.1.6
- .NET Core 5.0.0
- .NET Core 5.0.0 (with the suggested change to `Utf16Utility.GetPointerToFirstInvalidChar`)

It’s clear from the data for the GetByteCount benchmark that the issue with stalled cycles due to `PopCount` is one of potentially many causes of the performance regression here. We need to do a thorough analysis to discover them all.

I am moving this to .NET 6.0. I don’t believe this is a JIT issue, so I am relabeling this back to area-System.Text.Encoding.
cc @JulieLeeMSFT @jeffhandley