runtime: [Perf] Regressions in Integer Formatting

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_Int32

Benchmark                                  Baseline  Test      Test/Base  Test Quality
ToString - Duration of single invocation   24.70 ns  34.44 ns  1.39       0.21
TryFormat - Duration of single invocation  19.39 ns  21.12 ns  1.09       0.17

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Int32*'

Payloads

Baseline Compare

Histogram

System.Tests.Perf_Int32.ToString(value: 2147483647)


System.Tests.Perf_Int32.TryFormat(value: 2147483647)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_UInt64

Benchmark                                  Baseline  Test      Test/Base  Test Quality
TryFormat - Duration of single invocation  10.78 ns  12.11 ns  1.12       0.13

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt64*'

Payloads

Baseline Compare

Histogram

System.Tests.Perf_UInt64.TryFormat(value: 12345)


Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare 8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff Diff

Regressions in System.Tests.Perf_Version

Benchmark                                   Baseline  Test       Test/Base  Test Quality
TryFormatL - Duration of single invocation  96.02 ns  103.50 ns  1.08       0.15

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Version*'

Payloads

Baseline Compare

Histogram

System.Tests.Perf_Version.TryFormatL


Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 28e63279342bd2f6ca43442d864f487613a53bc9
Compare d49bcbe0441f5c954cddcbe28a222eb34917bcaf
Diff Diff

Regressions in System.Tests.Perf_UInt32

Benchmark                                  Baseline  Test      Test/Base  Test Quality
ToString - Duration of single invocation   22.92 ns  25.28 ns  1.10       0.24
TryFormat - Duration of single invocation  17.54 ns  20.77 ns  1.18       0.01

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt32*'

Payloads

Baseline Compare

Histogram

System.Tests.Perf_UInt32.ToString(value: 4294967295)


System.Tests.Perf_UInt32.TryFormat(value: 4294967295)


About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 21 (20 by maintainers)

Most upvoted comments

As a side note, while looking at the diff I noticed some other general improvements that could be made in the future:

sxtw    x3, w6
lsl     x3, x3, #1
add     x2, x2, x3

can become

add     x2, x2, w6, sxtw #1

and

lsl     w3, w3, #1
sub     w3, w0, w3

can become

sub     w3, w0, w3, lsl #1

lastly

sub     x2, x2, #2
strh    w1, [x2]

can become

strh    w1, [x2, #-2]!

The constant should probably also be moved out of the loop.
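
To make the first two rewrites concrete, here is a rough Python model of the register arithmetic. It is only a sketch: registers are plain integers masked to 32 or 64 bits, the helper names are invented for illustration, and it says nothing about what the JIT actually emits. It only checks that the folded forms compute the same values as the longer sequences.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def sxtw(w):
    """Sign-extend a 32-bit value to a 64-bit bit pattern."""
    w &= MASK32
    return (w | ~MASK32) & MASK64 if w & 0x80000000 else w


def add_three_instructions(x2, w6):
    x3 = sxtw(w6)                # sxtw x3, w6
    x3 = (x3 << 1) & MASK64      # lsl  x3, x3, #1
    return (x2 + x3) & MASK64    # add  x2, x2, x3


def add_extended_register(x2, w6):
    # add x2, x2, w6, sxtw #1: sign-extend, shift left by 1, then add.
    return (x2 + ((sxtw(w6) << 1) & MASK64)) & MASK64


def sub_two_instructions(w0, w3):
    w3 = (w3 << 1) & MASK32      # lsl w3, w3, #1
    return (w0 - w3) & MASK32    # sub w3, w0, w3


def sub_shifted_register(w0, w3):
    # sub w3, w0, w3, lsl #1: shift the second operand, then subtract.
    return (w0 - ((w3 << 1) & MASK32)) & MASK32


def forms_agree():
    """Compare the long and folded forms on a few edge-case inputs."""
    samples = [0, 1, 9, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
    bases = [0, 0x1000, MASK64]
    adds = all(add_three_instructions(b, s) == add_extended_register(b, s)
               for b in bases for s in samples)
    subs = all(sub_two_instructions(a, s) == sub_shifted_register(a, s)
               for a in samples for s in samples)
    return adds and subs


print(forms_agree())
```

The third rewrite (pre-indexed strh) is a pure addressing-mode change, so it is left out of the model.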

So I finally managed to generate a diff: https://www.diffchecker.com/Rr7cRI2V

I suspect the slowdown comes from using mul Xd, Xn, Xm instead of umull Xd, Wn, Wm.

I think this is the correct culprit. The loop has a long dependency chain off the value produced by the mul/umull, with few independent instructions, so it is bound by these two, and 64-bit mul has twice the latency of umull.

Edit: ARM software optimization guides seem to indicate that mul and umull have identical latencies/throughput. umulh is more expensive, but it’s only used once and it should still be faster than a bunch of lsr/sub/add instructions…

64-bit mul has a latency of 4(3) and umull has 2(1) on Cortex-A76. Were you perhaps comparing the AArch32 instructions? Note that on AArch64 mul and umull aren’t real instructions but are architectural aliases for madd and umaddl respectively so you need to look at the latencies for those.

The mul seems to be unneeded though: both inputs are 32 bits wide, so you should be able to just use umull there as before.
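
A small Python model shows why the two instructions are interchangeable when both operands are zero-extended 32-bit values. The divide-by-10 step is an assumption added for illustration: the thread only names the instructions, and 0xCCCCCCCD with a shift of 35 is the usual reciprocal constant for unsigned 32-bit division by 10, not something confirmed from the diff. Function names are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def umull(wn, wm):
    """umull Xd, Wn, Wm: unsigned 32 x 32 -> 64-bit widening multiply."""
    return (wn & MASK32) * (wm & MASK32)


def mul64(xn, xm):
    """mul Xd, Xn, Xm: low 64 bits of a 64 x 64-bit multiply."""
    return ((xn & MASK64) * (xm & MASK64)) & MASK64


def div10(value, multiply):
    """Assumed digit-loop step: value // 10 via reciprocal multiply and shift."""
    return multiply(value, 0xCCCCCCCD) >> 35


def multiplies_agree():
    """Check both multiply forms against plain integer division by 10."""
    samples = [0, 9, 10, 12345, 2147483647, 4294967295]
    return all(div10(v, umull) == div10(v, mul64) == v // 10 for v in samples)


print(multiplies_agree())
```

Because the product of two 32-bit values always fits in 64 bits, the 64-bit mul never truncates here, so choosing umull changes only the cost of the instruction, not the result.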

Test history implicates #52893. In particular, the change was merged, reverted, and then re-merged, which matches the up-down-up pattern seen in the history graph:

[graph: newplot (66)]

As for how to get diffs, it should be something like:

;; builds as you do above, in both "ref" and "diff" repos

jit-diff diff --base --base_root <path to ref repo> --core_root <path to release core_root in either repo> --altjit clrjit_win_arm64_x64.dll -t X --pmi

jit-diff diff --diff --diff_root <path to diff repo> --core_root <path to release core_root in either repo> --altjit clrjit_win_arm64_x64.dll -t X --pmi

jit-analyze --base X\base --diff X\diff