runtime: [Perf] Regressions in Integer Formatting

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	4c7f4ee74d7a9e5bcc73b552dd020df02b039c8a
Compare	8fb2eb53f75971ce492f14bebb93ed7a236bcd6e
Diff	Diff

Regressions in System.Tests.Perf_Int32

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
ToString - Duration of single invocation	24.70 ns	34.44 ns	1.39	0.21
TryFormat - Duration of single invocation	19.39 ns	21.12 ns	1.09	0.17
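
The Test/Base column can be reproduced from the two timing columns. The snippet below is a minimal sketch that assumes the ratio is simply the compare time divided by the baseline time, rounded to two decimals; the reporting system may compute it from unrounded measurements, and the `ratio` helper and row labels are hypothetical names used only for illustration.

```python
# (benchmark, baseline_ns, compare_ns) copied from the Perf_Int32 table above
rows = [
    ("Perf_Int32.ToString", 24.70, 34.44),
    ("Perf_Int32.TryFormat", 19.39, 21.12),
]


def ratio(baseline_ns, compare_ns):
    """Compare time over baseline time; values above 1.0 mean a slowdown."""
    return round(compare_ns / baseline_ns, 2)


for name, baseline_ns, compare_ns in rows:
    print(f"{name}: {ratio(baseline_ns, compare_ns):.2f}")
```

Under that assumption the rounded timings give back the ratios listed in the table.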

graph Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Int32*'

Payloads

Baseline Compare

Histogram

System.Tests.Perf_Int32.ToString(value: 2147483647)

System.Tests.Perf_Int32.TryFormat(value: 2147483647)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Regressions in System.Tests.Perf_UInt64

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
TryFormat - Duration of single invocation	10.78 ns	12.11 ns	1.12	0.13

graph Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt64*'

Payloads

Baseline Compare

Histogram

System.Tests.Perf_UInt64.TryFormat(value: 12345)

Regressions in System.Tests.Perf_Version

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
TryFormatL - Duration of single invocation	96.02 ns	103.50 ns	1.08	0.15

graph Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_Version*'

Payloads

Baseline Compare

Histogram

System.Tests.Perf_Version.TryFormatL

Run Information

Architecture	arm64
OS	Windows 10.0.19041
Baseline	28e63279342bd2f6ca43442d864f487613a53bc9
Compare	d49bcbe0441f5c954cddcbe28a222eb34917bcaf
Diff	Diff

Regressions in System.Tests.Perf_UInt32

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
ToString - Duration of single invocation	22.92 ns	25.28 ns	1.10	0.24
TryFormat - Duration of single invocation	17.54 ns	20.77 ns	1.18	0.01

graph Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_UInt32*'

Payloads

Baseline Compare

Histogram

System.Tests.Perf_UInt32.ToString(value: 4294967295)

System.Tests.Perf_UInt32.TryFormat(value: 4294967295)

About this issue

State: closed
Created 3 years ago
Comments: 21 (20 by maintainers)

Most upvoted comments

As a side note, while looking at the diff I noticed some other general improvements that could be made in the future:

sxtw    x3, w6	
lsl     x3, x3, #1	
add     x2, x2, x3

can become

add     x2, x2, w6, sxtw, #1

and

lsl     w3, w3, #1
sub     w3, w0, w3

can become

sub     w3, w0, w3, lsl #1

lastly

sub     x2, x2, #2
strh    w1, [x2]

can become

strh    w1, [x2, #-2]!

You probably also want to move the constant out of the loop.

TamarChristinaArm on Aug 16, 2021

So I managed to finally generate a diff: https://www.diffchecker.com/Rr7cRI2V I suspect the slowdown comes from using mul Xd, Xn, Xm instead of umull Xd, Wn, Wm?

I think this is the correct culprit: the loop has a long dependency chain off the value produced by the mul/umull, with few independent instructions, so it’s bound by these two, and 64-bit mul has twice the latency of umull.

Edit: ARM software optimization guides seem to indicate that mul and umull have identical latencies/throughput. umulh is more expensive, but it’s only used once and it should still be faster than a bunch of lsr/sub/add instructions…

64-bit mul has a latency of 4(3) and umull has 2(1) on Cortex-A76. Were you perhaps comparing the AArch32 instructions? Note that on AArch64 mul and umull aren’t real instructions but are architectural aliases for madd and umaddl respectively so you need to look at the latencies for those.

The mul seems to be unneeded, though: both inputs are 32 bits, so you should be able to just use umull there as before.
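
To illustrate why a widening 32-bit multiply is enough when both inputs are 32 bits, here is a minimal sketch in Python. It assumes the multiply in the formatting loop comes from a division by 10 that has been lowered to a multiply by a reciprocal constant; the thread does not state this, and the names `umull`, `div10`, `to_decimal`, and `MAGIC` are hypothetical models, not the actual runtime or JIT code.

```python
MASK32 = 0xFFFFFFFF
MAGIC = 0xCCCCCCCD  # ceil(2**35 / 10); fits in a 32-bit register


def umull(wn, wm):
    """Model of umull Xd, Wn, Wm: unsigned 32x32 -> 64-bit product."""
    return (wn & MASK32) * (wm & MASK32)


def div10(x):
    """Unsigned 32-bit x // 10 using one widening multiply and a shift."""
    return umull(x, MAGIC) >> 35


def to_decimal(x):
    """Peel off decimal digits from the least significant end."""
    digits = []
    while True:
        q = div10(x)  # each quotient feeds the next iteration
        digits.append(chr(ord("0") + (x - q * 10)))
        x = q
        if x == 0:
            break
    return "".join(reversed(digits))


print(to_decimal(2147483647))
```

Because the largest possible product, (2**32 - 1) squared, is still below 2**64, the 32x32 to 64-bit form loses nothing compared with a full 64-bit multiply. Each iteration's quotient is the input to the next one, which is the dependency chain described above.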

TamarChristinaArm on Aug 13, 2021

Test history implicates #52893. In particular, the change was merged, reverted, and then re-merged, which matches the up-down-up pattern seen here:

[image: newplot (66), benchmark history plot]

As for how to get diffs, it should be something like:

;; builds as you do above, in both "ref" and "diff" repos

jit-diff diff --base --base_root <path to ref repo> --core_root <path to release core_root in either repo> --altjit clrjit_win_arm64_x64.dll -t X --pmi

jit-diff diff --diff --diff_root <path to diff repo> --core_root <path to release core_root in either repo> --altjit clrjit_win_arm64_x64.dll -t X --pmi

jit-analyze --base X\base --diff X\diff

AndyAyersMS on Aug 12, 2021