runtime: Performance regression: float.ToString(format) 20% to 3x slower

After the changes introduced to number formatting, the vast majority of operations are faster. However, this is not true for float.ToString(format), which is 20% to 3x slower.

https://github.com/dotnet/performance/blob/1b4c089465cde5d8a6454a2e53b37f662f3964b0/src/benchmarks/micro/corefx/System.Runtime/Perf.Single.cs#L42-L45

Repro

git clone https://github.com/dotnet/performance.git
cd performance
# if you don't have the CLI installed and want the Python script to download the latest CLI for you
py .\scripts\benchmarks_ci.py -f netcoreapp2.2 netcoreapp3.0 --filter System.Tests.Perf_Single.ToStringWithFormat
# if you do
dotnet run -p .\src\benchmarks\micro\MicroBenchmarks.csproj -c Release -f netcoreapp2.2 --filter System.Tests.Perf_Single.ToStringWithFormat --runtimes netcoreapp2.2 netcoreapp3.0

BenchmarkDotNet=v0.11.3.1003-nightly, OS=Windows 10.0.18362
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview8-013262
   [Host]     : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT
   Job-BYJCMJ : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT
   Job-JSSCYO : .NET Core 3.0.0-preview8-27916-02 (CoreCLR 4.700.19.36302, CoreFX 4.700.19.36514), 64bit RyuJIT

Method	Toolchain	value	format	Mean	Ratio	Allocated Memory/Op
ToStringWithFormat	netcoreapp2.2	-3,402823E+38	E	154.8 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	-3,402823E+38	E	193.7 ns	1.25	56 B

ToStringWithFormat	netcoreapp2.2	-3,402823E+38	F50	447.2 ns	1.00	208 B
ToStringWithFormat	netcoreapp3.0	-3,402823E+38	F50	1,475.5 ns	3.31	208 B

ToStringWithFormat	netcoreapp2.2	-3,402823E+38	G	152.7 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	-3,402823E+38	G	215.2 ns	1.41	56 B

ToStringWithFormat	netcoreapp2.2	-3,402823E+38	G17	160.2 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	-3,402823E+38	G17	238.2 ns	1.49	72 B

ToStringWithFormat	netcoreapp2.2	-3,402823E+38	R	245.7 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	-3,402823E+38	R	216.4 ns	0.88	56 B

ToStringWithFormat	netcoreapp2.2	12345	E	166.6 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	12345	E	213.2 ns	1.28	48 B

ToStringWithFormat	netcoreapp2.2	12345	F50	318.9 ns	1.00	144 B
ToStringWithFormat	netcoreapp3.0	12345	F50	448.9 ns	1.41	136 B

ToStringWithFormat	netcoreapp2.2	12345	G	146.6 ns	1.00	40 B
ToStringWithFormat	netcoreapp3.0	12345	G	183.4 ns	1.25	32 B

ToStringWithFormat	netcoreapp2.2	12345	G17	161.9 ns	1.00	40 B
ToStringWithFormat	netcoreapp3.0	12345	G17	349.4 ns	2.16	32 B

ToStringWithFormat	netcoreapp2.2	12345	R	172.8 ns	1.00	40 B
ToStringWithFormat	netcoreapp3.0	12345	R	185.1 ns	1.07	32 B

ToStringWithFormat	netcoreapp2.2	3,402823E+38	E	149.5 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	3,402823E+38	E	188.5 ns	1.26	48 B

ToStringWithFormat	netcoreapp2.2	3,402823E+38	F50	437.2 ns	1.00	208 B
ToStringWithFormat	netcoreapp3.0	3,402823E+38	F50	1,523.3 ns	3.48	208 B

ToStringWithFormat	netcoreapp2.2	3,402823E+38	G	151.5 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	3,402823E+38	G	212.8 ns	1.40	48 B

ToStringWithFormat	netcoreapp2.2	3,402823E+38	G17	157.9 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	3,402823E+38	G17	237.0 ns	1.50	72 B

ToStringWithFormat	netcoreapp2.2	3,402823E+38	R	243.0 ns	1.00	56 B
ToStringWithFormat	netcoreapp3.0	3,402823E+38	R	213.8 ns	0.88	48 B

/cc @danmosemsft @tannergooding

category:cq theme:floating-point skill-level:expert cost:large

About this issue

State: open
Created 5 years ago
Comments: 46 (46 by maintainers)

Most upvoted comments

We’re in the prolog, so making calls is problematic. It requires special care and usually some kind of bespoke calling convention (some native compilers do this for stack checks, for instance).

There’s nothing blocking us from generating different code to zero the slots. We are not GC live at this point so can use whatever instructions will work. But we also would like to minimize the set of registers used and any shuffling needed around the zeroing sequence. For example REP STOS needs RCX which is usually live at this point – so almost certainly the current heuristic is underestimating the cost of this kind of loop. It is worse on SysV.

So I think the way forward is:

Review heuristics for when to generate block stores for prolog zeroing, especially cases where there are structs with both GC and non-GC fields. Likely we should be generating inline sequences for more cases than we do now.
Enable use of wider stores (say via XMM) to zero slots (which can decrease the cost of inline sequences and make their use even more widespread) and update the heuristic accordingly.
Reconsider whether all these stack slots really need to be untracked lifetimes. In some cases we might be able to defer the zeroing until sometime later in the method, or avoid it altogether.

The extra prolog costs that come up because a struct is used somewhere in the method can make it hard to reason about struct perf.

AndyAyersMS on Jul 18, 2019

GCC simply reuses the remainder returned by the DIV instruction.

mikedn on Jul 18, 2019
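
As a rough illustration of the comment above, the C sketch below computes a quotient and a remainder from the same operands. The names divmod_u32 and divmod_result are hypothetical and come from neither the runtime nor GCC. On x86-64, the 32-bit DIV instruction leaves the quotient in EAX and the remainder in EDX, so an optimizing compiler can usually serve both expressions with one division when the divisor is not a compile-time constant. Whether a particular compiler and flag set actually does so is an assumption to confirm by reading the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: both outputs of one unsigned division. */
typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} divmod_result;

/* Hypothetical helper. The / and % below use identical operands, which is
   the pattern a compiler can fold into a single DIV on x86-64. */
static divmod_result divmod_u32(uint32_t value, uint32_t divisor)
{
    divmod_result r;
    assert(divisor != 0);            /* division by zero is undefined */
    r.quotient = value / divisor;
    r.remainder = value % divisor;
    return r;
}
```

Calling divmod_u32(12345u, 1000u) yields a quotient of 12 and a remainder of 345. A digit-extraction loop in a number formatter has the same shape, with the base as the divisor.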

My mistake, looks like it did pick it up

G_M61719_IG01:
       push     rbp
       push     r15
       push     r14
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 168
       movaps   qword ptr [rsp+90H], xmm6
       lea      rbp, [rsp+30H]
       xorps    xmm4, xmm4
       movaps   xmmword ptr [rbp+10H], xmm4
       movaps   xmmword ptr [rbp+20H], xmm4
       movaps   xmmword ptr [rbp+30H], xmm4
       movaps   xmmword ptr [rbp+40H], xmm4
       xor      rax, rax
       mov      qword ptr [rbp+50H], rax
       mov      rax, qword ptr [(reloc)]
       mov      qword ptr [rbp+08H], rax
       mov      rbx, rcx
       mov      rsi, r8
       mov      rdi, r9
       movaps   xmm6, xmm1

benaadams on Mar 6, 2020
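
The prolog listing above zeroes the 72 bytes from [rbp+10H] up to [rbp+58H) with four 16-byte XMM stores followed by one 8-byte store. The C sketch below mirrors that shape in portable code so the store count is easy to see. It is only an illustration under stated assumptions: zero_slots is a hypothetical name, the length is assumed to be a multiple of 8, and the memcpy calls stand in for the wide stores (a compiler may or may not lower them to single 16-byte instructions). It is not how the JIT emits its prolog.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: zero `len` bytes at `dst` in 16-byte chunks, then
   finish with one 8-byte chunk if needed. Returns the number of chunk
   stores issued. Assumes `len` is a multiple of 8. */
static size_t zero_slots(unsigned char *dst, size_t len)
{
    static const unsigned char zero16[16] = {0};
    size_t off = 0;
    size_t stores = 0;

    assert(len % 8 == 0);

    /* Wide part: one 16-byte store per iteration. */
    while (len - off >= 16) {
        memcpy(dst + off, zero16, 16);
        off += 16;
        stores++;
    }

    /* Tail: at most one pointer-sized (8-byte) store remains. */
    if (len - off == 8) {
        memcpy(dst + off, zero16, 8);
        off += 8;
        stores++;
    }

    return stores;
}
```

For a 72-byte region this issues 5 stores (four wide, one narrow), where zeroing one 8-byte slot at a time would take 9.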

That sounds good/reasonable to me. It will also likely improve perf of various methods that are attempting to utilize things like Span/ValueStringBuilder as a perf optimization.

tannergooding on Jul 18, 2019

I’m going to mark this as future – nothing obvious for 3.0 jumps out here. I’ll keep looking though.

AndyAyersMS on Jul 17, 2019