runtime: Performance regression: float.ToString(format) 20% to 3x slower

After the changes introduced to number formatting, the vast majority of operations are faster. However, this is not true for float.ToString(format), which is 20% to 3x slower.

https://github.com/dotnet/performance/blob/1b4c089465cde5d8a6454a2e53b37f662f3964b0/src/benchmarks/micro/corefx/System.Runtime/Perf.Single.cs#L42-L45

Repro

git clone https://github.com/dotnet/performance.git
cd performance
# if you don't have cli installed and want python script to download the latest cli for you
py .\scripts\benchmarks_ci.py -f netcoreapp2.2 netcoreapp3.0 --filter System.Tests.Perf_Single.ToStringWithFormat
# if you do
dotnet run -p .\src\benchmarks\micro\MicroBenchmarks.csproj -c Release -f netcoreapp2.2 --filter System.Tests.Perf_Single.ToStringWithFormat --runtimes netcoreapp2.2 netcoreapp3.0
BenchmarkDotNet=v0.11.3.1003-nightly, OS=Windows 10.0.18362
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview8-013262
   [Host]     : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT
   Job-BYJCMJ : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT
   Job-JSSCYO : .NET Core 3.0.0-preview8-27916-02 (CoreCLR 4.700.19.36302, CoreFX 4.700.19.36514), 64bit RyuJIT
| Method | Toolchain | value | format | Mean | Ratio | Allocated Memory/Op |
|---|---|---|---|---|---|---|
| ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | E | 154.8 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | E | 193.7 ns | 1.25 | 56 B |
| ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | F50 | 447.2 ns | 1.00 | 208 B |
| ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | F50 | 1,475.5 ns | 3.31 | 208 B |
| ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | G | 152.7 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | G | 215.2 ns | 1.41 | 56 B |
| ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | G17 | 160.2 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | G17 | 238.2 ns | 1.49 | 72 B |
| ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | R | 245.7 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | R | 216.4 ns | 0.88 | 56 B |
| ToStringWithFormat | netcoreapp2.2 | 12345 | E | 166.6 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | 12345 | E | 213.2 ns | 1.28 | 48 B |
| ToStringWithFormat | netcoreapp2.2 | 12345 | F50 | 318.9 ns | 1.00 | 144 B |
| ToStringWithFormat | netcoreapp3.0 | 12345 | F50 | 448.9 ns | 1.41 | 136 B |
| ToStringWithFormat | netcoreapp2.2 | 12345 | G | 146.6 ns | 1.00 | 40 B |
| ToStringWithFormat | netcoreapp3.0 | 12345 | G | 183.4 ns | 1.25 | 32 B |
| ToStringWithFormat | netcoreapp2.2 | 12345 | G17 | 161.9 ns | 1.00 | 40 B |
| ToStringWithFormat | netcoreapp3.0 | 12345 | G17 | 349.4 ns | 2.16 | 32 B |
| ToStringWithFormat | netcoreapp2.2 | 12345 | R | 172.8 ns | 1.00 | 40 B |
| ToStringWithFormat | netcoreapp3.0 | 12345 | R | 185.1 ns | 1.07 | 32 B |
| ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | E | 149.5 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | E | 188.5 ns | 1.26 | 48 B |
| ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | F50 | 437.2 ns | 1.00 | 208 B |
| ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | F50 | 1,523.3 ns | 3.48 | 208 B |
| ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | G | 151.5 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | G | 212.8 ns | 1.40 | 48 B |
| ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | G17 | 157.9 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | G17 | 237.0 ns | 1.50 | 72 B |
| ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | R | 243.0 ns | 1.00 | 56 B |
| ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | R | 213.8 ns | 0.88 | 48 B |

/cc @danmosemsft @tannergooding

category:cq theme:floating-point skill-level:expert cost:large

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 46 (46 by maintainers)

Most upvoted comments

We’re in the prolog, so making calls is problematic. It requires special care and usually some kind of bespoke calling convention (some native compilers do this for stack checks, for instance).

There’s nothing blocking us from generating different code to zero the slots. We are not GC live at this point so can use whatever instructions will work. But we also would like to minimize the set of registers used and any shuffling needed around the zeroing sequence. For example REP STOS needs RCX which is usually live at this point – so almost certainly the current heuristic is underestimating the cost of this kind of loop. It is worse on SysV.

So I think the way forward is

  • review heuristics for when to generate block stores for prolog zeroing, especially cases when there are structs with both GC and non-GC fields. Likely we should be generating inline sequences for more cases than we do now.
  • enable use of wider stores (say via XMM) to zero slots (which can decrease cost of inline sequences and make their use even more widespread) and update the heuristic accordingly.
  • reconsider whether all these stack slots really need to be untracked lifetimes. In some cases we might be able to defer the zeroing until sometime later in the method, or avoid it altogether.

The extra prolog costs that come up because a struct is used somewhere in the method can make it hard to reason about struct perf.

GCC simply reuses the remainder returned by the DIV instruction.
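
As a rough illustration of that pattern (not the code under discussion), here is a minimal C sketch of a digit-extraction loop that needs both the quotient and the remainder of the same operands. The function name `to_digits` and its parameters are hypothetical. With a runtime divisor, an optimizing compiler can typically satisfy both expressions with a single x86 DIV, which produces the quotient and the remainder together; whether a given compiler actually does so depends on its version and flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: writes the digits of `value` in the given base into
 * `buf`, least significant digit first, and returns the digit count.
 * `buf` must have room for 32 characters (worst case: base 2). */
static int to_digits(uint32_t value, uint32_t base, char *buf)
{
    assert(base >= 2 && base <= 10);

    int n = 0;
    do {
        /* Same operands for both operations: one hardware divide can
         * provide the quotient and the remainder. */
        uint32_t quotient = value / base;
        uint32_t remainder = value % base;

        buf[n++] = (char)('0' + remainder);
        value = quotient;
    } while (value != 0);

    return n;
}
```

Calling `to_digits(12345u, 10u, buf)` fills `buf` with the five digits in reverse order, so a caller would read them back to front.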

My mistake, looks like it did pick it up

G_M61719_IG01:
       push     rbp
       push     r15
       push     r14
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 168
       movaps   qword ptr [rsp+90H], xmm6
       lea      rbp, [rsp+30H]
       xorps    xmm4, xmm4
       movaps   xmmword ptr [rbp+10H], xmm4
       movaps   xmmword ptr [rbp+20H], xmm4
       movaps   xmmword ptr [rbp+30H], xmm4
       movaps   xmmword ptr [rbp+40H], xmm4
       xor      rax, rax
       mov      qword ptr [rbp+50H], rax
       mov      rax, qword ptr [(reloc)]
       mov      qword ptr [rbp+08H], rax
       mov      rbx, rcx
       mov      rsi, r8
       mov      rdi, r9
       movaps   xmm6, xmm1
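
The prolog above zeroes the 72 bytes from [rbp+10H] up to [rbp+58H) with four 16-byte movaps stores and one 8-byte mov. A minimal C sketch of that arithmetic follows; the helper `zeroing_stores` is hypothetical and only counts stores, it is not how the JIT estimates cost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of store instructions needed to zero `bytes`
 * bytes when the widest available store is `width` bytes (8 or 16) and any
 * leftover 8 bytes are covered by one pointer-sized store.
 * Assumes `bytes` is a multiple of 8, as stack slots are on x64. */
static size_t zeroing_stores(size_t bytes, size_t width)
{
    assert(bytes % 8 == 0);
    assert(width == 8 || width == 16);

    size_t wide_stores = bytes / width;        /* full-width stores */
    size_t tail_stores = (bytes % width) / 8;  /* at most one 8-byte store */
    return wide_stores + tail_stores;
}
```

For the 72 bytes in the listing, `zeroing_stores(72, 16)` gives 5 stores, compared with 9 for `zeroing_stores(72, 8)`.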

That sounds good/reasonable to me. It will also likely improve perf of various methods that are attempting to utilize things like Span/ValueStringBuilder as a perf optimization.

I’m going to mark this as future – nothing obvious for 3.0 jumps out here. I’ll keep looking though.