runtime: Performance regression: float.ToString(format) 20% to 3x slower
After the changes introduced to number formatting, the vast majority of operations are faster. However, this is not true for float.ToString(format),
which is 20% to 3x slower.
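For context, these are the call shapes the benchmark exercises (a minimal sketch; the values and formats are the ones that appear in the results table below, and the ratios in the comments are taken from that table):

```csharp
float value = -3.402823E+38f;

string e   = value.ToString("E");    // scientific; ~1.25x slower on 3.0 in the table below
string f50 = value.ToString("F50");  // fixed-point, 50 digits; ~3.3x slower
string g   = value.ToString("G");    // general; ~1.4x slower
string g17 = value.ToString("G17");  // 17 significant digits; ~1.5x slower (2.16x for 12345)
string r   = value.ToString("R");    // roundtrip; slightly faster on 3.0
```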
Repro
git clone https://github.com/dotnet/performance.git
cd performance
# if you don't have the .NET CLI installed and want the python script to download the latest CLI for you
py .\scripts\benchmarks_ci.py -f netcoreapp2.2 netcoreapp3.0 --filter System.Tests.Perf_Single.ToStringWithFormat
# if you already have the .NET CLI installed
dotnet run -p .\src\benchmarks\micro\MicroBenchmarks.csproj -c Release -f netcoreapp2.2 --filter System.Tests.Perf_Single.ToStringWithFormat --runtimes netcoreapp2.2 netcoreapp3.0
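If you just want a quick sanity check without the harness, a rough loop like the one below (run once against each target framework) shows the same shape. This is only an informal sketch, not a replacement for the BenchmarkDotNet numbers:

```csharp
// Informal timing of the same ToString calls; compare the output under
// netcoreapp2.2 and netcoreapp3.0 builds of this program.
using System;
using System.Diagnostics;

class QuickRepro
{
    static void Main()
    {
        float value = -3.402823E+38f;
        const int iterations = 1_000_000;

        foreach (string format in new[] { "E", "F50", "G", "G17", "R" })
        {
            value.ToString(format); // warm up

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                value.ToString(format);
            }
            sw.Stop();

            Console.WriteLine($"{format}: {sw.Elapsed.TotalMilliseconds * 1_000_000 / iterations:F1} ns/op");
        }
    }
}
```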
BenchmarkDotNet=v0.11.3.1003-nightly, OS=Windows 10.0.18362
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview8-013262
[Host] : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT
Job-BYJCMJ : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT
Job-JSSCYO : .NET Core 3.0.0-preview8-27916-02 (CoreCLR 4.700.19.36302, CoreFX 4.700.19.36514), 64bit RyuJIT
Method | Toolchain | value | format | Mean | Ratio | Allocated Memory/Op |
---|---|---|---|---|---|---|
ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | E | 154.8 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | E | 193.7 ns | 1.25 | 56 B |
ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | F50 | 447.2 ns | 1.00 | 208 B |
ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | F50 | 1,475.5 ns | 3.31 | 208 B |
ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | G | 152.7 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | G | 215.2 ns | 1.41 | 56 B |
ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | G17 | 160.2 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | G17 | 238.2 ns | 1.49 | 72 B |
ToStringWithFormat | netcoreapp2.2 | -3,402823E+38 | R | 245.7 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | -3,402823E+38 | R | 216.4 ns | 0.88 | 56 B |
ToStringWithFormat | netcoreapp2.2 | 12345 | E | 166.6 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | 12345 | E | 213.2 ns | 1.28 | 48 B |
ToStringWithFormat | netcoreapp2.2 | 12345 | F50 | 318.9 ns | 1.00 | 144 B |
ToStringWithFormat | netcoreapp3.0 | 12345 | F50 | 448.9 ns | 1.41 | 136 B |
ToStringWithFormat | netcoreapp2.2 | 12345 | G | 146.6 ns | 1.00 | 40 B |
ToStringWithFormat | netcoreapp3.0 | 12345 | G | 183.4 ns | 1.25 | 32 B |
ToStringWithFormat | netcoreapp2.2 | 12345 | G17 | 161.9 ns | 1.00 | 40 B |
ToStringWithFormat | netcoreapp3.0 | 12345 | G17 | 349.4 ns | 2.16 | 32 B |
ToStringWithFormat | netcoreapp2.2 | 12345 | R | 172.8 ns | 1.00 | 40 B |
ToStringWithFormat | netcoreapp3.0 | 12345 | R | 185.1 ns | 1.07 | 32 B |
ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | E | 149.5 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | E | 188.5 ns | 1.26 | 48 B |
ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | F50 | 437.2 ns | 1.00 | 208 B |
ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | F50 | 1,523.3 ns | 3.48 | 208 B |
ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | G | 151.5 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | G | 212.8 ns | 1.40 | 48 B |
ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | G17 | 157.9 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | G17 | 237.0 ns | 1.50 | 72 B |
ToStringWithFormat | netcoreapp2.2 | 3,402823E+38 | R | 243.0 ns | 1.00 | 56 B |
ToStringWithFormat | netcoreapp3.0 | 3,402823E+38 | R | 213.8 ns | 0.88 | 48 B |
/cc @danmosemsft @tannergooding
category:cq theme:floating-point skill-level:expert cost:large
About this issue
- State: open
- Created 5 years ago
- Comments: 46 (46 by maintainers)
We’re in the prolog, so making calls is problematic. It requires special care and usually some kind of bespoke calling convention (some native compilers do this for stack checks, for instance).
There’s nothing blocking us from generating different code to zero the slots. We are not GC live at this point, so we can use whatever instructions will work. But we also would like to minimize the set of registers used and any shuffling needed around the zeroing sequence. For example, REP STOS needs RCX, which is usually live at this point – so almost certainly the current heuristic is underestimating the cost of this kind of loop. It is worse on SysV.
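To make the prolog-zeroing point concrete, here is a hypothetical formatting helper in the same shape as ValueStringBuilder-style code (the FormatSketch/FormatValue names are made up for illustration, not the actual runtime code). The stackalloc'd scratch buffer is what the prolog has to clear before a single digit is written:

```csharp
using System;

static class FormatSketch
{
    public static string FormatValue(float value, string format)
    {
        // 256 chars = 512 bytes of stack that the prolog zeroes up front,
        // even when only a handful of characters end up being written.
        Span<char> buffer = stackalloc char[256];
        return value.TryFormat(buffer, out int written, format)
            ? new string(buffer.Slice(0, written))
            : value.ToString(format);
    }
}
```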
So I think the way forward is
The extra prolog costs that come up because a struct is used somewhere in the method can make it hard to reason about struct perf.
GCC simply reuses the remainder returned by the DIV instruction.
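As an illustration of that pattern in a digit-generation loop (the DigitSketch name and code are hypothetical, not the actual Number.Formatting implementation): compute the quotient once and derive the remainder from it instead of issuing a second divide, which is what lets a compiler fold both into a single DIV:

```csharp
using System;

static class DigitSketch
{
    public static string ToDecimalDigits(uint value)
    {
        Span<char> buffer = stackalloc char[10]; // uint.MaxValue has 10 digits
        int pos = buffer.Length;
        do
        {
            uint quotient = value / 10;
            uint digit = value - quotient * 10; // remainder derived from the same division
            buffer[--pos] = (char)('0' + digit);
            value = quotient;
        } while (value != 0);
        return new string(buffer.Slice(pos));
    }
}
```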
My mistake, it looks like it did pick it up.
That sounds good/reasonable to me. It will also likely improve perf of various methods that are attempting to utilize things like Span/ValueStringBuilder as a perf optimization.
I’m going to mark this as future – nothing obvious for 3.0 jumps out here. I’ll keep looking though.