runtime: JIT Performance regression between 1.1 and 2.0 and between 2.0 and 2.1

I’ve been trying to come up with a simple repro for a perf regression I’ve been fighting, and I think I have it trimmed down as much as possible while still showing the real-world impact. What I have here is a stripped-down version of my Blake2b hashing algorithm implementation.

The mixSimplified method here shows the problem. Code size has grown with each new JIT since 1.1, and performance has dropped.

Runtime	x86 Code Size	x64 Code Size
netcoreapp1.1	1534	577
netcoreapp2.0	2906	602
netcoreapp2.1	2956	604
netcoreapp3.1	3057	632

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18362
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
  [Host]        : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.13 (CoreCLR 4.6.27618.02, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.14 (CoreCLR 4.6.28207.04, CoreFX 4.6.28208.01), 64bit RyuJIT
  netcoreapp3.1 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT

Runtime	Platform	Mean	Error	StdDev	Scaled
netcoreapp1.1	X64	1.002 ms	0.0090 ms	0.0084 ms	1.00
netcoreapp2.0	X64	1.039 ms	0.0182 ms	0.0152 ms	1.04
netcoreapp2.1	X64	1.039 ms	0.0200 ms	0.0178 ms	1.04
netcoreapp3.1	X64	1.154 ms	0.0222 ms	0.0256 ms	1.15

netcoreapp1.1	X86	2.944 ms	0.0154 ms	0.0144 ms	1.00
netcoreapp2.0	X86	5.780 ms	0.0375 ms	0.0313 ms	1.96
netcoreapp2.1	X86	6.545 ms	0.0329 ms	0.0292 ms	2.22
netcoreapp3.1	X86	6.768 ms	0.0422 ms	0.0374 ms	2.30

JitDisasm and JitDump output for all versions are here: simplifiedhash_jitdumps.zip

category:cq theme:register-allocator skill-level:expert cost:large

About this issue

State: open
Created 6 years ago
Comments: 31 (30 by maintainers)

Most upvoted comments

json-benchmark (Java & .NET) Java’s DSL-JSON is so fast …

The benchmark includes almost all of the fastest .NET JSON libraries, like Jil and NetJson… Hardly any .NET JSON library could do better.

When will the CLR be faster than the JVM?

sgf on Oct 4, 2018

I’ve looked a bit more into those null checks and the reason they’re not removed is pretty hilarious:

    [ 0] 161 (0x0a1) ldarg.0
    [ 1] 162 (0x0a2) ldflda 0400000A
    [ 1] 167 (0x0a7) ldflda 04000014
    [ 1] 172 (0x0ac) ldc.i4.2 2              ; constant 
    [ 2] 173 (0x0ad) conv.i                   ; constant 
    [ 2] 174 (0x0ae) ldc.i4.8 8              ; constant 
    [ 3] 175 (0x0af) mul                       ; constant 
    [ 2] 176 (0x0b0) add
    [ 1] 177 (0x0b1) ldind.i8
    [ 1] 178 (0x0b2) stloc.s 10

So the C# compiler somehow manages to emit a constant tree and the JIT gets rid of it too late:

fgMorphTree BB01, STMT00018 (before)
               [000183] -A-XG-------              *  ASG       long  
               [000182] D------N----              +--*  LCL_VAR   long   V12 loc10        
               [000181] *--XG-------              \--*  IND       long  
               [000180] ---XG-------                 \--*  ADD       long  
               [000174] ---XG-------                    +--*  ADDR      long  
               [000173] ---XG-------                    |  \--*  FIELD     long   FixedElementField
               [000172] ---XG-------                    |     \--*  ADDR      long  
               [000171] ---XG-------                    |        \--*  FIELD     struct h
               [000170] ------------                    |           \--*  LCL_VAR   long   V00 arg0         
               [000179] ------------                    \--*  MUL       long  
               [000176] ------------                       +--*  CAST      long <- int
               [000175] ------------                       |  \--*  CNS_INT   int    2
               [000178] ------------                       \--*  CAST      long <- int
               [000177] ------------                          \--*  CNS_INT   int    8

Before calling fgAddFieldSeqForZeroOffset:
               [000173] ---XG--N----              *  IND       long  
               [000172] ---XG-------              \--*  ADDR      long  
               [000171] ---XG-------                 \--*  FIELD     struct h
               [000170] ------------                    \--*  LCL_VAR   long   V00 arg0         

fgAddFieldSeqForZeroOffset for Fseq[FixedElementField]
addr (Before)
               [000172] ---XG-------                ADDR      long  
     (After)
               [000172] ---XG-------                ADDR      long   Zero Fseq[FixedElementField]
Before explicit null check morphing:
               [000171] ---XG--N----              *  FIELD     struct h
               [000170] ------------              \--*  LCL_VAR   long   V00 arg0         
After adding explicit null check:
               [000171] ---XG--N----              *  IND       struct
               [000694] ---X--------              \--*  COMMA     long  
               [000690] ---X---N----                 +--*  NULLCHECK byte  
               [000689] ------------                 |  \--*  LCL_VAR   long   V00 arg0

The reason for the null check is mac->m_allConstantOffsets being false due to that constant tree that’s not yet a CNS_INT.
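To make the ordering problem concrete, here is a toy C++ model of it. It is a sketch under stated assumptions, not the JIT’s code: `Node`, `make`, `isConstantOffset`, `fold` and `buildOffsetTree` are hypothetical names, and the real check lives in the morph address context (`mac->m_allConstantOffsets`). The model only shows that a test which recognizes literal constants rejects `MUL(CAST(2), CAST(8))` until something folds it.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Toy expression node. This is NOT the JIT's GenTree; it only models the
// node shapes that appear in the dump above.
enum class Op { CnsInt, Cast, Mul, Add };

struct Node;
using NodePtr = std::unique_ptr<Node>;

struct Node {
    Op op;
    long long value;  // meaningful only when op == Op::CnsInt
    NodePtr lhs;      // operand of Cast, left operand of Mul/Add
    NodePtr rhs;      // right operand of Mul/Add
};

// Hypothetical helper that allocates one node.
NodePtr make(Op op, long long value = 0, NodePtr lhs = nullptr, NodePtr rhs = nullptr) {
    return NodePtr(new Node{op, value, std::move(lhs), std::move(rhs)});
}

// Stand-in for the "is this offset constant?" test: it accepts only a literal
// constant node, not a tree that could be folded into one.
bool isConstantOffset(const Node& n) {
    return n.op == Op::CnsInt;
}

// Bottom-up constant folding. Assumes a well-formed tree: Cast has lhs,
// Mul/Add have both operands.
NodePtr fold(NodePtr n) {
    if (n->lhs) n->lhs = fold(std::move(n->lhs));
    if (n->rhs) n->rhs = fold(std::move(n->rhs));
    if (n->op == Op::Cast && n->lhs->op == Op::CnsInt)
        return make(Op::CnsInt, n->lhs->value);
    bool binary = n->op == Op::Mul || n->op == Op::Add;
    if (binary && n->lhs->op == Op::CnsInt && n->rhs->op == Op::CnsInt) {
        long long a = n->lhs->value, b = n->rhs->value;
        return make(Op::CnsInt, n->op == Op::Mul ? a * b : a + b);
    }
    return n;
}

// The offset operand from the dump: MUL(CAST(CNS_INT 2), CAST(CNS_INT 8)).
NodePtr buildOffsetTree() {
    return make(Op::Mul, 0,
                make(Op::Cast, 0, make(Op::CnsInt, 2)),
                make(Op::Cast, 0, make(Op::CnsInt, 8)));
}

void demo() {
    NodePtr offset = buildOffsetTree();
    assert(!isConstantOffset(*offset));  // unfolded: looks non-constant
    offset = fold(std::move(offset));
    assert(isConstantOffset(*offset));   // folded: a single constant
    assert(offset->value == 16);         // 2 * 8 bytes
}
```

In this model, running the check before `fold` sees a non-constant offset, while running it afterwards sees the constant 16. That is the ordering issue the dump suggests: the constant tree is still unfolded when the null-check decision is made.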

I haven’t checked why this worked before 3.0, nor whether there’s a quick fix. It looks to me like the main reason for the MAC’s mess is the existence of ADDR nodes so… back to “get rid of ADDR” work.

The generated code also contains a bunch of extra LEAs that I suspect my trivial forward substitution experiment may eliminate; I’ll check that another time.

mikedn on Dec 4, 2019

The x86 issue is the same as dotnet/runtime#8846, and is something I plan to be working on soon (see dotnet/runtime#9399). The former is a loop, and may well need additional work beyond what’s proposed in dotnet/runtime#9399, but it suffers from the same fundamental issue that the register allocator will use a bad register that’s free (even one that will immediately have to be spilled) rather than spill a register that’s occupied.
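The trade-off described here can be sketched with a toy cost model in C++. This is an illustration only, not LSRA’s actual heuristics and not the approach proposed in dotnet/runtime#9399: `Reg`, `Interval`, `costOf`, `pickFreeFirst` and `pickCheapest` are hypothetical names, and the spill weights are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy register state (hypothetical, not LSRA's data structures).
struct Reg {
    bool occupied;       // does a live value currently sit in this register?
    int  occupantWeight; // made-up cost of spilling that value (0 if free)
    int  nextKill;       // position at which the register gets clobbered
};

// The value that needs a register now and stays live until 'end'.
struct Interval {
    int end;
    int weight;          // made-up cost of spilling this value
};

// Cost of placing 'iv' in 'r': evict the occupant if there is one, and spill
// 'iv' itself if the register is clobbered before 'iv' ends.
int costOf(const Reg& r, const Interval& iv) {
    int cost = r.occupied ? r.occupantWeight : 0;
    if (r.nextKill < iv.end) cost += iv.weight;
    return cost;
}

// Policy B: pick the register with the lowest total cost, free or not.
// Assumes 'regs' is non-empty.
std::size_t pickCheapest(const std::vector<Reg>& regs, const Interval& iv) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < regs.size(); ++i)
        if (costOf(regs[i], iv) < costOf(regs[best], iv)) best = i;
    return best;
}

// Policy A: take the first free register no matter how soon it is clobbered;
// fall back to the cheapest register only when nothing is free.
std::size_t pickFreeFirst(const std::vector<Reg>& regs, const Interval& iv) {
    for (std::size_t i = 0; i < regs.size(); ++i)
        if (!regs[i].occupied) return i;
    return pickCheapest(regs, iv);
}
```

With a free register that is clobbered at position 2 (`{false, 0, 2}`), an occupied register holding a cheap value (`{true, 1, 100}`), and a new interval `{10, 8}`, the free-first policy picks the free register at a cost of 8, while the cost-based policy evicts the cheap occupant at a cost of 1.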

CarolEidt on Dec 4, 2019