runtime: JIT Performance regression between 1.1 and 2.0 and between 2.0 and 2.1
I’ve been trying to come up with a simple repro for a perf regression I’ve been fighting, and I think I have it trimmed down as much as possible while still showing the real-world impact. What I have here is a stripped-down version of my Blake2b hashing algorithm implementation.
The `mixSimplified` method here shows the problem. Code size has grown with each new JIT since 1.1, and performance has dropped.
| Runtime | x86 Code Size | x64 Code Size |
|---|---|---|
| netcoreapp1.1 | 1534 | 577 |
| netcoreapp2.0 | 2906 | 602 |
| netcoreapp2.1 | 2956 | 604 |
| netcoreapp3.1 | 3057 | 632 |
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18362
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
[Host] : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
netcoreapp1.1 : .NET Core 1.1.13 (CoreCLR 4.6.27618.02, CoreFX 4.6.24705.01), 64bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.14 (CoreCLR 4.6.28207.04, CoreFX 4.6.28208.01), 64bit RyuJIT
netcoreapp3.1 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
| Runtime | Platform | Mean | Error | StdDev | Scaled |
|---|---|---|---|---|---|
| netcoreapp1.1 | X64 | 1.002 ms | 0.0090 ms | 0.0084 ms | 1.00 |
| netcoreapp2.0 | X64 | 1.039 ms | 0.0182 ms | 0.0152 ms | 1.04 |
| netcoreapp2.1 | X64 | 1.039 ms | 0.0200 ms | 0.0178 ms | 1.04 |
| netcoreapp3.1 | X64 | 1.154 ms | 0.0222 ms | 0.0256 ms | 1.15 |
| netcoreapp1.1 | X86 | 2.944 ms | 0.0154 ms | 0.0144 ms | 1.00 |
| netcoreapp2.0 | X86 | 5.780 ms | 0.0375 ms | 0.0313 ms | 1.96 |
| netcoreapp2.1 | X86 | 6.545 ms | 0.0329 ms | 0.0292 ms | 2.22 |
| netcoreapp3.1 | X86 | 6.768 ms | 0.0422 ms | 0.0374 ms | 2.30 |
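As a sanity check on the Scaled column, the ratios can be recomputed from the Mean values. This is only a sketch: it assumes Scaled is each mean divided by the netcoreapp1.1 mean for the same platform, which may not be exactly how BenchmarkDotNet derives it internally. The names `means_ms` and `scaled` are made up for illustration.

```python
# Mean times in milliseconds, copied from the benchmark table above.
means_ms = {
    "X64": {
        "netcoreapp1.1": 1.002,
        "netcoreapp2.0": 1.039,
        "netcoreapp2.1": 1.039,
        "netcoreapp3.1": 1.154,
    },
    "X86": {
        "netcoreapp1.1": 2.944,
        "netcoreapp2.0": 5.780,
        "netcoreapp2.1": 6.545,
        "netcoreapp3.1": 6.768,
    },
}


def scaled(platform_means, baseline="netcoreapp1.1"):
    """Return each runtime's mean divided by the baseline mean, to 2 decimals."""
    base = platform_means[baseline]
    return {rt: round(mean / base, 2) for rt, mean in platform_means.items()}


# Assumption: the baseline is per platform, as the 1.00 rows in the table suggest.
ratios = {platform: scaled(m) for platform, m in means_ms.items()}

for platform, by_runtime in ratios.items():
    for runtime, ratio in by_runtime.items():
        print(f"{platform} {runtime}: {ratio:.2f}")
```

Under that assumption the recomputed ratios agree with the table to two decimals, so the x86 slowdown from 1.1 to 3.1 is roughly 2.3x while x64 is about 1.15x.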
JitDisasm and JitDump output for all versions are here: simplifiedhash_jitdumps.zip
category:cq theme:register-allocator skill-level:expert cost:large
About this issue
- State: open
- Created 6 years ago
- Comments: 31 (30 by maintainers)
json-benchmark (Java & .NET): Java’s DSL-JSON is so fast …
The benchmark includes almost all of the fastest .NET JSON libraries, such as Jil and NetJson… Hardly any .NET JSON library scores higher.
When will the CLR be faster than the JVM?
I’ve looked a bit more into those null checks and the reason they’re not removed is pretty hilarious:
So the C# compiler somehow manages to emit a constant tree and the JIT gets rid of it too late:
The reason for the null check is `mac->m_allConstantOffsets` being `false` due to that constant tree that’s not yet a `CNS_INT`.
I haven’t checked how come this worked before 3.0 nor if there’s any quick fix for this. It looks to me that the main reason for the MAC’s mess is the existence of `ADDR` nodes so… back to “get rid of ADDR” work.
The generated code also contains a bunch of extra `LEA`s that I suspect my trivial forward substitution experiment may eliminate, I’ll check that another time.
The x86 issue is the same as dotnet/runtime#8846, and is something I plan to be working on soon (see dotnet/runtime#9399). The former is a loop, and may well need additional work beyond what’s proposed in dotnet/runtime#9399, but it suffers from the same fundamental issue that the register allocator will use a bad register that’s free (even one that will immediately have to be spilled) rather than spill a register that’s occupied.
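The register-choice problem described in the last paragraph can be illustrated with a toy model. This is not the JIT's actual allocator: the `Reg` fields, the weights, and both policy functions are hypothetical and exist only to show why "take any free register" can lose to "spill a cold occupant". The register names are illustrative too.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reg:
    name: str
    occupant_weight: Optional[float]  # None means the register is currently free
    must_vacate_at: Optional[int]     # position where something else claims it, if any


def spill_cost(reg: Reg, new_weight: float, new_end: int) -> float:
    """Estimated cost of giving `reg` to a new value that lives until `new_end`."""
    cost = 0.0
    if reg.occupant_weight is not None:
        cost += reg.occupant_weight   # the current occupant has to be spilled
    if reg.must_vacate_at is not None and reg.must_vacate_at < new_end:
        cost += new_weight            # the new value gets spilled before it dies
    return cost


def pick_free_first(regs: List[Reg], new_weight: float, new_end: int) -> Reg:
    """Toy version of the problematic policy: any free register wins outright."""
    free = [r for r in regs if r.occupant_weight is None]
    if free:
        return free[0]
    return min(regs, key=lambda r: spill_cost(r, new_weight, new_end))


def pick_cost_aware(regs: List[Reg], new_weight: float, new_end: int) -> Reg:
    """Toy alternative: compare estimated spill cost across all registers."""
    return min(regs, key=lambda r: spill_cost(r, new_weight, new_end))


# Hypothetical situation: ecx is free but will be claimed at position 12,
# while esi holds a cold value (weight 1.0) that nothing else needs soon.
regs = [
    Reg("ecx", occupant_weight=None, must_vacate_at=12),
    Reg("esi", occupant_weight=1.0, must_vacate_at=None),
]

# A hot value (weight 8.0, e.g. inside a loop) that must live until position 40.
free_first_choice = pick_free_first(regs, new_weight=8.0, new_end=40)
cost_aware_choice = pick_cost_aware(regs, new_weight=8.0, new_end=40)

print("free-first picks:", free_first_choice.name)
print("cost-aware picks:", cost_aware_choice.name)
```

In this made-up scenario the free-first policy puts the hot value in the register it will soon be evicted from, while the cost comparison prefers spilling the cold occupant once. A real allocator weighs many more factors than this.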