runtime: JIT fails to inline methods called from a large/complex outer method
I’m seeing a significant perf regression between .NET Core 2.1 and 3.0 for this complex hashing function: https://github.com/saucecontrol/Blake2Fast/blob/a140aaba46a8aa3003303c4aef62a472dd8ab4a3/src/Blake2Fast/Blake2bSse4.cs#L186-L503
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929685 Hz, Resolution=341.3336 ns, Timer=TSC
.NET Core SDK=3.0.100-preview-010184
[Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
netcoreapp3.0 : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT
Jit=RyuJit Toolchain=Default
| Method | Job | Platform | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|
| Blake2bFast | netcoreapp2.1 | X64 | 11.79 ms | 0.1075 ms | 0.1005 ms | 0 B |
| Blake2bFast | netcoreapp3.0 | X64 | 52.02 ms | 0.6522 ms | 0.6101 ms | 0 B |
| Blake2bFast | netcoreapp2.1 | X86 | 16.14 ms | 0.1888 ms | 0.1766 ms | 0 B |
| Blake2bFast | netcoreapp3.0 | X86 | 80.83 ms | 2.0971 ms | 6.1834 ms | 0 B |
Looking at the generated assembly, RyuJIT appears to produce good code until about midway through the third round of mixing, at which point it switches from inlining the mixing functions (g1 and g2) to emitting calls to them.
I had to make some changes to the 3.0 version of the code, but those amounted to nothing more than replacing the older StaticCast calls with the new As variants. To confirm the codegen around the API changes wasn’t the issue, I commented out rounds 3-12 in the main function, and performance returned to 2.1 levels. So it appears the complexity of the main function is the issue.
I spotted https://github.com/dotnet/coreclr/pull/21893 which mentions a change to the inlining budget allowed by the JIT, but I wasn’t able to find where the ‘budget’ logic was added or whether any part of it was included in 2.1. In 3.0 Preview 1 this function is about 10x slower than in 2.1, and with the nightly builds the gap dropped to 4-5x, so the two seem to be related somehow.
Obviously in this case, the performance implications of ignoring the MethodImplOptions.AggressiveInlining hints are quite catastrophic, so I’m wondering if there is a way to truly force inlining or to adjust the budget the JIT allows. The function’s complexity can’t be reduced, and because of the heavy use of intrinsics, the generated code is a small fraction of what it appears to be in C# or IL.
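For readers unfamiliar with the pattern under discussion, here is a minimal sketch of small mixing helpers marked with MethodImplOptions.AggressiveInlining and called from an outer method. It is not the Blake2Fast code: the class name MixSketch, the scalar ulong signatures, and the MixColumn method are hypothetical stand-ins for the vectorized versions, and only the arithmetic follows the BLAKE2b G function as specified in RFC 7693. The attribute is a request to the JIT rather than a guarantee, which is the behavior this issue is about.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical scalar stand-in for the vectorized g1/g2 helpers in the issue.
public static class MixSketch
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static ulong RotateRight(ulong x, int n) => (x >> n) | (x << (64 - n));

    // First half of the BLAKE2b G function (rotations by 32 and 24).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void G1(ref ulong a, ref ulong b, ref ulong c, ref ulong d, ulong m)
    {
        a = a + b + m;              // unchecked arithmetic wraps modulo 2^64
        d = RotateRight(d ^ a, 32);
        c = c + d;
        b = RotateRight(b ^ c, 24);
    }

    // Second half of the BLAKE2b G function (rotations by 16 and 63).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void G2(ref ulong a, ref ulong b, ref ulong c, ref ulong d, ulong m)
    {
        a = a + b + m;
        d = RotateRight(d ^ a, 16);
        c = c + d;
        b = RotateRight(b ^ c, 63);
    }

    // Outer method: a full compression function repeats this pair many times
    // per round, so the inlined body of the caller grows quickly.
    public static void MixColumn(ref ulong a, ref ulong b, ref ulong c, ref ulong d,
                                 ulong m0, ulong m1)
    {
        G1(ref a, ref b, ref c, ref d, m0);
        G2(ref a, ref b, ref c, ref d, m1);
    }
}
```

With every helper inlined, MixColumn compiles to straight-line arithmetic; when the JIT declines the hint, each G1/G2 invocation becomes a real call with its arguments passed by reference, which is the kind of codegen change described above.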
@AndyAyersMS any ideas?
About this issue
- State: closed
- Created 5 years ago
- Comments: 22 (21 by maintainers)
Seems like a case for transforming those vector methods into extension methods.