runtime: JIT fails to inline methods called from a large/complex outer method

I’m seeing a significant perf regression between .NET Core 2.1 and 3.0 for this complex hashing function: https://github.com/saucecontrol/Blake2Fast/blob/a140aaba46a8aa3003303c4aef62a472dd8ab4a3/src/Blake2Fast/Blake2bSse4.cs#L186-L503


BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929685 Hz, Resolution=341.3336 ns, Timer=TSC
.NET Core SDK=3.0.100-preview-010184
  [Host]        : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  netcoreapp3.0 : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT

Jit=RyuJit  Toolchain=Default

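     Method |           Job | Platform |     Mean |     Error |    StdDev | Allocated |
----------- |-------------- |--------- |---------:|----------:|----------:|----------:|
Blake2bFast | netcoreapp2.1 |      X64 | 11.79 ms | 0.1075 ms | 0.1005 ms |       0 B |
Blake2bFast | netcoreapp3.0 |      X64 | 52.02 ms | 0.6522 ms | 0.6101 ms |       0 B |
Blake2bFast | netcoreapp2.1 |      X86 | 16.14 ms | 0.1888 ms | 0.1766 ms |       0 B |
Blake2bFast | netcoreapp3.0 |      X86 | 80.83 ms | 2.0971 ms | 6.1834 ms |       0 B |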

Looking at the generated assembly, RyuJIT appears to produce good code until about midway through the third round of mixing, at which point it switches from inlining the mixing functions (g1 and g2) to calling them.

I had to make some changes to the 3.0 version of the code, but those amounted to nothing more than replacing the older StaticCast calls with the new As variants. To confirm the codegen around the API changes wasn’t the issue, I commented out rounds 3-12 in the main function, and performance returned to 2.1 levels. So it appears the complexity of the main function is the issue.

I spotted https://github.com/dotnet/coreclr/pull/21893 which mentions a change to the inlining budget allowed by the JIT, but I wasn’t able to find where the ‘budget’ logic was added or whether any part of it was included in 2.1. In 3.0 Preview 1 this function is about 10x slower than on 2.1, and with the nightly builds the delta dropped to 4-5x, so the budget change seems to be related somehow.

Obviously in this case, the performance implications of ignoring the MethodImplOptions.AggressiveInlining hints are quite catastrophic, so I’m wondering if there is a way to truly force inlining or to adjust the budget the JIT allows. The function’s complexity can’t be reduced, and because of the heavy use of intrinsics, the generated code is a small fraction of what it appears to be in C# or IL.
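For anyone who wants the shape of the problem without reading the full source, here is a minimal scalar sketch of the pattern. It is not the actual Blake2Fast code: the real helpers operate on SSE4.1 vectors, and the names InliningSketch, RotateRight, G1, G2 and the input values are hypothetical stand-ins. The point is only that small helpers marked AggressiveInlining are called many times from one large, fully unrolled method, and the hint is honored only while the JIT still has inlining budget left for that method.

```csharp
using System;
using System.Runtime.CompilerServices;

static class InliningSketch
{
    // Hypothetical scalar helper; the real code uses vector shuffles and shifts.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static ulong RotateRight(ulong x, int n) => (x >> n) | (x << (64 - n));

    // First half of the BLAKE2b mixing step (rotations by 32 and 24).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void G1(ref ulong a, ref ulong b, ref ulong c, ref ulong d, ulong m)
    {
        a += b + m;
        d = RotateRight(d ^ a, 32);
        c += d;
        b = RotateRight(b ^ c, 24);
    }

    // Second half of the BLAKE2b mixing step (rotations by 16 and 63).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void G2(ref ulong a, ref ulong b, ref ulong c, ref ulong d, ulong m)
    {
        a += b + m;
        d = RotateRight(d ^ a, 16);
        c += d;
        b = RotateRight(b ^ c, 63);
    }

    static void Main()
    {
        // Arbitrary illustrative state and message words (assumptions, not test vectors).
        ulong a = 1, b = 2, c = 3, d = 4;
        ulong m0 = 5, m1 = 6;

        // The real method repeats calls like these, unrolled, across all 12 rounds.
        // Every call site is an inline candidate that counts against the caller.
        G1(ref a, ref b, ref c, ref d, m0);
        G2(ref a, ref b, ref c, ref d, m1);
        G1(ref a, ref b, ref c, ref d, m1);
        G2(ref a, ref b, ref c, ref d, m0);

        Console.WriteLine($"{a:x16} {b:x16} {c:x16} {d:x16}");
    }
}
```

With only a handful of call sites, as in this sketch, I would expect everything to be inlined; the behavior described above only shows up once the outer method contains enough of these calls.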

@AndyAyersMS any ideas?

About this issue

State: closed
Created 5 years ago
Comments: 22 (21 by maintainers)

Most upvoted comments

Seems like a case for transforming those vector methods into extension methods.

mikedn on Jan 28, 2019