runtime: Performance regression of XXHash128 for large buffers with .NET 8

Hey, just wanted to double-check the performance of .NET 8 with XXHash128 from PR #77944. Running with a large 1 MB buffer seems to be significantly slower than the .NET 7 version now:

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT AVX2
  Job-YPKMNH : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  Job-XOTNUZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2


|          Method |  Runtime |          data |     Mean |    Error |   StdDev | Ratio |
|---------------- |--------- |-------------- |---------:|---------:|---------:|------:|
| XXHash128Native | .NET 7.0 | Byte[1048576] | 27.33 us | 0.016 us | 0.015 us |  1.00 |
|       XXHash128 | .NET 7.0 | Byte[1048576] | 31.67 us | 0.100 us | 0.093 us |  1.16 |
|       XXHash128 | .NET 8.0 | Byte[1048576] | 45.37 us | 0.165 us | 0.154 us |  1.66 |
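For reference, the benchmark is roughly the following (a sketch; class and method names here are illustrative, not the exact original benchmark code — `XxHash128` is `System.IO.Hashing.XxHash128`):

```csharp
using System;
using System.IO.Hashing;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class XxHash128Bench
{
    // 1 MB buffer, matching Byte[1048576] in the table above.
    private readonly byte[] _data = new byte[1024 * 1024];

    [GlobalSetup]
    public void Setup() => new Random(42).NextBytes(_data);

    [Benchmark]
    public byte[] XXHash128() => XxHash128.Hash(_data);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<XxHash128Bench>();
}
```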

I haven’t dug into why, but given the size of the drop, my first suspect would be reduced inlining.

About this issue

  • State: open
  • Created a year ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Cool, by generating a compatible AVX2 64-bit × 64-bit vectorized multiplication, I’m now getting 10% faster than the C++ version. That was it! 😎 I should have checked the generated ASM more thoroughly in the first place. 😅

Will try to prepare a PR later to add AVX2 + SSE2 paths and double-check whether I can also create an ARM64 version.
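For reference, the pre-AVX-512 emulation I mean is roughly the following (a sketch; the helper name is mine and I haven’t checked it against the final PR):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class VectorMul
{
    // Hypothetical helper: emulate lane-wise 64-bit multiply on AVX2, which has
    // no native vpmullq (that instruction requires AVX-512DQ). Standard trick:
    //   a * b mod 2^64 = aLo*bLo + ((aHi*bLo + aLo*bHi) << 32)
    // built from 32x32 -> 64 multiplies (vpmuludq).
    public static Vector256<ulong> MultiplyLow(Vector256<ulong> a, Vector256<ulong> b)
    {
        Vector256<ulong> aHi = Avx2.ShiftRightLogical(a, 32);
        Vector256<ulong> bHi = Avx2.ShiftRightLogical(b, 32);

        Vector256<ulong> lo     = Avx2.Multiply(a.AsUInt32(), b.AsUInt32());   // aLo*bLo
        Vector256<ulong> cross1 = Avx2.Multiply(aHi.AsUInt32(), b.AsUInt32()); // aHi*bLo
        Vector256<ulong> cross2 = Avx2.Multiply(a.AsUInt32(), bHi.AsUInt32()); // aLo*bHi

        Vector256<ulong> cross = Avx2.ShiftLeftLogical(Avx2.Add(cross1, cross2), 32);
        return Avx2.Add(lo, cross);
    }
}
```

That is three vpmuludq plus two shifts and two adds per four lanes, versus a single vpmullq with AVX-512DQ.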

Btw, https://github.com/dotnet/runtime/pull/86811 touches the same path in the JIT for byte as far as I can see (but I still think it’d be better to do it on the C# side).

@EgorBo I don’t see any code that is using e.g. AVX2 or SSE2.

Right, because it’s implemented in the JIT, namely here: https://github.com/dotnet/runtime/blob/da1da02bbd2cb54490b7fc22f43ec32f5f302615/src/coreclr/jit/hwintrinsicxarch.cpp#L2056-L2103 (as you can see, it even has a TODO for V256<long> multiplication when Avx512DQ_VL is not present).

So you either implement it in the JIT right there or in C#. When the JIT doesn’t handle the intrinsic, it falls back to the C# implementation, so if you implement a path

if (typeof(T) == typeof(long) && Avx2.IsSupported)
{
    ...
}

it should be taken. Whichever path you take is up to you; personally, I prefer to do it in C# when possible: it’s simpler and ILLink-friendly (and it might help Mono as well).

Bonus question: how can I open an sln to compile this from VS? (I have only 17.6.0 installed.) I’m getting some cryptic errors from the ApiCompat MSBuild tasks when just trying to build System.Private.CoreLib.csproj…

I personally rarely do so, but to open the sln I run:

.\build.cmd Clr -c Debug -vs .\src\coreclr\System.Private.CoreLib\System.Private.CoreLib.csproj

> Btw, if you know how to optimize the multiplication of two Vector256<ulong> using pre-AVX-512 instructions, you might want to do it inside the operator * itself? I mean here:

Yep, exactly, that’s what I would have tried, thanks for confirming. Let me check if I can come up with something there.

Btw, if you know how to optimize the multiplication of two Vector256<ulong> using pre-avx512 you might want to do it inside the operator * itself? I mean here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256_1.cs#L268

The method name in the table suggests you’re measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?

Sorry, updated. The naming was from an old benchmark, but the implementation is using XXHash128.