runtime: Performance regression of XXHash128 for large buffers with .NET 8
Hey, I just wanted to double-check the performance of XXHash128 in .NET 8 (from PR #77944), and running with a large 1 MB buffer it is now significantly slower than the .NET 7 version:
BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
[Host] : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT AVX2
Job-YPKMNH : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
Job-XOTNUZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
| Method | Runtime | data | Mean | Error | StdDev | Ratio |
|---------------- |--------- |-------------- |---------:|---------:|---------:|------:|
| XXHash128Native | .NET 7.0 | Byte[1048576] | 27.33 us | 0.016 us | 0.015 us | 1.00 |
| XXHash128 | .NET 7.0 | Byte[1048576] | 31.67 us | 0.100 us | 0.093 us | 1.16 |
| XXHash128 | .NET 8.0 | Byte[1048576] | 45.37 us | 0.165 us | 0.154 us | 1.66 |
I haven’t dug into why, but given the size of the drop, my first suspect would be reduced inlining.
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 18 (18 by maintainers)
Cool, that was it! By generating a compatible AVX2 64-bit * 64-bit vectorized multiplication, I’m now getting 10% faster than the C++ version. 😎 I should have checked the generated ASM more thoroughly in the first place. 😅
I’ll try to prepare a PR later to add AVX2 + SSE2 paths, and double-check whether I can also create an ARM64 version.
Btw, https://github.com/dotnet/runtime/pull/86811 touches the same path in the JIT for byte as far as I can see (but I still think it’d be better to do it on the C# side).

Right, because it’s implemented in the JIT, namely here: https://github.com/dotnet/runtime/blob/da1da02bbd2cb54490b7fc22f43ec32f5f302615/src/coreclr/jit/hwintrinsicxarch.cpp#L2056-L2103 (as you can see, it even has a TODO for V256<long> multiplication when Avx512dq_vl is not present).

So you either implement it in the JIT right there, or in C#. When the JIT doesn’t handle the intrinsic, it falls back to the C# implementation, so if you implement a path there, it should be taken. Whichever path you want to take is up to you; I personally prefer to do it in C# when possible, as it’s simpler and ILLink-friendly (and it might help Mono as well).
Yep, exactly, that’s what I would have tried, thanks for confirming. Let me check if I can come up with something there.
Btw, if you know how to optimize the multiplication of two Vector256<ulong> using pre-AVX-512 instructions, you might want to do it inside the operator * itself? I mean here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256_1.cs#L268

Sorry, updated; the naming was from an old benchmark, but the implementation is using XXHash128.