runtime: MemoryStream.Write() slow on Ryzen CPUs
While doing some benchmarking, I’ve noticed that .NET Core is noticeably (21%) slower than .NET Framework on my AMD Ryzen 3 1200-based PC for a certain piece of code.
Running the same benchmark on an Intel Core i7-6700-based PC shows .NET Core running significantly (41%) faster than Framework.
Code to reproduce is here: https://github.com/LordBenjamin/DotNetCore-Ryzen-Performance-Repro/
Benchmarks
OriginalBenchmark.cs
Quite close to my actual code. Slower on Ryzen and faster on Intel.
StreamWriteBenchmark.cs
Distilled down after I noticed that commenting out the call to MemoryStream.Write(byte[] buffer, int offset, int count) in OriginalBenchmark.cs leaves .NET Core consistently faster on both CPUs. In this distilled benchmark, Core is significantly slower than Framework on Ryzen, but comparable to Framework on Intel.
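For readers who do not want to open the repository, the shape of the distilled case can be sketched roughly as follows. This is not the author's StreamWriteBenchmark.cs: the class name, the chunk size, and the chunk count are all hypothetical, and it uses a plain Stopwatch loop instead of BenchmarkDotNet so that it runs without extra packages. Treat the numbers it prints as indicative only.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class StreamWriteSketch
{
    // Hypothetical sizes; the real values live in the linked repository.
    private const int ChunkSize = 4096;
    private const int ChunkCount = 256;

    // Writes chunkCount copies of chunk into a pre-sized MemoryStream
    // and returns the number of bytes the stream ended up holding.
    public static long WriteChunks(byte[] chunk, int chunkCount)
    {
        using (var stream = new MemoryStream(chunk.Length * chunkCount))
        {
            for (int i = 0; i < chunkCount; i++)
            {
                stream.Write(chunk, 0, chunk.Length);
            }
            return stream.Length;
        }
    }

    public static void Main()
    {
        var chunk = new byte[ChunkSize];
        new Random(42).NextBytes(chunk);

        // Warm up so that JIT compilation is not part of the measurement.
        for (int i = 0; i < 100; i++)
        {
            WriteChunks(chunk, ChunkCount);
        }

        const int iterations = 1000;
        long total = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            total += WriteChunks(chunk, ChunkCount);
        }
        sw.Stop();

        double usPerCall = sw.Elapsed.TotalMilliseconds * 1000.0 / iterations;
        Console.WriteLine($"{usPerCall:F2} us per call, {total} bytes written in total");
    }
}
```

Building the same file against both .NET Framework and .NET Core and running it on each machine should show whether the gap reproduces outside BenchmarkDotNet, although a hand-rolled timing loop is far noisier than the results reported below.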
Affected Frameworks
I can reproduce using both .NET Core 2.1 and 3.0. Framework version is 4.7.2 in both cases.
Results
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 3 1200, 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview6-012264
[Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
Core : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
| Method | Job | Runtime | Mean | Error | StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite | Clr | Clr | 68.21 us | 0.4611 us | 0.4088 us |
| StreamWrite | Core | Core | 118.42 us | 0.5408 us | 0.5058 us |
| Original | Clr | Clr | 502.4 us | 6.2900 us | 5.8840 us |
| Original | Core | Core | 630.2 us | 3.8080 us | 3.5620 us |
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.700
[Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3801.0
Core : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
| Method | Job | Runtime | Mean | Error | StdDev |
|------------ |----- |-------- |---------:|----------:|----------:|
| StreamWrite | Clr | Clr | 75.37 us | 0.9133 us | 0.7626 us |
| StreamWrite | Core | Core | 75.39 us | 0.8624 us | 0.8067 us |
| Original | Clr | Clr | 598.0 us | 11.6530 us | 14.74 us |
| Original | Core | Core | 341.6 us | 6.8020 us | 10.79 us |
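To make the tables easier to compare, the Core-to-Framework time ratio for each row can be worked out directly from the reported means. The snippet below is only arithmetic on the numbers above; the class and method names are made up for illustration.

```csharp
using System;
using System.Globalization;

public static class BenchmarkRatios
{
    // Ratio of the .NET Core mean to the .NET Framework mean.
    // A value above 1 means Core took longer than Framework.
    public static double CoreToClr(double coreMeanUs, double clrMeanUs)
    {
        return coreMeanUs / clrMeanUs;
    }

    private static void Print(string label, double coreMeanUs, double clrMeanUs)
    {
        Console.WriteLine(string.Format(
            CultureInfo.InvariantCulture,
            "{0}: Core/Clr = {1:F2}",
            label,
            CoreToClr(coreMeanUs, clrMeanUs)));
    }

    public static void Main()
    {
        // Mean times in microseconds, copied from the tables above.
        Print("Ryzen 3 1200, StreamWrite", 118.42, 68.21);  // 1.74
        Print("Ryzen 3 1200, Original", 630.2, 502.4);      // 1.25
        Print("Core i7-6700, StreamWrite", 75.39, 75.37);   // 1.00
        Print("Core i7-6700, Original", 341.6, 598.0);      // 0.57
    }
}
```

On these figures the isolated StreamWrite case takes about 1.74 times as long under Core on the Ryzen machine, while the two runtimes are at parity on the Intel machine.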
Happy to supply further information or change and re-run benchmarks as required.
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 18 (14 by maintainers)
Got a new CPU (Ryzen 3600) recently and .NET 4.7.2 runs considerably faster, but Core 3.0 runs even slower:
Reverting to use the CRT implementation makes it the same speed as .NET Framework (it actually looks to be consistently about 1% faster, but that is likely within error).
Changing the JIT_MemCpy algorithm to remove the rep stosb path makes the code ~2% slower than the .NET Framework code.
As @jkotas called out in the original PR (https://github.com/dotnet/coreclr/pull/7198#discussion_r78892798), having a private copy of memset/memcpy has a number of drawbacks. It also certainly hasn’t undergone nearly the same amount of testing/profiling as the CRT implementation.
I would be very much in favor of removing this path, especially since it causes a nearly 2x slowdown on Ryzen machines. It is also missing a bunch of logic around prefetching, ensuring data alignment, etc. that the CRT implementation does currently have.
In any case, you should submit a fix to master for .NET 5. We can then decide whether it will get backported to release/3.0.
@jkotas, @janvorli, is this something we should consider fixing for 3.0, considering it is a nearly 2x penalty for copying memory on Ryzen CPUs?
The fix is relatively trivial, removes our custom memcpy routine (which is already 64-bit Windows only), and won’t significantly regress Intel CPUs (based on the numbers in the original PR, the difference averaged 33% for inputs under 512 bytes, with no real difference for inputs over 512 bytes).
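Anyone wanting to check how copy cost varies around the 512-byte mark on their own hardware could start from something like the sketch below. It assumes that Buffer.BlockCopy on byte arrays ends up in the same native copy routine discussed in this thread, which the sketch itself does not verify; the class name and the list of sizes are arbitrary choices for illustration.

```csharp
using System;
using System.Diagnostics;

public static class CopySizeSketch
{
    // Copies `size` bytes `iterations` times and returns the mean
    // time per copy in nanoseconds.
    public static double MeasureCopyNs(int size, int iterations)
    {
        var source = new byte[size];
        var destination = new byte[size];
        new Random(1).NextBytes(source);

        // One warm-up copy so that JIT compilation is excluded from the timing.
        Buffer.BlockCopy(source, 0, destination, 0, size);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            Buffer.BlockCopy(source, 0, destination, 0, size);
        }
        sw.Stop();

        // Milliseconds times 1e6 gives nanoseconds.
        return sw.Elapsed.TotalMilliseconds * 1e6 / iterations;
    }

    public static void Main()
    {
        // Arbitrary sizes on both sides of the 512-byte mark.
        int[] sizes = { 64, 256, 512, 1024, 4096, 65536 };
        foreach (int size in sizes)
        {
            double ns = MeasureCopyNs(size, 1000000);
            Console.WriteLine($"{size,6} bytes: {ns:F1} ns per copy");
        }
    }
}
```

Comparing the output on a Ryzen and an Intel machine, and between .NET Framework and .NET Core builds, would give a rough picture of where the two copy implementations diverge, though BenchmarkDotNet remains the better tool for numbers worth quoting.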