runtime: MemoryStream.Write() slow on Ryzen CPUs

I’ve noticed while doing some benchmarking that .NET Core is noticeably (21%) slower than .NET Framework on my AMD Ryzen 1200 based PC for a certain piece of code.

Running the same benchmark on an Intel i7 6700 based PC shows .NET Core running significantly (41%) faster than Framework.

Code to reproduce is here: https://github.com/LordBenjamin/DotNetCore-Ryzen-Performance-Repro/

Benchmarks

  • OriginalBenchmark.cs

Quite close to my actual code. Slower on Ryzen and faster on Intel.

  • StreamWriteBenchmark.cs

Distilled down after I noticed that commenting out the call to MemoryStream.Write(byte[] buffer, int offset, int count); in OriginalBenchmark.cs leaves .NET Core consistently faster on both CPUs. Core performance is significantly worse on Ryzen, but comparable to Framework on Intel.
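
For readers without the repository to hand, the shape of such a distilled benchmark can be sketched roughly as below. This is not the code from StreamWriteBenchmark.cs: the class name, chunk size, and chunk count are hypothetical, and it uses a plain Stopwatch loop instead of BenchmarkDotNet so that it runs without extra packages.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class StreamWriteSketch
{
    // Hypothetical sizes; the values used by the real benchmark are not shown in this report.
    public const int ChunkSize = 4096;
    public const int ChunkCount = 256;

    // Writes chunkCount copies of chunk into a pre-sized MemoryStream
    // and returns the final stream length.
    public static long WriteChunks(byte[] chunk, int chunkCount)
    {
        using (var stream = new MemoryStream(chunk.Length * chunkCount))
        {
            for (int i = 0; i < chunkCount; i++)
            {
                stream.Write(chunk, 0, chunk.Length);
            }
            return stream.Length;
        }
    }

    // Returns the mean time per WriteChunks call, in microseconds.
    public static double MeasureMicroseconds(int iterations)
    {
        var chunk = new byte[ChunkSize];
        WriteChunks(chunk, ChunkCount); // warm-up call so JIT time is not measured

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            WriteChunks(chunk, ChunkCount);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds * 1000.0 / iterations;
    }

    public static void Main()
    {
        Console.WriteLine($"{MeasureMicroseconds(1000):F2} us per call");
    }
}
```

Building the same file for both .NET Framework and .NET Core and comparing the printed figure gives a coarse version of the comparison; BenchmarkDotNet remains the better tool for reliable numbers.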

Affected Frameworks

I can reproduce using both .NET Core 2.1 and 3.0. Framework version is 4.7.2 in both cases.

Results

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 3 1200, 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview6-012264
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  68.21 us | 0.4611 us | 0.4088 us |
| StreamWrite | Core |    Core | 118.42 us | 0.5408 us | 0.5058 us |
|    Original |  Clr |     Clr |  502.4 us | 6.2900 us | 5.8840 us |
|    Original | Core |    Core |  630.2 us | 3.8080 us | 3.5620 us |

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.700
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3801.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |     Mean |      Error |    StdDev |
|------------ |----- |-------- |---------:|-----------:|----------:|
| StreamWrite |  Clr |     Clr | 75.37 us |  0.9133 us | 0.7626 us |
| StreamWrite | Core |    Core | 75.39 us |  0.8624 us | 0.8067 us |
|    Original |  Clr |     Clr | 598.0 us | 11.6530 us |  14.74 us |
|    Original | Core |    Core | 341.6 us |  6.8020 us |  10.79 us |
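
To compare the two machines directly, the Core-to-Framework ratio of the mean times can be computed from the tables above. The snippet below is only a small worked calculation; the class and method names are hypothetical. A percentage derived this way depends on which run and which baseline is used, so it need not match the figures quoted in the description.

```csharp
using System;

public static class BenchmarkRatios
{
    // Ratio of the .NET Core mean to the .NET Framework mean.
    // Above 1.0 means Core took longer; below 1.0 means Core was faster.
    public static double CoreToFramework(double coreMeanUs, double frameworkMeanUs)
    {
        return coreMeanUs / frameworkMeanUs;
    }

    public static void Main()
    {
        // Mean values in microseconds, copied from the two tables above.
        Console.WriteLine($"Ryzen 3 1200 StreamWrite: {CoreToFramework(118.42, 68.21):F2}x");
        Console.WriteLine($"Ryzen 3 1200 Original:    {CoreToFramework(630.2, 502.4):F2}x");
        Console.WriteLine($"i7-6700 StreamWrite:      {CoreToFramework(75.39, 75.37):F2}x");
        Console.WriteLine($"i7-6700 Original:         {CoreToFramework(341.6, 598.0):F2}x");
    }
}
```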

Happy to supply further information or change and re-run benchmarks as required.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 18 (14 by maintainers)

Most upvoted comments

Got a new CPU (Ryzen 3600) recently and .NET 4.7.2 runs considerably faster, but Core 3.0 runs even slower:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.615 (1809/October2018Update/Redstone5)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview6-012264
  [Host] : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  44.45 us | 0.0486 us | 0.0430 us |
| StreamWrite | Core |    Core | 138.92 us | 0.4383 us | 0.4100 us |

Reverting to use the CRT implementation makes it the same speed as .NET Framework (it actually looks to be consistently about 1% faster, but that is likely within error).

Changing the JIT_MemCpy algorithm to remove the rep movsb path makes the code ~2% slower than the .NET Framework code.

As @jkotas called out in the original PR (https://github.com/dotnet/coreclr/pull/7198#discussion_r78892798), having a private copy of memset/memcpy has a number of drawbacks. It also certainly hasn’t undergone nearly the same amount of testing/profiling as the CRT implementation.

I would be very much in favor of removing this path, especially since it causes a nearly 2x slowdown on Ryzen machines. The custom routine is also missing a bunch of logic around prefetching, ensuring data alignment, etc. that the CRT implementation currently has.

In any case, you should submit a fix to master for .NET 5. We can then decide whether it will get backported to release/3.0.

@jkotas, @janvorli. Is this something we should consider fixing for 3.0, considering it is a nearly 2x penalty for copying memory on Ryzen CPUs?

The fix is relatively trivial, removes our custom memcpy routine (which is already 64-bit Windows only), and won’t significantly regress Intel CPUs (based on the numbers in the original PR, the difference was 33% on average for inputs under 512 bytes and no real difference for inputs over 512 bytes).
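
One way to check a candidate fix on both vendors' CPUs is to time block copies at sizes on either side of the 512-byte mark mentioned above. The following is a minimal sketch, not a replacement for the BenchmarkDotNet runs earlier in the thread. It assumes Buffer.BlockCopy is a reasonable stand-in for the copy that MemoryStream.Write performs; the class name, size list, and repetition count are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class CopySizeSweep
{
    // Copies `size` bytes `repetitions` times and returns nanoseconds per copy.
    public static double NanosecondsPerCopy(int size, int repetitions)
    {
        var source = new byte[size];
        var destination = new byte[size];
        new Random(42).NextBytes(source);

        // Warm-up copy so JIT and first-touch costs are not measured.
        Buffer.BlockCopy(source, 0, destination, 0, size);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < repetitions; i++)
        {
            Buffer.BlockCopy(source, 0, destination, 0, size);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds * 1_000_000.0 / repetitions;
    }

    public static void Main()
    {
        // Assumed sizes, chosen to straddle the 512-byte mark.
        foreach (int size in new[] { 64, 256, 512, 1024, 4096, 65536 })
        {
            Console.WriteLine($"{size,6} bytes: {NanosecondsPerCopy(size, 100_000):F1} ns/copy");
        }
    }
}
```

Running this before and after a runtime change, on both a Ryzen and an Intel machine, would show whether small copies regress and whether large copies recover.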