runtime: Span-perf for copy via loop much slower than array-version

Description

In a simple copy loop, the Span version is noticeably slower than the array version (about 1.55x in the results below). I’d expect Span and array to have similar perf.

Note: Span_CopyTo is included just for reference.

Benchmark

Results


BenchmarkDotNet=v0.10.13, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.309)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008384
  [Host]     : .NET Core 2.1.0-preview2-26313-01 (CoreCLR 4.6.26310.01, CoreFX 4.6.26313.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview2-26313-01 (CoreCLR 4.6.26310.01, CoreFX 4.6.26313.01), 64bit RyuJIT

Method	Mean	Error	StdDev	Scaled
Array	4.435 us	0.0400 us	0.0374 us	1.00
Span	6.885 us	0.0307 us	0.0256 us	1.55
Span_CopyTo	1.132 us	0.0072 us	0.0067 us	0.26

Code

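using System;
using System.Linq;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks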
{
    private readonly int[] _source;
    private readonly int[] _destination;

    public Benchmarks()
    {
        _source = Enumerable.Range(0, 10_000).ToArray();
        _destination = new int[_source.Length];
    }

    [Benchmark(Baseline = true)]
    public void Array()
    {
        Copy(_source, _destination);
    }

    [Benchmark]
    public void Span()
    {
        Copy(_source.AsSpan(), _destination.AsSpan());
    }

    [Benchmark]
    public void Span_CopyTo()
    {
        _source.AsSpan().CopyTo(_destination.AsSpan());
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Copy(int[] src, int[] dst)
    {
        for (int i = 0; i < src.Length; ++i)
            dst[i] = src[i];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Copy(ReadOnlySpan<int> src, Span<int> dst)
    {
        for (int i = 0; i < src.Length; ++i)
            dst[i] = src[i];
    }
}

Disassembly

Note: The JitDisasm was produced from a fresh coreclr build on Ubuntu.

dotnet --info

.NET Core SDK (reflecting any global.json):
 Version:   2.1.300-preview3-008384
 Commit:    4343118151

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  16.04
 OS Platform: Linux
 RID:         ubuntu.16.04-x64
 Base Path:   /usr/share/dotnet/sdk/2.1.300-preview3-008384/

Host (useful for support):
  Version: 2.1.0-preview3-26319-04
  Commit:  939333dbc8

.NET Core SDKs installed:
  2.1.4 [/usr/share/dotnet/sdk]
  2.1.300-preview3-008384 [/usr/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.0-preview2-30338 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.0-preview2-30338 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.0.5 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-preview2-26313-01 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-preview3-26319-04 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Also, only the loops are shown.

Array-variant

G_M16768_IG04:
       4863C8               movsxd   rcx, eax
       448B448F10           mov      r8d, dword ptr [rdi+4*rcx+16]
       4489448E10           mov      dword ptr [rsi+4*rcx+16], r8d
       FFC0                 inc      eax
       3BD0                 cmp      edx, eax
       7FED                 jg       SHORT G_M16768_IG04

Span-variant

G_M16768_IG03:
       413BDE               cmp      ebx, r14d
       732D                 jae      SHORT G_M16768_IG05
       4863FB               movsxd   rdi, ebx
       3B5DE0               cmp      ebx, dword ptr [rbp-20H]
       7325                 jae      SHORT G_M16768_IG05
       488B45D8             mov      rax, bword ptr [rbp-28H]
       8B04B8               mov      eax, dword ptr [rax+4*rdi]
       418904BF             mov      dword ptr [r15+4*rdi], eax
       FFC3                 inc      ebx
       488D7DD8             lea      rdi, bword ptr [rbp-28H]
       E8278F5CFF           call     System.ReadOnlySpan`1[Int32][System.Int32]:get_Length():int:this
       3BC3                 cmp      eax, ebx
       7FD9                 jg       SHORT G_M16768_IG03

category:cq theme:loop-opt skill-level:expert cost:medium

About this issue

State: open
Created 6 years ago
Comments: 22 (22 by maintainers)

Commits related to this issue

Updated steps in 'Viewing JIT Dumps' to see optimized code (#17077) If one follows the current described steps, one won't see the JIT dump for optimized code in the core lib. This PR adds the necess... — committed to dotnet/coreclr by gfoidl 6 years ago

Most upvoted comments

Even if the dasm looks quite good for Windows, the perf-numbers aren’t.

Yep, I can reproduce your numbers on my machine with the assembly I posted above. Whatever happens on Linux that causes Length not to be inlined is a separate issue.

Best I can tell, the reason the span version is slower is a combination of a lack of loop cloning and a lack of hoisting of the destination span's field loads. This can be seen if you try the following version:

for (int i = 0; i < src.Length && i < dst.Length; ++i)
    dst[i] = src[i];

that generates

G_M55888_IG02:
       488B02               mov      rax, bword ptr [rdx]
       8B5208               mov      edx, dword ptr [rdx+8]
       4C8B01               mov      r8, bword ptr [rcx]
       8B4908               mov      ecx, dword ptr [rcx+8]
       4533C9               xor      r9d, r9d
       85C9                 test     ecx, ecx
       7E1A                 jle      SHORT G_M55888_IG05
       EB13                 jmp      SHORT G_M55888_IG04
G_M55888_IG03:
       4D63D1               movsxd   r10, r9d
       478B1C90             mov      r11d, dword ptr [r8+4*r10]
       46891C90             mov      dword ptr [rax+4*r10], r11d
       41FFC1               inc      r9d
       443BC9               cmp      r9d, ecx
       7D05                 jge      SHORT G_M55888_IG05
G_M55888_IG04:
       443BCA               cmp      r9d, edx
       7CE8                 jl       SHORT G_M55888_IG03
G_M55888_IG05:
       C3                   ret

This has no range checks and no in-loop loads of span fields. It does have an extra compare, so it’s still a bit slower than the array version, but quite a bit faster than the original version.
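For anyone who wants to try that loop outside the benchmark, here is a minimal standalone sketch. The names `CopyDemo` and `CopyBothBounds` are made up for illustration and do not appear in the issue. It only checks that the copy is correct and says nothing about timing, which will depend on the JIT version in use.

```csharp
using System;
using System.Linq;

public static class CopyDemo
{
    // Hypothetical helper name; the loop condition and body are the ones quoted above.
    public static void CopyBothBounds(ReadOnlySpan<int> src, Span<int> dst)
    {
        // i is bounded by both lengths, so neither indexer can go out of range.
        for (int i = 0; i < src.Length && i < dst.Length; ++i)
            dst[i] = src[i];
    }

    public static void Main()
    {
        // Same data shape as the benchmark: 10,000 ints and an equally sized destination.
        int[] source = Enumerable.Range(0, 10_000).ToArray();
        int[] destination = new int[source.Length];

        // Arrays convert implicitly to ReadOnlySpan<int> and Span<int>.
        CopyBothBounds(source, destination);

        Console.WriteLine(source.SequenceEqual(destination)); // prints True
    }
}
```

One behavioural difference is worth keeping in mind: if `dst` is shorter than `src`, this loop stops at `dst.Length` and copies only a prefix, whereas the original loop would throw an `IndexOutOfRangeException` on the first out-of-range write.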

mikedn on Mar 20, 2018

Best practice would be to locally build both Release and Debug. Then use the release bits to overwrite the DLLs. Then overwrite again with just the Debug-built jit DLL so you can enable disasm and dumps and such.

So you end up with all the copied bits from the same build, and everything copied is built Release except for the jit, which is Debug.

AndyAyersMS on Mar 20, 2018