runtime: Span-perf for copy via loop much slower than array-version
Description
In a simple copy loop, Span is a lot slower than the array version (about 1.55x in the results below). I'd expect Span and array to have similar perf.
Note: Span_CopyTo is included only for reference.
Benchmark
Results
BenchmarkDotNet=v0.10.13, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.309)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008384
[Host] : .NET Core 2.1.0-preview2-26313-01 (CoreCLR 4.6.26310.01, CoreFX 4.6.26313.01), 64bit RyuJIT
DefaultJob : .NET Core 2.1.0-preview2-26313-01 (CoreCLR 4.6.26310.01, CoreFX 4.6.26313.01), 64bit RyuJIT
| Method | Mean | Error | StdDev | Scaled |
|---|---|---|---|---|
| Array | 4.435 us | 0.0400 us | 0.0374 us | 1.00 |
| Span | 6.885 us | 0.0307 us | 0.0256 us | 1.55 |
| Span_CopyTo | 1.132 us | 0.0072 us | 0.0067 us | 0.26 |
Code
using System;
using System.Linq;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    private readonly int[] _source;
    private readonly int[] _destination;

    public Benchmarks()
    {
        _source = Enumerable.Range(0, 10_000).ToArray();
        _destination = new int[_source.Length];
    }

    [Benchmark(Baseline = true)]
    public void Array()
    {
        Copy(_source, _destination);
    }

    [Benchmark]
    public void Span()
    {
        Copy(_source.AsSpan(), _destination.AsSpan());
    }

    [Benchmark]
    public void Span_CopyTo()
    {
        _source.AsSpan().CopyTo(_destination.AsSpan());
    }

    // Element-by-element copy over arrays (baseline).
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Copy(int[] src, int[] dst)
    {
        for (int i = 0; i < src.Length; ++i)
            dst[i] = src[i];
    }

    // Same loop over spans; this is the slow case.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Copy(ReadOnlySpan<int> src, Span<int> dst)
    {
        for (int i = 0; i < src.Length; ++i)
            dst[i] = src[i];
    }
}
Disassembly
Note: The JitDisasm was produced with a fresh coreclr build on Ubuntu.
dotnet --info
.NET Core SDK (reflecting any global.json):
Version: 2.1.300-preview3-008384
Commit: 4343118151
Runtime Environment:
OS Name: ubuntu
OS Version: 16.04
OS Platform: Linux
RID: ubuntu.16.04-x64
Base Path: /usr/share/dotnet/sdk/2.1.300-preview3-008384/
Host (useful for support):
Version: 2.1.0-preview3-26319-04
Commit: 939333dbc8
.NET Core SDKs installed:
2.1.4 [/usr/share/dotnet/sdk]
2.1.300-preview3-008384 [/usr/share/dotnet/sdk]
.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.1.0-preview2-30338 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.0-preview2-30338 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.0.5 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.0-preview2-26313-01 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.0-preview3-26319-04 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
Also, only the loops are shown.
Array-variant
G_M16768_IG04:
4863C8 movsxd rcx, eax
448B448F10 mov r8d, dword ptr [rdi+4*rcx+16]
4489448E10 mov dword ptr [rsi+4*rcx+16], r8d
FFC0 inc eax
3BD0 cmp edx, eax
7FED jg SHORT G_M16768_IG04
Span-variant
G_M16768_IG03:
413BDE cmp ebx, r14d
732D jae SHORT G_M16768_IG05
4863FB movsxd rdi, ebx
3B5DE0 cmp ebx, dword ptr [rbp-20H]
7325 jae SHORT G_M16768_IG05
488B45D8 mov rax, bword ptr [rbp-28H]
8B04B8 mov eax, dword ptr [rax+4*rdi]
418904BF mov dword ptr [r15+4*rdi], eax
FFC3 inc ebx
488D7DD8 lea rdi, bword ptr [rbp-28H]
E8278F5CFF call System.ReadOnlySpan`1[Int32][System.Int32]:get_Length():int:this
3BC3 cmp eax, ebx
7FD9 jg SHORT G_M16768_IG03
category:cq theme:loop-opt skill-level:expert cost:medium
About this issue
- State: open
- Created 6 years ago
- Comments: 22 (22 by maintainers)
Yep, I can reproduce your numbers on my machine with the assembly I posted above. Whatever happens on Linux that causes `Length` not to be inlined is a separate issue.
Best I can tell, the span version is slower because of a combination of missing loop cloning and missing hoisting of the destination span field loads. This can be seen if you try the following version:
that generates
This does not have range checks and no in-loop loads of span fields. It does have an extra compare so it’s still a bit slower than the array version but quite a bit faster than the original version.
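As an illustration of the idea (not necessarily the exact code the comment refers to), a minimal sketch could test the index against both span lengths in the loop condition. The class name `CopySketch` and the up-front length check are assumptions added for this example. Whether the JIT actually drops the range checks and keeps the span fields in registers depends on the runtime version, so the result should be confirmed with a disassembly.

```csharp
using System;

public static class CopySketch
{
    // Hypothetical variant of the issue's span-based Copy method.
    public static void Copy(ReadOnlySpan<int> src, Span<int> dst)
    {
        // Fail up front instead of relying on a per-element range check.
        if (dst.Length < src.Length)
            throw new ArgumentException("Destination is too short.", nameof(dst));

        // Comparing i against both lengths gives the JIT the facts it needs
        // to prove that src[i] and dst[i] are in range. The cost is one
        // extra compare per iteration.
        for (int i = 0; i < src.Length && i < dst.Length; ++i)
            dst[i] = src[i];
    }
}
```

Calling it with arrays works through the implicit conversions to `ReadOnlySpan<int>` and `Span<int>`, for example `CopySketch.Copy(source, destination)` where both arguments are `int[]`.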
Best practice would be to locally build both Release and Debug. Then use the release bits to overwrite the DLLs. Then overwrite again with just the Debug-built jit DLL so you can enable disasm and dumps and such.
So you end up with all the copied bits from the same build, and everything copied is built Release except for the jit, which is Debug.