runtime: Regression in AsSpan for uint[] Datatype

Netcoreapp 3.0 Sdk Version 3.0.100-preview-009841


BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.471 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-alpha1-009630
  [Host]     : .NET Core 3.0.0-preview1-26928-03 (CoreCLR 4.6.26927.03, CoreFX 4.6.26927.03), 64bit RyuJIT
  Job-TQETEJ : .NET Core 3.0.0-preview1-26928-03 (CoreCLR 4.6.26927.03, CoreFX 4.6.26927.03), 64bit RyuJIT

BuildConfiguration=Release-Intrinsics  Toolchain=netcoreapp3.0  InvocationCount=1  
MaxIterationCount=20  UnrollFactor=1  WarmupCount=1

Method	Mean	Error	StdDev	Median	Extra Metric
AsSpanForUnitBenchmark	1.445 us	0.8379 us	0.9649 us	0.8400 us	-

Sdk version 2.1.401 Netcoreapp2.1


BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.471 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-alpha1-009630
  [Host]     : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT
  Job-JARWGC : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp2.1  InvocationCount=1  
MaxIterationCount=20  UnrollFactor=1  WarmupCount=1

Method	Mean	Error	StdDev	Median	Extra Metric
AsSpanForUnitBenchmark	48.75 ns	29.60 ns	29.07 ns	65.00 ns	-

The code for this benchmark is


public uint[] input;
public int end;

 [IterationSetup(Target = nameof(AsSpanForUnitBenchmark))]
public void setup()
 {
       int min = 0;
        int max = 100000;
       Random randNum = new Random();
       end = randNum.Next(min, max);

        input = new uint[max]
       for (int i = 0; i < input.Length; i++)
        {
               input[i] = (uint)randNum.Next(min, max);
        }
}

[Benchmark]
public void AsSpanForUnitBenchmark()
{
    input.AsSpan(0, end);
}

cc @danmosemsft @adamsitnik @GrabYourPitchforks

About this issue

Original URL
State: closed
Created 6 years ago
Reactions: 1
Comments: 28 (28 by maintainers)

Most upvoted comments

I’ll take a look, recently I stumbled upon a JIT change that gets rid of some of these funny moves. Maybe we’re lucky and it’s the same case.

Also, the second mov in

mov     ecx,dword ptr [rax+8]
mov     ecx,ecx

might be eliminated by another change I have.

mikedn on Dec 19, 2018

Am going to close since I think this is fixed – please reopen if that’s not the case.

AndyAyersMS on Feb 26, 2019

I don’t see much perf difference between 3.0 Preview 2 and 2.1 on the test case that was added in dotnet/performance#260 to try and capture this issue.

BenchmarkDotNet=v0.11.3.1003-nightly, OS=Windows 10.0.18323
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview4-010309
  [Host]     : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  Job-WTCAUC : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  Job-YHFHGF : .NET Core 3.0.0-preview-27412-2 (CoreCLR 4.6.27411.71, CoreFX 4.7.19.11201), 64bit RyuJIT

Runtime=Core  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

Method	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0/1k Op	Gen 1/1k Op	Gen 2/1k Op	Allocated Memory/Op
ArrayAsSpanStartLength	netcoreapp2.1	0.3919 ns	0.0134 ns	0.0125 ns	0.3879 ns	0.3705 ns	0.4164 ns	1.00	0.00	-	-	-	-
ArrayAsSpanStartLength	netcoreapp3.0	0.4198 ns	0.0288 ns	0.0269 ns	0.4166 ns	0.3848 ns	0.4819 ns	1.07	0.08	-	-	-	-

@Anipik can you remeasure your test with 3.0 preview 2?

AndyAyersMS on Feb 19, 2019

This should have been be improved by dotnet/coreclr#19429 and dotnet/coreclr#22454 …

;;; 2.1

G_M12530_IG04:
       83780800             cmp      dword ptr [rax+8], 0
       721D                 jb       SHORT G_M12530_IG09
       395008               cmp      dword ptr [rax+8], edx
       7218                 jb       SHORT G_M12530_IG09

G_M12530_IG05:
       488D4810             lea      rcx, bword ptr [rax+16]

G_M12530_IG06:
       48894C2428           mov      bword ptr [rsp+28H], rcx
       89542430             mov      dword ptr [rsp+30H], edx

and master with the above PRs merged does show promise…

;;; master plus 22454

G_M12530_IG04:
       8B4808               mov      ecx, dword ptr [rax+8]
       448BC2               mov      r8d, edx
       493BC8               cmp      rcx, r8
       721B                 jb       SHORT G_M12530_IG09

G_M12530_IG05:
       4883C010             add      rax, 16
       488BC8               mov      rcx, rax

G_M12530_IG06:
       48894C2428           mov      bword ptr [rsp+28H], rcx
       89542430             mov      dword ptr [rsp+30H], edx

AndyAyersMS on Feb 7, 2019

But is it LSRA’s job to deal with poor IR generated by previous phases?

Not entirely. But given phase ordering and all, I think it’s generally accepted that a register allocator should incorporate copy elimination, either as part of its function, or done prior to.

CarolEidt on Dec 20, 2018

Good find @Anipik 🥂

danmoseley on Dec 18, 2018

I do not think we have any active perf issue on this one.

I believe this regression was introduced by https://github.com/dotnet/coreclr/pull/20771. Here is the disassembly difference:

Before:

...
cmp     dword ptr [rax+8],0
jb      AsSpanForUnitBenchmark()+...
cmp     dword ptr [rax+8],edx
jb      AsSpanForUnitBenchmark()+...
lea     rcx,[rax+10h]
mov     qword ptr [rsp+28h],rcx
mov     dword ptr [rsp+30h],edx
...

After:

mov     ecx,dword ptr [rax+8]
mov     ecx,ecx
mov     r8d,edx
cmp     rcx,r8
jb      AsSpanForUnitBenchmark()+...
add     rax,10h
mov     rcx,rax
mov     rax,rcx
mov     rcx,rax
mov     qword ptr [rsp+28h],rcx
mov     dword ptr [rsp+30h],edx

The JIT code is bigger and its has more data dependencies, and thus runs slower. I have verified that reverting the commit changes the code to what it used to be.

cc @GrabYourPitchforks @ahsonkhan

jkotas on Dec 18, 2018