runtime: [macOS] Potential regression in delegates invocation

It seems to be affecting only macOS (cc @Lxiamail @jeffhandley)

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net5.0 net6.0 --filter PerfLabTests.DelegatePerf.DelegateInvoke

PerfLabTests.DelegatePerf.DelegateInvoke

Result Ratio Operating System
Same 1.01 Windows 10.0.19043.1165
Same 0.99 Windows 10.0.20348
Same 1.00 Windows 10.0.20348
Same 1.00 Windows 10.0.18363.1621
Same 0.97 Windows 8.1
Same 1.00 Windows 10.0.19042.685
Same 1.00 Windows 10.0.19043.1165
Same 0.99 Windows 10.0.22454
Same 0.99 Windows 10.0.22451
Same 1.00 Windows 10.0.19042.1165
Slower 0.29 Windows 7 SP1
Same 0.99 centos 8
Same 1.00 debian 10
Same 0.99 rhel 7
Same 1.02 sles 15
Same 1.01 opensuse-leap 15.3
Same 1.00 ubuntu 18.04
Same 1.00 ubuntu 18.04
Same 1.00 alpine 3.13
Same 1.00 ubuntu 16.04
Faster 1.33 Windows 10.0.19043.1165
Faster 1.43 Windows 10.0.22000
Same 1.00 Windows 10.0.19043.1165
Same 1.00 Windows 10.0.18363.1621
Same 0.99 Windows 10.0.19043.1165
Slower 0.89 macOS Big Sur 11.5.2
Slower 0.72 macOS Big Sur 11.5.2
Slower 0.78 macOS Big Sur 11.4

cc @AndyAyersMS @kunalspathak @EgorBo

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

FWIW TieredPGO does nicely here in .NET 7, thanks to @jakobbotsch

Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
DelegateInvoke Job-KIGZAI .NET 5.0 net5.0 452.4 us 0.56 us 0.47 us 452.2 us 452.04 us 453.5 us 1.00 - NA
DelegateInvoke Job-GEGJCA .NET 6.0 net6.0 516.8 us 0.21 us 0.18 us 516.8 us 516.49 us 517.1 us 1.14 3 B NA
DelegateInvoke Job-RJIHFR .NET 7.0 net7.0 102.0 us 1.50 us 1.33 us 102.1 us 99.52 us 104.1 us 0.23 17 B NA

Ok, think this is related to loop alignment.

In 5.0 we only 32 byte aligned Tier1 methods with loops. This was fixed in 6.0 (#42909) to handle 32 byte aligning all optimized method with loops. We later went on to add alignment padding for loops in 6.0. But we bypass padding if a loop contains a call.

In this test the jitted codegen is identical in 5.0 and 6.0, and the key inner loop in InvokeDelegate is

G_M6345_IG03:              ;; offset=003EH
       488BC3               mov      rax, rbx
       488B7808             mov      rdi, gword ptr [rax+8]
       498BF6               mov      rsi, r14
       BA64000000           mov      edx, 100
       B964000000           mov      ecx, 100
       FF5018               call     qword ptr [rax+24]DelegateLong:Invoke(System.Object,long,long):long:this
       4C8BF8               mov      r15, rax
       41FFC4               inc      r12d
       48B88067BF0801000000 mov      rax, 0x108BF6780
       443B20               cmp      r12d, dword ptr [rax]
       7CD4                 jl       SHORT G_M6345_IG03
						;; bbWeight=4    PerfScore 39.00
G_M6345_IG04:              ;; offset=006AH

For 6.0 the method is 32 byte aligned, and so this loop body runs from 0x3E … 0x6A and needs 3 fetch windows (plus presumably one more for the call).

For 5.0 the method ends up being 16 byte aligned, this is favorable for the key loop which now spans (effectively) 0x2E…0x5A or two fetch windows (plus perhaps one more for the call).

As a result 5.0 runs faster:

BenchmarkDotNet=v0.13.1.1611-nightly, OS=macOS Catalina 10.15.7 (19H1419) [Darwin 19.6.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-AIDUIT : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
  Job-NBIBBZ : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated
DelegateInvoke Job-AIDUIT .NET 5.0 net5.0 462.6 us 2.25 us 2.11 us 462.6 us 460.1 us 467.9 us 1.00 1 B
DelegateInvoke Job-NBIBBZ .NET 6.0 net6.0 530.0 us 3.86 us 3.61 us 528.2 us 525.8 us 536.9 us 1.15 2 B

But if I fix the “bug” in the 5.0 alignment code by setting COMPlus_TC_QuickJitForLoops=1 then 5.0 also gets 32 byte method alignment and so poor loop alignment, and the performance equalizes:

BenchmarkDotNet=v0.13.1.1611-nightly, OS=macOS Catalina 10.15.7 (19H1419) [Darwin 19.6.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-CFHBWL : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
  Job-SQMUMX : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT

EnvironmentVariables=COMPlus_TC_QuickJitForLoops=1  PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
WarmupCount=1  
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated
DelegateInvoke Job-CFHBWL .NET 5.0 net5.0 540.9 us 5.78 us 4.83 us 540.5 us 533.7 us 550.2 us 1.00 -
DelegateInvoke Job-SQMUMX .NET 6.0 net6.0 540.7 us 2.75 us 2.58 us 541.0 us 534.7 us 544.4 us 1.00 2 B

If you modify the benchmark to use a param instead of a static for the loop limit InnerIterationCount200000 we see similar swings in perf, as removing the class init check modifies the loop alignment.

At any rate, here’s a case where aligning a loop with a call seems to have a noticeable impact on perf, because the callee is trivial.

cc @kunalspathak

I can see if this repros on my Mac Mini, but it may take me a while to get around to it…

From the above we can see 5.0 is consistently fastest, with 6.0 or 7.0 sometimes similar to 5.0 and sometimes slower depending on the particular processor. 7.0 is generally faster than 6.0 though not always.

I modified the jit to align small (single block) loops with calls (as in the benchmark) and didn’t see any improvement, so at this point I suspect the perf differences are related to the alignment of the delegate or its precode.

I think this sort of thing is only going to show up prominently when we have frequently executed delegates that do little or no computation. So, I’m going to close this as won’t fix.

Current data with 7.0 looks better but still not quite as good as 5.0

BenchmarkDotNet=v0.13.1.1786-nightly, OS=macOS Monterey 12.4 (21F79) [Darwin 21.5.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=7.0.100-preview.5.22307.18
  [Host]     : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  Job-TXZCIP : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
  Job-NIIKGA : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  Job-HAVFNP : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
DelegateInvoke Job-KVFMUW .NET 5.0 net5.0 452.8 us 0.22 us 0.20 us 452.8 us 452.5 us 453.2 us 1.00 1 B 1.00
DelegateInvoke Job-HRZMKX .NET 6.0 net6.0 613.1 us 6.84 us 6.40 us 612.8 us 604.2 us 624.9 us 1.35 2 B 2.00
DelegateInvoke Job-PUBOIJ .NET 7.0 net7.0 518.0 us 0.85 us 0.75 us 518.2 us 517.0 us 519.5 us 1.14 2 B 2.00

I can repro this. Will see what I can uncover…

BenchmarkDotNet=v0.13.1.1611-nightly, OS=macOS High Sierra 10.13.6 (17G14042) [Darwin 17.7.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-IIHWUZ : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
  Job-SZCDQB : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  

|         Method |        Job |  Runtime | Toolchain |     Mean |   Error |  StdDev |   Median |      Min |      Max | Ratio | Allocated |
|--------------- |----------- |--------- |---------- |---------:|--------:|--------:|---------:|---------:|---------:|------:|----------:|
| DelegateInvoke | Job-IIHWUZ | .NET 5.0 |    net5.0 | 458.4 us | 1.22 us | 1.08 us | 458.3 us | 457.1 us | 461.2 us |  1.00 |       1 B |
| DelegateInvoke | Job-SZCDQB | .NET 6.0 |    net6.0 | 523.7 us | 1.02 us | 0.85 us | 523.8 us | 522.3 us | 524.8 us |  1.14 |       2 B |

I’m moving this to 7.0.0, but we’ll still want to get to the root cause to make sure we understand the impact and tradeoff.

What HW was this…?

@AndyAyersMS it was reproducible on all three x64 mac books that we have used (mine is 4 year old mac book pro, not sure about @jeffhandley or @carlossanlop laptops who provided the other results).

I need to redirect this back to you for initial investigation per Andy’s comments.

I am able to reproduce it on my macBook, the problem is that there is no easy way to get disassembly on macOS so I can’t just share it. That is why it would be better if someone from the JIT Team took a look at this.