runtime: [macOS] Potential regression in delegate invocation
It seems to affect only macOS (cc @Lxiamail @jeffhandley).
git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net5.0 net6.0 --filter PerfLabTests.DelegatePerf.DelegateInvoke
PerfLabTests.DelegatePerf.DelegateInvoke
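(For context: the benchmark invokes a trivial delegate in a tight loop. The sketch below is only an approximation of that shape; the type and member names are illustrative, not the actual PerfLabTests source.)

```csharp
// Hypothetical sketch of the benchmark shape (not the exact PerfLabTests code):
// a delegate with a trivial body, invoked once per iteration of a hot loop.
public class DelegateInvokeSketch
{
    private delegate long Accumulate(long x, long y);   // illustrative delegate type

    private static readonly Accumulate s_add = (x, y) => x + y;   // trivial callee

    // The hot inner loop: one delegate call per iteration and almost no other work,
    // so invocation overhead and code alignment dominate the measured time.
    public static long DelegateInvoke(long iterations)
    {
        long sum = 0;
        for (long i = 0; i < iterations; i++)
        {
            sum = s_add(sum, 1);
        }
        return sum;
    }
}
```

Because the delegate body does almost no work, the measurement is dominated by invocation overhead and by how the loop and callee happen to be aligned.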
| Result | Ratio | Operating System |
|---|---|---|
| Same | 1.01 | Windows 10.0.19043.1165 |
| Same | 0.99 | Windows 10.0.20348 |
| Same | 1.00 | Windows 10.0.20348 |
| Same | 1.00 | Windows 10.0.18363.1621 |
| Same | 0.97 | Windows 8.1 |
| Same | 1.00 | Windows 10.0.19042.685 |
| Same | 1.00 | Windows 10.0.19043.1165 |
| Same | 0.99 | Windows 10.0.22454 |
| Same | 0.99 | Windows 10.0.22451 |
| Same | 1.00 | Windows 10.0.19042.1165 |
| Slower | 0.29 | Windows 7 SP1 |
| Same | 0.99 | centos 8 |
| Same | 1.00 | debian 10 |
| Same | 0.99 | rhel 7 |
| Same | 1.02 | sles 15 |
| Same | 1.01 | opensuse-leap 15.3 |
| Same | 1.00 | ubuntu 18.04 |
| Same | 1.00 | ubuntu 18.04 |
| Same | 1.00 | alpine 3.13 |
| Same | 1.00 | ubuntu 16.04 |
| Faster | 1.33 | Windows 10.0.19043.1165 |
| Faster | 1.43 | Windows 10.0.22000 |
| Same | 1.00 | Windows 10.0.19043.1165 |
| Same | 1.00 | Windows 10.0.18363.1621 |
| Same | 0.99 | Windows 10.0.19043.1165 |
| Slower | 0.89 | macOS Big Sur 11.5.2 |
| Slower | 0.72 | macOS Big Sur 11.5.2 |
| Slower | 0.78 | macOS Big Sur 11.4 |
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 19 (19 by maintainers)
FWIW, TieredPGO does nicely here in .NET 7, thanks to @jakobbotsch.
OK, I think this is related to loop alignment.
In 5.0 we only 32-byte aligned Tier1 methods with loops. This was fixed in 6.0 (#42909) to 32-byte align all optimized methods with loops. We later went on to add alignment padding for loops in 6.0, but we bypass that padding if a loop contains a call.
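To make the padding-bypass condition concrete, here is a hypothetical pair of loops (the names are mine, not from the runtime or the benchmark): the first, call-free loop is the kind that gets loop-head padding in 6.0, while the second contains a call and so is left unpadded.

```csharp
using System;

static class AlignmentPaddingExamples
{
    // Call-free hot loop: eligible for loop-head alignment padding
    // (per the 6.0 behavior described above).
    public static long SumNoCall(long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++)
            sum += i;
        return sum;
    }

    // Loop containing a call: the padding is bypassed, so where the loop
    // lands falls out of the method's 16/32-byte alignment alone.
    public static long SumWithCall(Func<long, long> f, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++)
            sum += f(i);   // call inside the loop body
        return sum;
    }
}
```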
In this test the jitted codegen is identical in 5.0 and 6.0, and the key inner loop is in `InvokeDelegate`.

For 6.0 the method is 32-byte aligned, so this loop body runs from 0x3E…0x6A and needs 3 fetch windows, since it crosses two 32-byte fetch boundaries (plus presumably one more window for the call).

For 5.0 the method ends up being 16-byte aligned, which is favorable for the key loop: it now spans (effectively) 0x2E…0x5A, crossing only one 32-byte boundary, so just two fetch windows (plus perhaps one more for the call).
As a result, 5.0 runs faster.
But if I fix the “bug” in the 5.0 alignment code by setting `COMPlus_TC_QuickJitForLoops=1`, then 5.0 also gets 32-byte method alignment (and so poor loop alignment), and the performance equalizes.

If you modify the benchmark to use a param instead of the static `InnerIterationCount` (200000) for the loop limit, we see similar swings in perf, as removing the class init check modifies the loop alignment.

At any rate, here’s a case where aligning a loop with a call seems to have a noticeable impact on perf, because the callee is trivial.
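A rough sketch of the modification described above, assuming the usual PerfLabTests pattern of a static `InnerIterationCount` loop limit (the class and method names here are illustrative):

```csharp
using System;

public class LoopLimitVariants
{
    public static long InnerIterationCount = 200000;

    // Original shape: the loop limit is read from a static, which can require
    // a class-init check ahead of the loop and shifts the loop's start offset.
    public static long UsingStatic(Func<long, long> f)
    {
        long sum = 0;
        for (long i = 0; i < InnerIterationCount; i++)
            sum += f(i);
        return sum;
    }

    // Modified shape: the limit is a parameter; without the class-init check
    // the code ahead of the loop shrinks, changing the loop's alignment and,
    // per the comment above, swinging the measured perf.
    public static long UsingParam(Func<long, long> f, long limit)
    {
        long sum = 0;
        for (long i = 0; i < limit; i++)
            sum += f(i);
        return sum;
    }
}
```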
cc @kunalspathak
I can see if this repros on my Mac Mini, but it may take me a while to get around to it…
From the above we can see that 5.0 is consistently the fastest, with 6.0 and 7.0 sometimes similar to 5.0 and sometimes slower, depending on the particular processor. 7.0 is generally faster than 6.0, though not always.
I modified the jit to align small (single block) loops with calls (as in the benchmark) and didn’t see any improvement, so at this point I suspect the perf differences are related to the alignment of the delegate or its precode.
I think this sort of thing is only going to show up prominently when we have frequently executed delegates that do little or no computation. So, I’m going to close this as won’t fix.
Current data with 7.0 looks better, but still not quite as good as 5.0.
I can repro this. Will see what I can uncover…
I’m moving this to 7.0.0, but we’ll still want to get to the root cause to make sure we understand the impact and tradeoff.
@AndyAyersMS it was reproducible on all three x64 MacBooks that we have used (mine is a four-year-old MacBook Pro; I'm not sure about the laptops of @jeffhandley or @carlossanlop, who provided the other results).
I am able to reproduce it on my MacBook. The problem is that there is no easy way to get disassembly on macOS, so I can't just share it. That is why it would be better if someone from the JIT team took a look at this.