runtime: AddRemoveFromDifferentThreads.ConcurrentStack benchmark hangs on ARM64
The AddRemoveFromDifferentThreads<string>.ConcurrentStack
benchmark hangs on Windows ARM64 (.NET 7 Preview 1).
Repro:
git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py --architecture arm64 -f net7.0 --dotnet-versions 7.0.100-preview.1.22077.12 --filter *AddRemoveFromDifferentThreads*string*ConcurrentStack*
I was not able to attach the Visual Studio debugger to the live process, as VS does not seem to support that on ARM64 yet.
I’ve captured a dump and uploaded it here.
PerfView shows me that the process is stuck in the following loop:
Since AddRemoveFromDifferentThreads<int>.ConcurrentStack
and AddRemoveFromDifferentThreads<string>.ConcurrentBag
work fine, I suspect a codegen issue.
Hardware: Surface Pro X, Win 10, arm64
Please let me know if there is any way I could help.
About this issue
- State: closed
- Created 2 years ago
- Comments: 58 (58 by maintainers)
Commits related to this issue
- crossgen2: Report CORINFO_FLG_NOGCCHECK back for internal calls When there are loops or tailcalls in a function the JIT will check for a dominating call that is a GC safepoint to figure out if the fu... — committed to jakobbotsch/runtime by jakobbotsch 2 years ago
- crossgen2: Report CORINFO_FLG_NOGCCHECK back for internal calls (#65300) When there are loops or tailcalls in a function the JIT will check for a dominating call that is a GC safepoint to figure out... — committed to dotnet/runtime by jakobbotsch 2 years ago
- don't try to run AddRemoveFromDifferentThreads.ConcurrentStack benchmark on ARM64 as it hangs details: https://github.com/dotnet/runtime/issues/64980 — committed to adamsitnik/performance by adamsitnik 2 years ago
- Few workarounds for the next manual perf run (#2282) * define NET7_0_PREVIEW2_OR_GREATER const for .NET 7 Preview2+ builds * don't use Regex.Count API for older versions of .NET 7 (used as baselin... — committed to dotnet/performance by adamsitnik 2 years ago
This is correct as far as I know.
No, this is exactly why we decided against running the microbenchmarks on the published SDK. We would find regressions and then have to sort through hundreds of possible commits between the two builds from the runtime repo to find the source of the regression. It was also extremely tedious to trace an SDK build back to the corresponding commit hashes in the runtime repo.
Do we have some kind of Arm64 VMs that we could do this validation on, or would we need to use our current Arm64 performance hardware? We have very limited Arm64 hardware, and moving this level of validation there would overtax that limited infrastructure.
I remembered it right: https://github.com/dotnet/coreclr/pull/16039
This is an arm/arm64-specific issue caused by the architecture (the use of LR for return addresses).
Ah, interesting, I had not noticed that. I've run the repro on my arm64 device again and found that we really are attempting to suspend the thread again and again without success. I think the problem is that the TryPop method is marked as Has tailcalls: 1, and IIRC we don't hijack the return addresses of such methods. Since that method is called in a tight loop without any GC synchronization, there is no place where we can suspend the thread.
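The suspension failure described above can be illustrated with a toy model. This is a deliberately simplified sketch, not CoreCLR's suspension code: MethodInfo, try_suspend_once and suspend are hypothetical names, and the only rules encoded are the two from this comment (fully interruptible code can be stopped anywhere; otherwise the runtime hijacks the return address, which is skipped for methods with tailcalls). A thread that is caught inside TryPop on every attempt, as the repeated suspension attempts suggest, is never stopped.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MethodInfo:
    """Hypothetical per-method facts that the toy suspension logic looks at."""
    name: str
    fully_interruptible: bool  # GC info is valid at every instruction
    has_tailcalls: bool        # e.g. the "Has tailcalls: 1" reported for TryPop


def try_suspend_once(method: MethodInfo) -> bool:
    """One attempt to stop a thread caught executing `method` (toy rules only).

    * Fully interruptible code can be stopped wherever the thread is.
    * Otherwise the runtime would redirect (hijack) the return address so the
      thread stops when the method returns; following this thread, that is
      not done for methods with tailcalls, so the attempt fails.
    """
    if method.fully_interruptible:
        return True
    return not method.has_tailcalls


def suspend(observed: Iterable[MethodInfo]) -> Optional[int]:
    """Retry once per observed location; return the successful attempt or None."""
    for attempt, method in enumerate(observed, start=1):
        if try_suspend_once(method):
            return attempt
    return None
```

For example, suspend([MethodInfo("TryPop", False, True)] * 1000) returns None no matter how many attempts are made, while marking the same method fully interruptible lets the first attempt succeed. Whether that is the actual difference between the R2R and JIT-compiled versions of TryPop is not established by this model.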
I am more inclined to just do the daily runs based on the SDK, because we ship R2R code and, as you can see, things change between R2R and JIT. When running with CoreRun, we are not measuring the performance of the bits we ship. @danmoseley - any thoughts?
I can confirm that the issue doesn’t occur with R2R disabled.
I have verified that this issue doesn't reproduce when I set
COMPlus_ReadyToRun=0
so it is most likely related to codegen. However, this benchmark runs without problems in our perf lab. Is that because we don't use R2R images for those runs, @adamsitnik or @DrewScoggins? Should we use R2R images? At the end of the release, when we compare .NET X against .NET Y, we might see surprises that were not caught throughout the release. I will take a look at the codegen.
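To repeat the with/without-R2R comparison described above, the repro can be scripted. This is a hedged sketch, not part of dotnet/performance: REPRO_ARGS, repro_env and run_repro are hypothetical names, sys.executable stands in for the py launcher used in the original command, and it assumes that the benchmark processes started by benchmarks_ci.py inherit the script's environment.

```python
import os
import subprocess
import sys
from typing import Dict, List

# Arguments copied from the repro in this issue; the relative path assumes the
# current directory contains the cloned dotnet/performance repository.
REPRO_ARGS: List[str] = [
    os.path.join("performance", "scripts", "benchmarks_ci.py"),
    "--architecture", "arm64",
    "-f", "net7.0",
    "--dotnet-versions", "7.0.100-preview.1.22077.12",
    "--filter", "*AddRemoveFromDifferentThreads*string*ConcurrentStack*",
]


def repro_env(disable_r2r: bool) -> Dict[str, str]:
    """Copy of the current environment with the ReadyToRun switch set explicitly.

    Any existing COMPlus_ReadyToRun entry is dropped (compared case-insensitively,
    as Windows treats environment names), then COMPlus_ReadyToRun=0 is added when
    disable_r2r is True so the runtime ignores precompiled R2R code and JIT-compiles
    methods instead.
    """
    env = {k: v for k, v in os.environ.items() if k.upper() != "COMPLUS_READYTORUN"}
    if disable_r2r:
        env["COMPlus_ReadyToRun"] = "0"
    return env


def run_repro(disable_r2r: bool, timeout_s: float) -> bool:
    """Run the benchmark script; return False if it is still running after timeout_s.

    On timeout only the direct child (the Python script) is killed; benchmark
    processes it launched may need to be stopped by hand.
    """
    try:
        subprocess.run([sys.executable, *REPRO_ARGS],
                       env=repro_env(disable_r2r), timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True
```

Calling run_repro(disable_r2r=False, timeout_s=3600) and then run_repro(disable_r2r=True, timeout_s=3600) should, if the hang reproduces as reported here, return False for the first run and True for the second; the one-hour timeout is an arbitrary choice.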