runtime: PlaintextMVC benchmark is slow on arm64

The Plaintext-PlaintextMVC benchmark should benefit a lot from PGO (namely, from guarded devirtualization and inlining), and it does on all x64 platforms (Linux, Windows, Intel, AMD, etc.): up to 40% more RPS. Unfortunately, that is not the case on arm64, where there is no difference between DynamicPGO and Default. Moreover, the benchmark is 7-8x slower on arm64 than on x64 with DynamicPGO (I’d expect it to be only 1.5-2x slower).

It looks to me that on arm64 it’s bound to JIT_New:

while on x64 it looks like this:

Namely, this call-site (“drilled”) (https://github.com/dotnet/aspnetcore/blob/a0b950fc51c43289ab3c6dbea15926d32f3556cc/src/Mvc/Mvc.Core/src/Routing/ControllerRequestDelegateFactory.cs#L68-L101):

Arm64:

same call-site on x64:

Flamegraph for arm64 (two JIT_New are highlighted) for the first th:

x64:

Does this ring a bell for anyone (e.g. JIT_NewS_MP_FastPortable is not used, some GC feature is not implemented for arm64, some allocation tracker/profiler is enabled, etc.)? /cc @dotnet/jit-contrib @Maoni0 @jkotas @davidwrighton I can re-run the benchmark with any modifications to the JIT/VM/GC that you suggest.

Steps to get the native traces:

Arm64-Linux:

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/plaintext.benchmarks.yml --scenario mvc --profile aspnet-citrine-arm --application.collect true --application.collectStartup false --application.options.collectCounters false   --application.dotnetTrace false --application.framework net6.0 --application.environmentVariables DOTNET_TieredPGO=1  --application.environmentVariables DOTNET_TC_QuickJitForLoops=1  --application.environmentVariables DOTNET_ReadyToRun=0

x64-Linux:

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/plaintext.benchmarks.yml --scenario mvc --profile aspnet-citrine-lin --application.collect true --application.collectStartup false --application.options.collectCounters false   --application.dotnetTrace false --application.framework net6.0 --application.environmentVariables DOTNET_TieredPGO=1  --application.environmentVariables DOTNET_TC_QuickJitForLoops=1  --application.environmentVariables DOTNET_ReadyToRun=0

Powerbi link: https://aka.ms/aspnet/benchmarks (open “PGO” page there)

About this issue

State: closed
Created 3 years ago
Reactions: 1
Comments: 59 (58 by maintainers)

Most upvoted comments

@EgorBo based on how many benchmarks you must have run, I believe I should ask for some extra fans to be installed in the machine

sebastienros on Oct 13, 2021

with the data from @EgorBo and @sebastienros I got to the culprit. on arm64 we are reading the cache size this way

        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index0/size", &size))
            cacheSize = std::max(cacheSize, size);
        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index1/size", &size))
            cacheSize = std::max(cacheSize, size);
        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index2/size", &size))
            cacheSize = std::max(cacheSize, size);
        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index3/size", &size))
            cacheSize = std::max(cacheSize, size);
        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index4/size", &size))
            cacheSize = std::max(cacheSize, size);

and on this particular arm64 machine there’s no entry for the L3 cache (it only has index0/1, which are the L1 data/instruction caches, and index2, which is the L2 cache). since we take the largest, which is the L2 cache size (256k), we return 3x that, which is 768k, and the gen0 min budget is calculated as 5/8 of this, which is 480k, which is of course tiny.
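The arithmetic in this comment can be checked with a small sketch. This is not the runtime's code: `EffectiveCacheSize` and `Gen0MinBudget` are hypothetical names, and the multiplier of 3 is inferred from the figures quoted (256k becoming 768k is a factor of three); the 5/8 ratio is taken from the comment.

```cpp
#include <cstddef>

// Assumed multiplier applied to the largest cache size that was found.
// Inferred from the quoted numbers (256k -> 768k), not copied from the runtime.
constexpr std::size_t kCacheSizeMultiplier = 3;

// Hypothetical helper: the cache size value the GC ends up working with.
constexpr std::size_t EffectiveCacheSize(std::size_t largestCacheBytes) {
    return largestCacheBytes * kCacheSizeMultiplier;
}

// Hypothetical helper: gen0 min budget as 5/8 of the effective cache size.
constexpr std::size_t Gen0MinBudget(std::size_t effectiveCacheBytes) {
    return effectiveCacheBytes * 5 / 8;
}

// With only a 256k L2 visible, the numbers from the comment fall out.
static_assert(EffectiveCacheSize(256 * 1024) == 768 * 1024, "256k * 3 = 768k");
static_assert(Gen0MinBudget(768 * 1024) == 480 * 1024, "768k * 5/8 = 480k");
```

The `static_assert` lines reproduce the 768k and 480k figures at compile time.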

if folks know of a way to get the L3 cache size programmatically on linux in this case I’m all ears.

Maoni0 on Oct 11, 2021

@AntonLapounov that’s just the generation size after a GC, if there’s no pinning, it could easily be 24 bytes (which is just a min object size)

Maoni0 on Feb 2, 2022

It explains why it’s so fast then 🙂

EgorBo on Oct 8, 2021

Plaintext is not allocating and never triggers a GC so you don’t get any stats.

sebastienros on Oct 8, 2021

Notice that the counters show Gen0 size of 438,360 vs. 672. That is off a lot…

jkotas on Oct 8, 2021

NB: these are markdown formatted and also correctly spaced so you can either paste them as-is or wrap them in triple back ticks.

Stating the obvious: some x64 numbers are multiples of the arm64 ones simply because RPS is higher (e.g. Allocations, TP), so nothing weird there. But GC activity is abnormally high on arm64: 66% CPU and 59 gen 0 collections per second, and the Gen 0 size is significantly higher too.

Also, can you use --chart? This will show whether the max GC (66%) is just for a small period of time or for the whole run.

sebastienros on Oct 8, 2021

AFAIR the Docker image defined in TE’s repo for Actix is not working for ARM64. Will try again.

sebastienros on Oct 8, 2021