runtime: Performance regression: 6x slower array allocation on Alpine

From data I got from @danmosemsft, collected by running dotnet/performance microbenchmarks on Alpine 3.11 via WSL2, it looks like allocating arrays of both value and reference types became 6 times slower compared to 3.1.

Initially I thought it was just an outlier, but I can see the same pattern for other collections that internally use arrays (Queue, List, Stack, etc.). The regression is specific to Alpine; Ubuntu 18.04 (with and without WSL2) is fine.

@jkotas @janvorli who would be the best person to investigate that?

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --filter 'System.Collections.CtorGivenSize<Int32>.Array'

System.Collections.CtorGivenSize<Int32>.Array(Size: 512)

| Conclusion | Base (ns) | Diff (ns) | Base/Diff | Modality | Operating System | Bit | Processor Name | Base Runtime | Diff Runtime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Same | 181.79 | 183.96 | 0.99 | | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20413.9 |
| Same | 92.89 | 94.47 | 0.98 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20404.3 |
| Same | 96.05 | 94.36 | 1.02 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
| Same | 114.74 | 111.94 | 1.03 | | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20413.9 |
| Same | 80.49 | 79.98 | 1.01 | | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.9 |
| Same | 67.30 | 67.66 | 0.99 | bimodal | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
| Same | 86.10 | 79.17 | 1.09 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.14 |
| Same | 97.50 | 98.77 | 0.99 | | Windows 10.0.18363.959 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20420.14 |
| Slower | 127.02 | 150.46 | 0.84 | bimodal | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.5 |
| Slower | 193.61 | 287.83 | 0.67 | bimodal | ubuntu 18.04 | Arm64 | Unknown processor | .NET Core 3.1.6 | 6.0.100-alpha.1.20421.6 |
| Same | 99.85 | 103.42 | 0.97 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20403.23 |
| Slower | 138.73 | 151.37 | 0.92 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20404.2 |
| Slower | 72.85 | 515.56 | 0.14 | | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 6.0.100-alpha.1.20421.6 |
| Slower | 78.85 | 90.76 | 0.87 | | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 36 (36 by maintainers)

Most upvoted comments

The container was missing libgdiplus on Alpine, but I was able to resolve it by adding the package from http://dl-3.alpinelinux.org/alpine/edge/testing/.

I was able to validate that the fix gets the perf to be comparable with 3.1:

| Alpine | 3.1 | 5.0 | 6.0 |
| --- | --- | --- | --- |
| CtorGivenSize<Int32>.Array | 80.30 ns | 737.6 ns | 87.62 ns |

We can also gather this information from /sys/devices/system/cpu without doing Intel/AMD-specific kung-fu. For example, /sys/devices/system/cpu/cpu0/cache/index0/size reports the size of one of cpu0's caches; each indexN directory also contains level and type files identifying which cache it describes (on typical x86 systems index0 is L1 data, index1 is L1 instruction, index2 is L2, and index3 is L3).

Adding Alpine is a good start, we can always re-evaluate if we find any other distro specific issues.

Others are better positioned to answer that one; off the top of my head I cannot remember Linux regressions that wouldn't show up in either Ubuntu or Alpine.

imo, the better fix would be:

- #if defined(HOST_ARM64)
+ #if defined(TARGET_LINUX)

to keep support for non-Linux Unix-like operating systems intact (macOS, FreeBSD, SunOS, and so forth).

It repros without WSL2 too.

The problem is that the GC is running something like 100x more often than it should. It is likely a problem in the budget computation. One of the places to check is PAL_GetLogicalProcessorCacheSizeFromOS.

Alpine is different from other distros because it uses musl libc instead of glibc. Relevant to this particular fix, constants such as _SC_LEVEL1_DCACHE_SIZE are not defined for musl, so the fallback to a different method of retrieving the cache size was mostly Alpine-specific (there are several such subtle differences on Alpine).

@Lxiamail @billwert @DrewScoggins this is evidence I think that regular Alpine runs in the lab are important. This would have justified servicing, I think. I’m not sure where we left that conversation @billwert ?

@danmosemsft Yes, we have an email discussion about adding additional OS coverage in the perf lab. We will look into @adamsitnik's finalized report, and I'm trying to get .NET Core OS usage telemetry data. Hopefully we can identify the commonly used OSes and add them to the perf lab.

@danmosemsft Yeah it was very helpful in pinpointing where the issue might be. @adamsitnik yeah I will create a separate issue to track how we can add a test/asserts for this.

I investigated this more and it appears that none of the _SC_LEVEL1_DCACHE_SIZE-style constants are defined for Alpine (musl), so PAL_GetLogicalProcessorCacheSizeFromOS will always return 0.

I wonder if https://github.com/dotnet/runtime/pull/34488 caused the regression, since I notice this case is missing when compared to 3.1.