roslyn: Major performance regression of the array initialization pattern

In https://github.com/dotnet/performance/pull/2534 I’ve bumped BDN version to see if @MichalStrehovsky fix (https://github.com/dotnet/BenchmarkDotNet/pull/2046) helped. It did, but new issue popped out.

86 out of 4334 benchmarks that were working fine in the past (2 months ago was the last time I run them), are now failing with PlatformNotSupportedException:

 System.PlatformNotSupportedException: Operation is not supported on this platform.
    at Internal.Runtime.CompilerHelpers.ThrowHelpers.ThrowPlatformNotSupportedException() + 0x24
    at System.Net.Primitives.Tests.IPAddressPerformanceTests.<ByteAddresses>d__0.MoveNext() + 0x45
    at System.Linq.Enumerable.<OfTypeIterator>d__65`1.MoveNext() + 0x6b
    at System.Linq.Enumerable.MaxInteger[TSource, TResult](IEnumerable`1, Func`2) + 0x48
    at System.Net.Primitives.Tests.IPAddressPerformanceTests..ctor() + 0x9d
    at BenchmarkDotNet.Autogenerated.Runnable_1..ctor() + 0x2a
    at BenchmarkDotNet.Autogenerated.Runnable_1.Run(IHost, String) + 0x6e
    at BenchmarkDotNet.Autogenerated.UniqueProgramName.AfterAssemblyLoadingAttached(String[]) + 0x1a2

for code that seems to be very simple:

https://github.com/dotnet/performance/blob/062e33e11c917ce408ce4b447552185d6565b035/src/benchmarks/micro/libraries/System.Net.Primitives/IPAddressPerformanceTests.cs#L17-L26

I have moved this code to a standalone app and was not able to repro the issue.

Repro steps:

git clone https://github.com/dotnet/performance.git
cd performance
py .\scripts\benchmarks_ci.py -f nativeaot7.0 --filter System.Net.Primitives.Tests.IPAddressPerformanceTests.Ctor_Span --bdn-arguments "--keepFiles true"

BDN reports the path where to project is stored (the last path in the line below):

// start dotnet  build -c Release -r win-x64 --no-restore /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 /p:Deterministic=true /p:Optimize=true /p:NuGetPackageRoot="D:\projects\performance\artifacts\packages" in D:\projects\performance\artifacts\bin\MicroBenchmarks\Release\net7.0\Job-NYBBDR

cc @jkotas

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 17 (15 by maintainers)

Most upvoted comments

Yes, we can do all that, but the benefits are vanishingly small.

Part of this is the existing overhead. For large allocations, the costs can be fairly visible and there are cases where those are frequent enough or where they are happening in contexts where the costs do matter.

GC.AllocateUninitializedArray is also a security trap due to uninitialized data.

This is no more a trap than NativeMemory.Alloc and while NativeMemory.Alloc is often a suitable choice, there are many scenarios where it is not.

I kind of like that the API is a performance trap since it forces people to use it only in places where it is actually needed.

I don’t think that a performance oriented API should be a performance trap. If we want people to be aware of the potential pitfalls of it, that’s something that [UnsafeCallersOnly] or similar should be for. As the API stands today, it has paths in it to handle the small case but to use it “correctly” (even in cases where it is actually needed), the devs realistically need to duplicate those same checks.

I’m all for telling people “this is dangerous, be mindful of the holes”. However, I don’t think we should be intentionally making it more problematic to use correctly where there is a decently simple fix that can be made.

Quick standalone performance test:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 100_000_000; i++)
    CreateInitializedArray();
Console.WriteLine(sw.ElapsedMilliseconds);

[MethodImpl(MethodImplOptions.NoInlining)]
static int[] CreateInitializedArray()
{
    return new int[] { 1, 2, 3, 4 };
}

Run using:

set DOTNET_TieredCompilation=0
dotnet run -c Release`

.NET 7 SDK Preview 6: 390ms Bleeding edge Roslyn (4.4.0-1.22371.1): 10899ms

As you can see, this change introduced 27x regression of the simple array initialization pattern used in this micro-benchmark.

BTW: I do not see any performance numbers in #62392 . In dotnet/runtime, we require performance numbers demonstrating the improvement for all performance optimization PRs. It is not unusual for performance optimizations to not have the intended effect.

(EDIT: Corrected the perf numbers.)

use it “correctly” (even in cases where it is actually needed), the devs realistically need to duplicate those same checks.

I disagree. We are using it correctly in our own CoreLib in StringBuilder, ArrayPool and number of other places. Number of these places were heavily micro-benchmarked when the uses of this API were introduced to make sure that there is no regression, and none of them have the checks duplicated.

there is a decently simple fix that can be made.

I do not think that the fix is simple. The GC is heavily optimized for giving you zero-initialized memory. If we were to build uninitialized memory path that is as optimized as the existing zero-initialized path, it would be fairly complex change, and it would be hard to make it pay-for-play.

Yes, we can do all that, but the benefits are vanishingly small.

GC.AllocateUninitializedArray is also a security trap due to uninitialized data. I kind of like that the API is a performance trap since it forces people to use it only in places where it is actually needed.

The IL for the method in question looks like this - there’s a GC::AllocateUninitializedArray instead of newarr. This is breaking pattern recognition in RyuJIT. In NativeAOT this is a PlatformNotSupported because RuntimeHelpers::InitializeArray needs to be expanded as an intrinsic. In CoreCLR, breaking the pattern recognition is a deoptimization. Good thing we found it.

Cc @jaredpar to make sure this is intentional new codegen for this.

RyuJIT will need to be adapted to recognize the new pattern.

.method private final hidebysig newslot virtual 
	instance bool MoveNext () cil managed 
{
	.override method instance bool [System.Runtime]System.Collections.IEnumerator::MoveNext()
	// Method begins at RVA 0x701e8
	// Header size: 12
	// Code size: 117 (0x75)
	.maxstack 4
	.locals init (
		[0] int32
	)

	IL_0000: ldarg.0
	IL_0001: ldfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_0006: stloc.0
	IL_0007: ldloc.0
	IL_0008: switch (IL_001b, IL_0043, IL_006c)

	IL_0019: ldc.i4.0
	IL_001a: ret

	IL_001b: ldarg.0
	IL_001c: ldc.i4.m1
	IL_001d: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_0022: ldarg.0
	IL_0023: ldc.i4.4
	IL_0024: ldc.i4.0
	IL_0025: call !!0[] [System.Runtime]System.GC::AllocateUninitializedArray<uint8>(int32, bool)
	IL_002a: dup
	IL_002b: ldtoken field int32 '<PrivateImplementationDetails>'::A58DB492CDCA7F4880696F0FD6743AE3D8F89D85DF07A95D10C4DD82D3C5881A
	IL_0030: call void [System.Runtime]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [System.Runtime]System.Array, valuetype [System.Runtime]System.RuntimeFieldHandle)
	IL_0035: stfld object System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>2__current'
	IL_003a: ldarg.0
	IL_003b: ldc.i4.1
	IL_003c: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_0041: ldc.i4.1
	IL_0042: ret

	IL_0043: ldarg.0
	IL_0044: ldc.i4.m1
	IL_0045: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_004a: ldarg.0
	IL_004b: ldc.i4.s 16
	IL_004d: ldc.i4.0
	IL_004e: call !!0[] [System.Runtime]System.GC::AllocateUninitializedArray<uint8>(int32, bool)
	IL_0053: dup
	IL_0054: ldtoken field valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=16' '<PrivateImplementationDetails>'::E531DA00E6A1B6D170859CF4F00DF857B1338B4950DB57E2B81B6D7A2ACD8981
	IL_0059: call void [System.Runtime]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [System.Runtime]System.Array, valuetype [System.Runtime]System.RuntimeFieldHandle)
	IL_005e: stfld object System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>2__current'
	IL_0063: ldarg.0
	IL_0064: ldc.i4.2
	IL_0065: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_006a: ldc.i4.1
	IL_006b: ret

	IL_006c: ldarg.0
	IL_006d: ldc.i4.m1
	IL_006e: stfld int32 System.Net.Primitives.Tests.IPAddressPerformanceTests/'<ByteAddresses>d__0'::'<>1__state'
	IL_0073: ldc.i4.0
	IL_0074: ret
} // end of method '<ByteAddresses>d__0'::MoveNext