runtime: Dynamic PGO Microbenchmark Regressions

This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of https://github.com/dotnet/runtime/issues/84264 which tracked regressions from PGO before it was enabled.

The report below is collated from the following autofiling reports.

The table is auto-generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The “Score” is the geomean regression across all architectures; benchmarks that did not regress (or were not reported) on some architectures are assumed to have produced the same results with and without PGO. “Recent Score” is the current performance (as of 2023-06-06) versus the non-PGO result; “Orig Score” is based on the results of auto filing. They will differ if benchmark performance has improved or regressed since the auto filing ran (see for example the results for System.Text.Json.Tests.Perf_Get.GetByte, which has improved already).
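As a rough illustration of the scoring described above, here is a minimal Python sketch of a geomean over per-configuration regression ratios. The function name `geomean_score` and the `total_configs` parameter are hypothetical and are not taken from the actual tool; the input values are the recent per-configuration ratios listed for System.Text.Json.Tests.Perf_Get.GetInt16 in the table.

```python
import math

def geomean_score(ratios, total_configs=None):
    """Geometric mean of per-configuration regression ratios.

    If total_configs is given, configurations without a reported ratio
    are padded with 1.0 (no change), as the description above assumes.
    This is an illustrative sketch, not the tool's actual code.
    """
    values = list(ratios)
    if total_configs is not None:
        values += [1.0] * (total_configs - len(values))
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Recent per-configuration ratios listed for Perf_Get.GetInt16.
get_int16_recent = [1.33, 1.35, 1.90, 2.29, 2.10]

reported_only = geomean_score(get_int16_recent)
padded_to_six = geomean_score(get_int16_recent, total_configs=6)
print(f"reported only: {reported_only:.2f}, padded to six: {padded_to_six:.2f}")
```

On these inputs the unpadded geomean rounds to 1.75, which is the Recent Score shown for that row; padding a sixth configuration with 1.0 gives a lower value. Whether the tool pads missing configurations is not something this sketch can confirm.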

Only the 36 entries with recent scores >= 1.3 are included; this leaves off approximately 220 more rows with scores below 1.3. Our plan is to prioritize investigation of the included benchmarks initially, as they have the largest aggregate regressions. If time permits, we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle.

Each arch/os result is a hyperlink to the performance data graph for that benchmark. ~~Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.~~

[edit: had to regenerate the table once already, as the scoring logic was off] [edit: have x64 win intel data now, new table. Note current results have shifted so table is somewhat different…]

cc @dotnet/jit-contrib

Per-configuration columns, in order: arm64-lin-ampere, arm64-win-surface, arm64-win-ampere, x64-lin-intel, x64-win-intel, x64-win-amd. Each result is given as recent / orig; only configurations that reported a regression have an entry, listed in column order.

| Notes | Recent Score | Orig Score | Per-configuration results (recent / orig) | Benchmark |
| --- | --- | --- | --- | --- |
| noise | 3.38 | 1.37 | 3.37 / 1.36 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: “zqj”, Options: None) |
| noise | 3.36 | 1.37 | 3.36 / 1.37 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: “zqj”, Options: NonBacktracking) |
| notes | 2.71 | 3.39 | 2.71 / 3.39 | System.Memory.Span(Int32).EndsWith(Size: 4) |
| likely same as above | 2.62 | 3.03 | 2.55 / 2.27, 2.59 / 3.04 | System.Memory.Span(Int32).SequenceEqual(Size: 4) |
| likely same as above | 1.87 | 1.76 | 1.87 / 1.76 | System.Memory.Span(Int32).SequenceCompareToDifferent(Size: 512) |
| (lack of) if conversion | 1.82 | 1.80 | 1.67 / 1.63, 1.93 / 1.92, 1.86 / 1.85 | System.Tests.Perf_Random.NextSingle |
| budget | 1.75 | 1.88 | 1.33 / 1.47, 1.35 / 1.49, 1.90 / 1.99, 2.29 / 2.43, 2.10 / 2.19 | System.Text.Json.Tests.Perf_Get.GetInt16 |
| BDN | 1.73 | 2.81 | 3.55 / 3.54, 1.89 / 2.00, 1.28 / 4.73, 1.32 / 2.01, 1.39 / 2.68 | System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000) |
| notes | 1.64 | 1.63 | 1.84 / 1.82, 1.65 / 1.64 | System.Tests.Perf_UInt32.TryParseHex(value: “0”) |
| budget | 1.61 | 1.70 | 1.27 / 1.44, 1.28 / 1.46, 1.24 / 1.18, 2.09 / 2.17, 2.25 / 2.33, 1.86 / 1.94 | System.Text.Json.Tests.Perf_Get.GetSByte |
| bimodal | 1.61 | 1.59 | 1.60 / 1.58 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: “Sherlock Holmes”, Options: Compiled) |
| cast expansion | 1.60 | 1.64 | 1.82 / 1.87, 1.41 / 1.43 | System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstSingleSegment |
| cast expansion | 1.58 | 1.62 | 1.58 / 1.62 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSpanTenSegments |
| cast expansion | 1.52 | 1.65 | 1.48 / 1.81, 1.56 / 1.50 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSingleSegment |
| cast expansion | 1.50 | 1.73 | 1.88 / 2.13, 1.20 / 1.41 | System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstTenSegments |
| likely same as span cases above | 1.48 | 1.28 | 1.48 / 1.28 | System.Memory.Span(Int32).Reverse(Size: 4) |
| cast expansion | 1.47 | 1.44 | 1.47 / 1.44 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSpanSingleSegment |
| notes | 1.47 | 1.42 | 1.46 / 1.42 | Benchstone.BenchF.InvMt.Test |
| unclear | 1.46 | 1.15 | 1.46 / 1.15 | MicroBenchmarks.Serializers.Json_FromStream(MyEventsListerViewModel).DataContractJsonSerializer_ |
| fixed itself | 1.45 | 1.09 | 1.45 / 1.09 | System.Tests.Perf_Uri.EscapeDataString(input: "{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{ |
| unclear | 1.44 | 1.44 | 1.44 / 1.44 | Burgers.Test1 |
| unclear | 1.43 | 1.27 | 1.43 / 1.27 | System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfNumbers) |
| unclear, linux arm64 only | 1.41 | 1.58 | 1.41 / 1.58 | System.Text.Tests.Perf_StringBuilder.Append_Char_Capacity(length: 100000) |
| unclear, linux arm64 only | 1.39 | 1.62 | 1.39 / 1.62 | BenchmarksGame.RegexRedux_5.RunBench(options: Compiled) |
| bimodal | 1.39 | 1.39 | 1.39 / 1.39 | System.MathBenchmarks.Single.Min |
| bimodal | 1.39 | 1.39 | 1.39 / 1.39 | System.MathBenchmarks.Single.Max |
| unclear, linux arm64 only | 1.39 | 1.32 | 1.39 / 1.32 | System.IO.Pipes.Tests.Perf_NamedPipeStream.ReadWriteAsync(size: 1000000, Options: Asynchronous) |
| noise | 1.38 | 1.29 | 1.38 / 1.29 | System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000) |
| bimodal | 1.37 | 1.37 | 1.37 / 1.37 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: “zqj”, Options: Compiled) |
| notes | 1.37 | 1.36 | 1.26 / 1.29, 1.42 / 1.43, 1.24 / 1.28, 1.60 / 1.48 | System.Collections.Sort(IntStruct).Array(Size: 512) |
| budget | 1.36 | 1.93 | 1.15 / 1.56, 1.15 / 1.58, 1.27 / 1.66, 1.42 / 2.14, 1.80 / 2.67, 1.49 / 2.24 | System.Text.Json.Tests.Perf_Get.GetByte |
| noise | 1.35 | 1.31 | 1.36 / 1.33 | System.Memory.Span(Char).IndexOfAnyTwoValues(Size: 512) |
| arm64 only; ldar vs dmb | 1.35 | 1.36 | 1.35 / 1.34, 1.38 / 1.40 | System.Collections.CtorFromCollection(Int32).ConcurrentBag(Size: 512) |
| fixed by physical promotion | 1.35 | 1.36 | 1.35 / 1.36 | Devirtualization.EqualityComparer.ValueTupleCompareWrapped |
| budget | 1.34 | 1.42 | 1.42 / 1.26, 1.28 / 1.38, 1.35 / 1.44, 1.35 / 1.42, 1.35 / 1.55, 1.31 / 1.46 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToStream(Mode: SourceGen) |
| notes | 1.34 | 1.45 | 1.18 / 1.29, 1.40 / 1.44, 1.13 / 1.41, 1.71 / 1.71 | System.Collections.Sort(IntStruct).List(Size: 512) |
| notes | 1.33 | 1.33 | 1.33 / 1.33 | System.Tests.Perf_HashCode.Combine_1 |
| inlining different; exposed local | 1.33 | 1.32 | 1.34 / 1.33, 1.32 / 1.32 | System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple) |
| notes | 1.33 | 1.18 | 1.33 / 1.18 | System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfStrings) |
| budget | 1.32 | 1.37 | 1.24 / 1.39, 1.20 / 1.28, 1.37 / 1.15, 1.39 / 1.46, 1.45 / 1.57, 1.27 / 1.39 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToWriter(Mode: SourceGen) |
| budget | 1.32 | 1.39 | 1.37 / 1.28, 1.22 / 1.42, 1.34 / 1.31, 1.32 / 1.38, 1.30 / 1.50, 1.34 / 1.38 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToUtf8Bytes(Mode: SourceGen) |
| budget | 1.31 | 1.88 | 1.15 / 1.59, 1.18 / 1.62, 1.03 / 1.37, 1.49 / 2.22, 1.66 / 2.49, 1.49 / 2.24 | System.Text.Json.Tests.Perf_Get.GetUInt16 |
| budget | 1.31 | 1.33 | 1.38 / 1.25, 1.20 / 1.23, 1.23 / 1.26, 1.35 / 1.46, 1.40 / 1.40, 1.41 / 1.43 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToString(Mode: SourceGen) |
| jcc errata | 1.31 | 1.39 | 1.31 / 1.39 | Span.Sorting.QuickSortSpan(Size: 512) |
| lack of cold inline exposes local | 1.31 | 1.29 | 1.31 / 1.31, 1.31 / 1.27 | System.Memory.ReadOnlySequence.Slice_Start_And_Length(Segment: Multiple) |
| budget | 1.31 | 1.39 | 1.32 / 1.19, 1.20 / 1.50, 1.31 / 1.37, 1.40 / 1.50, 1.31 / 1.34 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeObjectProperty(Mode: SourceGen) |
| lack of ldapr | 1.30 | 1.30 | 1.29 / 1.30, 1.30 / 1.30 | System.Collections.CtorFromCollection(String).ConcurrentBag(Size: 512) |

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 45 (45 by maintainers)

Most upvoted comments

Please keep in mind that both bubble sort and IndexOf are heavily dependent on memory alignment, so the regression may be unrelated to PGO. The easiest way to verify is to allocate aligned memory using NativeMemory.AlignedAlloc, create a span over it, and try to repro.

Makes sense. I am just adding tests here that were not previously included, but regressed over the same commit range as this check-in.

Also note that bubble sort runs for a very long time, and so BDN + lab customization is likely not reliably measuring the Tier1 codegen, but instead some mixture of Tier0, Tier0 + instrumentation, OSR, and/or R2R code.

System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple)

Win-x64 only


This one repros

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
  [Host]     : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
  Job-RBZAAU : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-AJWTXX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

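| Method | Job | Toolchain | Segment | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Slice_Repeat | Job-RBZAAU | \base-rel\corerun.exe | Multiple | 32.03 ns | 0.020 ns | 0.017 ns | 32.03 ns | 32.01 ns | 32.06 ns | 1.00 | - | NA |
| Slice_Repeat | Job-AJWTXX | \diff-rel\corerun.exe | Multiple | 43.23 ns | 0.274 ns | 0.243 ns | 43.16 ns | 42.93 ns | 43.75 ns | 1.35 | - | NA |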
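
For reference, the Ratio column in the results above can be reproduced from the two Mean values. This is a small Python sketch; the variable names are illustrative only.

```python
# Mean times for Slice_Repeat from the BenchmarkDotNet results above.
base_mean_ns = 32.03  # Job-RBZAAU (base-rel corerun)
diff_mean_ns = 43.23  # Job-AJWTXX (diff-rel corerun)

# Ratio of diff to base; values above 1.0 mean the diff build is slower.
ratio = diff_mean_ns / base_mean_ns
slowdown_pct = (ratio - 1.0) * 100.0

print(f"ratio: {ratio:.2f}, slowdown: {slowdown_pct:.1f}%")
```

The computed ratio rounds to 1.35, in line with the per-configuration results of 1.34 and 1.32 listed for this benchmark in the table.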

Issue here seems to be that with PGO we mark a call site that takes V05 (struct local) as rare and don’t inline it, and so V05 ends up getting address exposed and has more expensive copy semantics.

@egorbo example where not doing an inline in a cold block impacts codegen in a hot block.

base
01.75%   8.8E+05     ?        Unknown
56.39%   2.828E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].Slice(int64,int64)
13.86%   6.95E+06    Tier-1   [MicroBenchmarks]ReadOnlySequence.Slice_Repeat()
10.91%   5.47E+06    Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
09.71%   4.87E+06    native   coreclr.dll
06.28%   3.15E+06    Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].SeekMultiSegment(class System.Buffers.ReadOnlySequenceSegment`1<!0>,class System.Object,int32,int64,value class System.ExceptionArgument)
00.62%   3.1E+05     Tier-1   [7c853c35-6121-4c85-8327-3f1f8585f3b1]Runnable_0.WorkloadActionUnroll(int64)
00.28%   1.4E+05     native   clrjit.dll
00.12%   6E+04       native   ntoskrnl.exe
00.06%   3E+04       native   ntdll.dll

diff

00.69%   3.5E+05     ?        Unknown
74.57%   3.797E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].Slice(int64,int64)
11.82%   6.02E+06    Tier-1   [MicroBenchmarks]ReadOnlySequence.Slice_Repeat()
05.32%   2.71E+06    native   coreclr.dll
03.69%   1.88E+06    Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].SeekMultiSegment(class System.Buffers.ReadOnlySequenceSegment`1<!0>,class System.Object,int32,int64,value class System.ExceptionArgument)
03.18%   1.62E+06    Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
00.37%   1.9E+05     native   ntoskrnl.exe
00.26%   1.3E+05     native   clrjit.dll
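
To see where the extra time goes, the base and diff sample counts above can be compared per method. The Python sketch below is only a rough guide: the dictionary keys are shortened labels chosen for this example, and the raw sample counts are not normalized per benchmark operation.

```python
# Sample counts copied from the base and diff profiles above.
# Keys are shortened labels used only for this illustration.
base_samples = {
    "ReadOnlySequence.Slice": 2.828e7,
    "Slice_Repeat": 6.95e6,
    "ChkCastClassSpecial": 5.47e6,
    "coreclr.dll": 4.87e6,
    "SeekMultiSegment": 3.15e6,
}
diff_samples = {
    "ReadOnlySequence.Slice": 3.797e7,
    "Slice_Repeat": 6.02e6,
    "ChkCastClassSpecial": 1.62e6,
    "coreclr.dll": 2.71e6,
    "SeekMultiSegment": 1.88e6,
}

def sample_ratios(base, diff):
    """Return diff/base sample ratio for every entry present in both profiles."""
    return {name: diff[name] / base[name] for name in base if name in diff}

ratios = sample_ratios(base_samples, diff_samples)
# Print entries from largest relative growth to largest relative shrink.
for name, r in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {r:.2f}")
```

Of the entries compared, only Slice has more samples in the diff profile than in base (a ratio of about 1.34), which fits the observation that the address-exposed V05 makes the code in Slice more expensive.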