runtime: Dynamic PGO Microbenchmark Regressions
This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of https://github.com/dotnet/runtime/issues/84264 which tracked regressions from PGO before it was enabled.
The report below is collated from the following autofiling reports.
- https://github.com/dotnet/perf-autofiling-issues/issues/17979
- https://github.com/dotnet/perf-autofiling-issues/issues/17982
- https://github.com/dotnet/perf-autofiling-issues/issues/17994
- https://github.com/dotnet/perf-autofiling-issues/issues/18096
- https://github.com/dotnet/perf-autofiling-issues/issues/18097
- https://github.com/dotnet/perf-autofiling-issues/issues/18103
- https://github.com/dotnet/perf-autofiling-issues/issues/18109
- https://github.com/dotnet/perf-autofiling-issues/issues/18111
- https://github.com/dotnet/perf-autofiling-issues/issues/18139
- https://github.com/dotnet/perf-autofiling-issues/issues/18151
- https://github.com/dotnet/perf-autofiling-issues/issues/18582
The table is auto-generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The “Score” is the geomean regression across all architectures; benchmarks that did not regress (or did not get reported) on some architectures are assumed to have produced the same results with and without PGO. “Recent Score” is the current performance (as of 2023-06-06) versus the non-PGO result; “Orig Score” is based on the results of auto filing. They will differ if benchmark performance has improved or regressed since the auto filing ran (see for example the results for System.Text.Json.Tests.Perf_Get.GetByte, which has improved already).
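For readers who want to sanity-check a score, here is a minimal Python sketch of the scoring as described above. It is not the actual tool: the function name `geomean_score` and the dictionary layout are hypothetical, each ratio is assumed to be the PGO result divided by the non-PGO result, and the six `GetSByte` values are assumed to appear in the order the arch/os names are listed in the table header.

```python
import math

# Arch/os names as listed in the table header.
ARCHS = [
    "arm64-lin-ampere", "arm64-win-surface", "arm64-win-ampere",
    "x64-lin-intel", "x64-win-intel", "x64-win-amd",
]

def geomean_score(ratios, archs=ARCHS):
    """Geometric mean of per-arch regression ratios.

    `ratios` maps arch name -> ratio (assumed: PGO / non-PGO, > 1 is slower).
    An arch with no reported result counts as 1.0, i.e. "same with and
    without PGO", following the description in the issue text.
    """
    logs = [math.log(ratios.get(arch, 1.0)) for arch in archs]
    return math.exp(sum(logs) / len(logs))

# "Recent" values from the System.Text.Json.Tests.Perf_Get.GetSByte row.
get_sbyte_recent = dict(zip(ARCHS, [1.27, 1.28, 1.24, 2.09, 2.25, 1.86]))

score = geomean_score(get_sbyte_recent)
print(f"{score:.2f}")  # prints 1.61, matching the Recent Score in the table
```

The geometric mean is used rather than the arithmetic mean because the inputs are ratios: a 2x regression on one arch and a 2x improvement on another should cancel to 1.0.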
Only the 36 entries with recent scores >= 1.3 are included; this leaves off approximately 220 more rows with scores below 1.3. Our plan is to prioritize investigation of these benchmarks initially, as they have the largest aggregate regressions. If time permits, we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle.
Each arch/os result is a hyperlink to the performance data graph for that benchmark. ~~Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.~~
[edit: had to regenerate the table once already, as the scoring logic was off] [edit: have x64 win intel data now, new table. Note current results have shifted so table is somewhat different…]
Notes | Recent Score | Orig Score | Results per arch/os as "recent orig" pairs (arm64-lin-ampere, arm64-win-surface, arm64-win-ampere, x64-lin-intel, x64-win-intel, x64-win-amd) | Benchmark |
---|---|---|---|---|
noise | 3.38 | 1.37 | 3.37 1.36 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: “zqj”, Options: None) |
noise | 3.36 | 1.37 | 3.36 1.37 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: “zqj”, Options: NonBacktracking) |
notes | 2.71 | 3.39 | 2.71 3.39 | System.Memory.Span(Int32).EndsWith(Size: 4) |
likely same as above | 2.62 | 3.03 | 2.55 2.27, 2.59 3.04 | System.Memory.Span(Int32).SequenceEqual(Size: 4) |
likely same as above | 1.87 | 1.76 | 1.87 1.76 | System.Memory.Span(Int32).SequenceCompareToDifferent(Size: 512) |
(lack of) if conversion | 1.82 | 1.80 | 1.67 1.63, 1.93 1.92, 1.86 1.85 | System.Tests.Perf_Random.NextSingle |
budget | 1.75 | 1.88 | 1.33 1.47, 1.35 1.49, 1.90 1.99, 2.29 2.43, 2.10 2.19 | System.Text.Json.Tests.Perf_Get.GetInt16 |
BDN | 1.73 | 2.81 | 3.55 3.54, 1.89 2.00, 1.28 4.73, 1.32 2.01, 1.39 2.68 | System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000) |
notes | 1.64 | 1.63 | 1.84 1.82, 1.65 1.64 | System.Tests.Perf_UInt32.TryParseHex(value: “0”) |
budget | 1.61 | 1.70 | 1.27 1.44, 1.28 1.46, 1.24 1.18, 2.09 2.17, 2.25 2.33, 1.86 1.94 | System.Text.Json.Tests.Perf_Get.GetSByte |
bimodal | 1.61 | 1.59 | 1.60 1.58 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: “Sherlock Holmes”, Options: Compiled) |
cast expansion | 1.60 | 1.64 | 1.82 1.87, 1.41 1.43 | System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstSingleSegment |
cast expansion | 1.58 | 1.62 | 1.58 1.62 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSpanTenSegments |
cast expansion | 1.52 | 1.65 | 1.48 1.81, 1.56 1.50 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSingleSegment |
cast expansion | 1.50 | 1.73 | 1.88 2.13, 1.20 1.41 | System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstTenSegments |
likely same as span cases above | 1.48 | 1.28 | 1.48 1.28 | System.Memory.Span(Int32).Reverse(Size: 4) |
cast expansion | 1.47 | 1.44 | 1.47 1.44 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSpanSingleSegment |
notes | 1.47 | 1.42 | 1.46 1.42 | Benchstone.BenchF.InvMt.Test |
unclear | 1.46 | 1.15 | 1.46 1.15 | MicroBenchmarks.Serializers.Json_FromStream(MyEventsListerViewModel).DataContractJsonSerializer_ |
fixed itself | 1.45 | 1.09 | 1.45 1.09 | System.Tests.Perf_Uri.EscapeDataString(input: "{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{ |
unclear | 1.44 | 1.44 | 1.44 1.44 | Burgers.Test1 |
unclear | 1.43 | 1.27 | 1.43 1.27 | System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfNumbers) |
unclear, linux arm64 only | 1.41 | 1.58 | 1.41 1.58 | System.Text.Tests.Perf_StringBuilder.Append_Char_Capacity(length: 100000) |
unclear, linux arm64 only | 1.39 | 1.62 | 1.39 1.62 | BenchmarksGame.RegexRedux_5.RunBench(options: Compiled) |
bimodal | 1.39 | 1.39 | 1.39 1.39 | System.MathBenchmarks.Single.Min |
bimodal | 1.39 | 1.39 | 1.39 1.39 | System.MathBenchmarks.Single.Max |
unclear, linux arm64 only | 1.39 | 1.32 | 1.39 1.32 | System.IO.Pipes.Tests.Perf_NamedPipeStream.ReadWriteAsync(size: 1000000, Options: Asynchronous) |
noise | 1.38 | 1.29 | 1.38 1.29 | System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000) |
bimodal | 1.37 | 1.37 | 1.37 1.37 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: “zqj”, Options: Compiled) |
notes | 1.37 | 1.36 | 1.26 1.29, 1.42 1.43, 1.24 1.28, 1.60 1.48 | System.Collections.Sort(IntStruct).Array(Size: 512) |
budget | 1.36 | 1.93 | 1.15 1.56, 1.15 1.58, 1.27 1.66, 1.42 2.14, 1.80 2.67, 1.49 2.24 | System.Text.Json.Tests.Perf_Get.GetByte |
noise | 1.35 | 1.31 | 1.36 1.33 | System.Memory.Span(Char).IndexOfAnyTwoValues(Size: 512) |
arm64 only; ldar vs dmb | 1.35 | 1.36 | 1.35 1.34, 1.38 1.40 | System.Collections.CtorFromCollection(Int32).ConcurrentBag(Size: 512) |
fixed by physical promotion | 1.35 | 1.36 | 1.35 1.36 | Devirtualization.EqualityComparer.ValueTupleCompareWrapped |
budget | 1.34 | 1.42 | 1.42 1.26, 1.28 1.38, 1.35 1.44, 1.35 1.42, 1.35 1.55, 1.31 1.46 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToStream(Mode: SourceGen) |
notes | 1.34 | 1.45 | 1.18 1.29, 1.40 1.44, 1.13 1.41, 1.71 1.71 | System.Collections.Sort(IntStruct).List(Size: 512) |
notes | 1.33 | 1.33 | 1.33 1.33 | System.Tests.Perf_HashCode.Combine_1 |
inlining different; exposed local | 1.33 | 1.32 | 1.34 1.33, 1.32 1.32 | System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple) |
notes | 1.33 | 1.18 | 1.33 1.18 | System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfStrings) |
budget | 1.32 | 1.37 | 1.24 1.39, 1.20 1.28, 1.37 1.15, 1.39 1.46, 1.45 1.57, 1.27 1.39 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToWriter(Mode: SourceGen) |
budget | 1.32 | 1.39 | 1.37 1.28, 1.22 1.42, 1.34 1.31, 1.32 1.38, 1.30 1.50, 1.34 1.38 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToUtf8Bytes(Mode: SourceGen) |
budget | 1.31 | 1.88 | 1.15 1.59, 1.18 1.62, 1.03 1.37, 1.49 2.22, 1.66 2.49, 1.49 2.24 | System.Text.Json.Tests.Perf_Get.GetUInt16 |
budget | 1.31 | 1.33 | 1.38 1.25, 1.20 1.23, 1.23 1.26, 1.35 1.46, 1.40 1.40, 1.41 1.43 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToString(Mode: SourceGen) |
jcc errata | 1.31 | 1.39 | 1.31 1.39 | Span.Sorting.QuickSortSpan(Size: 512) |
lack of cold inline exposes local | 1.31 | 1.29 | 1.31 1.31, 1.31 1.27 | System.Memory.ReadOnlySequence.Slice_Start_And_Length(Segment: Multiple) |
budget | 1.31 | 1.39 | 1.32 1.19, 1.20 1.50, 1.31 1.37, 1.40 1.50, 1.31 1.34 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeObjectProperty(Mode: SourceGen) |
lack of ldapr | 1.30 | 1.30 | 1.29 1.30, 1.30 1.30 | System.Collections.CtorFromCollection(String).ConcurrentBag(Size: 512) |
Also note that bubble sort runs for a very long time, and so likely BDN + lab customization is not reliably measuring the Tier1 codegen, but instead some mixture of Tier0, Tier0 + instrumentation, OSR, and/or R2R code.
System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple)
Win-x64 only
This one repros
BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
  [Host]     : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
  Job-RBZAAU : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-AJWTXX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1
Issue here seems to be that with PGO we mark a call site that takes V05 (struct local) as rare and don’t inline it, and so V05 ends up getting address exposed and has more expensive copy semantics.
@egorbo example where not doing an inline in a cold block impacts codegen in a hot block.