runtime: Potential performance regression in OrdinalIgnoreCase string comparison
Originally detected by the bot in https://github.com/DrewScoggins/performance-2/issues/4549 but not reported in dotnet/runtime (cc @DrewScoggins)
`System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, OrdinalIgnoreCase))`
Result | Base | Diff | Ratio (Base/Diff) | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V
---|---|---|---|---|---|---|---|---|---|---
Same | 978.46 | 1223.95 | 0.80 | +0 | bimodal | Windows 10.0.19043.1165 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701
Slower | 1405.76 | 1696.68 | 0.83 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1390.04 | 1662.75 | 0.84 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1228.43 | 1642.76 | 0.75 | +0 | | Windows 10.0.18363.1621 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 1689.86 | 2838.86 | 0.60 | +0 | | Windows 8.1 | X64 | Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) | 5.0.921.35908 | 6.0.21.45401
Slower | 1366.24 | 1803.72 | 0.76 | +0 | | Windows 10.0.19042.685 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701
Slower | 1014.33 | 1552.26 | 0.65 | +0 | | Windows 10.0.19043.1165 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 5.0.921.35908 | 6.0.21.41701
Slower | 1355.82 | 1964.26 | 0.69 | +0 | | Windows 10.0.22454 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 5.0.921.35908 | 6.0.21.41701
Slower | 858.70 | 1325.30 | 0.65 | +0 | | Windows 10.0.22451 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.921.35908 | 6.0.21.41701
Slower | 984.06 | 1414.77 | 0.70 | +0 | | Windows 10.0.19042.1165 | X64 | Intel Core i9-9900T CPU 2.10GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 4173.22 | 6321.15 | 0.66 | +0 | | Windows 7 SP1 | X64 | Intel Core2 Duo CPU T9600 2.80GHz | 5.0.721.25508 | 6.0.21.41701
Slower | 1394.92 | 1628.06 | 0.86 | +0 | | centos 8 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1371.22 | 1630.47 | 0.84 | +0 | | debian 10 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1402.69 | 1653.62 | 0.85 | +0 | | rhel 7 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1401.21 | 1614.74 | 0.87 | +0 | | sles 15 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1378.12 | 1623.93 | 0.85 | +0 | | opensuse-leap 15.3 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1020.15 | 1478.16 | 0.69 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701
Same | 1744.39 | 2038.20 | 0.86 | +0 | bimodal | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.921.35908 | 6.0.21.41701
Slower | 3021.64 | 3715.93 | 0.81 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.41701
Slower | 1809.60 | 2855.11 | 0.63 | +0 | | Windows 10.0.19043.1165 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 1987.02 | 2863.98 | 0.69 | +0 | | Windows 10.0.22000 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 989.54 | 1249.97 | 0.79 | +0 | several? | Windows 10.0.19043.1165 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701
Slower | 1446.38 | 2158.57 | 0.67 | +0 | | Windows 10.0.18363.1621 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 1816.65 | 2931.48 | 0.62 | +0 | | Windows 10.0.19043.1165 | Arm | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 1306.85 | 1927.69 | 0.68 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701
Slower | 1161.13 | 1668.36 | 0.70 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701
Slower | 1198.18 | 1747.52 | 0.69 | +0 | | macOS Big Sur 11.4 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701
Repro:
```cmd
git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net5.0 net6.0 --filter System.Globalization.Tests.StringEquality.Compare_Same_Upper
```
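For orientation, the measured operation is approximately the following; this is a sketch of what `Compare_Same_Upper` does based on its parameters above, not the actual benchmark source from dotnet/performance:

```csharp
using System;
using System.Globalization;

// Approximation of the measured operation (not the exact benchmark source):
// compare a 1024-character string against its upper-cased copy using the
// en-US CompareInfo with CompareOptions.OrdinalIgnoreCase.
string value = new string('a', 1024);
string sameUpper = value.ToUpperInvariant();
CompareInfo compareInfo = new CultureInfo("en-US").CompareInfo;

int result = compareInfo.Compare(value, sameUpper, CompareOptions.OrdinalIgnoreCase);
Console.WriteLine(result); // 0: equal when case is ignored
```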
A quick look at the historical data, and zooming in: (charts omitted)

It might have been caused by PGO (cc @AndyAyersMS @kunalspathak).
category:performance theme:benchmarks
A few more notes – as with the case @EgorBo noted above, looking at full PGO we see it helps the `12345` case but not the other two. So more mysteries here to sort out; even if we see (presumably) high-quality PGO data we don’t seem to benefit.

Default: (results omitted)

No PGO: (results omitted)

Full PGO: (results omitted)
I wonder how much of this might be that BDN doesn’t run the Tier0 code sufficient times or something similar…? Playing around with `--warmupCount` I get different results for Full PGO:

Full PGO + `--warmupCount 100`: (results omitted)
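(For reference, the same warmup bump can be applied from inside a benchmark via BenchmarkDotNet’s `WarmupCount` attribute; the class below is a hypothetical stand-in, not the actual `StringEquality` benchmark.)

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical benchmark: raises the warmup iteration count so that tiered
// compilation / PGO instrumentation has settled before measurement, which is
// the in-source equivalent of passing --warmupCount 100 on the command line.
[WarmupCount(100)]
public class IgnoreCaseCompareBenchmark
{
    private readonly string _value = new string('a', 1024);
    private readonly string _upper = new string('A', 1024);

    [Benchmark]
    public bool CompareSameUpper() =>
        string.Equals(_value, _upper, StringComparison.OrdinalIgnoreCase);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<IgnoreCaseCompareBenchmark>();
}
```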
However, that trick doesn’t help the `Compare_Same_Upper` performance: (results omitted)

There seem to be several issues in `Compare_Same_Upper`.

First is that the PGO training data does not see many cases of case-insensitive comparison. So in the compound expression
https://github.com/dotnet/runtime/blob/1b14c945a9c9ff8c90b8507fe8f0f9cccd20ad73/src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs#L28-L29
the first clause is true with likelihood 0.97. This means that the CSE of `charA | 0x20` is not deemed profitable by the JIT, as the second use is considered quite infrequent. In the benchmark, however, this code path is taken quite often, and so PGO imposes an extra cost. One possible fix here is to broaden the set of inputs we see during PGO training, though it’s also possible the likelihoods we see now are realistic.
We should arguably revisit the CSE costing heuristics as it does not seem like doing this extra CSE would actually cause any issues.
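To make the missed CSE concrete, here is a rough paraphrase of the pattern; it is illustrative only, and the linked `Ordinal.cs` lines remain the authoritative source:

```csharp
// Paraphrased shape of the per-character check discussed above; illustrative
// only, not the actual runtime source at the link.
internal static class OrdinalIgnoreCaseSketch
{
    // The `charA | 0x20` subexpression appears twice in the second clause.
    // Hoisting it into a temp is the CSE the JIT declines when PGO says the
    // first clause (`charA == charB`) is taken ~97% of the time.
    public static bool EqualsIgnoreCaseAscii(char charA, char charB)
    {
        return charA == charB
            || ((charA | 0x20) == (charB | 0x20)
                && (uint)((charA | 0x20) - 'a') <= (uint)('z' - 'a'));
    }

    // What performing the CSE amounts to in source form: compute the
    // "lowered" value once and reuse it.
    public static bool EqualsIgnoreCaseAsciiWithCse(char charA, char charB)
    {
        if (charA == charB)
            return true;

        int lowerA = charA | 0x20; // computed once, used twice below
        return lowerA == (charB | 0x20) && (uint)(lowerA - 'a') <= (uint)('z' - 'a');
    }
}
```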
Second is that PGO data leads to poor layout decisions. The JIT’s block layout algorithm is locally greedy, and this leads to globally sub-optimal layouts. One such example happens early in the method where there is a control-flow diamond; with PGO we see this diamond is biased, with one block at 0.81 likelihood. Another happens later in the code where a PGO-rare path in a loop is moved and requires several jumps to rejoin the flow. In general the JIT is too aggressive in moving lower-frequency blocks out of line; doing so inhibits opportunities for jump elimination later on.
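For readers unfamiliar with the term, a control-flow diamond is just a branch whose arms rejoin; a minimal, purely illustrative sketch (unrelated to the actual `Ordinal.cs` code) would be:

```csharp
// Illustrative only: a control-flow diamond, i.e. a branch whose two arms
// rejoin at a common successor. If profile data says one arm is taken ~80%
// of the time, a locally greedy layout pass may move the colder arm out of
// line, which can add jumps on the path that rejoins the flow.
internal static class DiamondSketch
{
    public static int Abs(int x)
    {
        int result;
        if (x >= 0)          // hot arm per (hypothetical) profile: ~0.8 likelihood
            result = x;
        else
            result = -x;     // cold arm, candidate for out-of-line placement
        return result;       // join point of the diamond
    }
}
```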
There is work anticipated here in .NET 7 but no easy fix in the meantime.
Third is that the early block reordering done by the JIT interferes with loop recognition, and the JIT mistakenly thinks there are two loops in the method (instead of one multi-exit loop). While concerning, it doesn’t seem to cause problems here, as the loops are not currently optimizable, but the fact that the JIT does the reordering so early is indicative of two shortcomings: (1) loop recognition is too pattern-sensitive; (2) optimizing block order should generally wait until later in the phase pipeline.
This is also something we’ve seen elsewhere and may try and address in .NET 7.
`Compare_DifferentFirstChar` hits these same code paths (though just one pass through, not many) and so likely has the same root cause.

`Compare_Same_Upper` performance is now back to where it was long ago; looks like https://github.com/dotnet/runtime/pull/85130 was the change responsible. Seems like this might not have been spotted by auto filing, going to double-check.

Ah, it was https://github.com/dotnet/perf-autofiling-issues/issues/12928 that we evidently never triaged.
Drilling into `Compare_Same_Upper` shows the comments above are still relevant (except that we no longer do early block reordering). The late block reordering scrambles the loop body, interposing non-loop blocks. This sort of reordering is risky unless you have very high confidence in your profile.
This is something I hope we can improve on in .NET 8.
I don’t see anything here we can address at this point in .NET 6, so am going to move this out of the 6.0 milestone.
The most interesting part is that FullPGO is still slower than NoPGO mode cc @AndyAyersMS (I’ve not investigated why yet):
Standalone repro:
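The original snippet is not reproduced above; a minimal sketch of the kind of standalone loop one might use (names, sizes, and the timing approach here are assumptions, not the actual repro) is:

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Illustrative standalone loop, not the original repro snippet. Run it once
// with "No PGO" settings and once with "Full PGO" settings (for example by
// setting DOTNET_TieredPGO, DOTNET_TC_QuickJitForLoops and DOTNET_ReadyToRun
// before launching the process) and compare the reported per-call times.
internal static class Program
{
    private static void Main()
    {
        string value = new string('a', 1024);        // Count: 1024, as in the benchmark
        string sameUpper = value.ToUpperInvariant();  // same content, different case
        CompareInfo compareInfo = new CultureInfo("en-US").CompareInfo;

        // Warm up so the method has fully tiered up before measurement.
        for (int i = 0; i < 100_000; i++)
            _ = compareInfo.Compare(value, sameUpper, CompareOptions.OrdinalIgnoreCase);

        const int iterations = 1_000_000;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            _ = compareInfo.Compare(value, sameUpper, CompareOptions.OrdinalIgnoreCase);
        sw.Stop();

        Console.WriteLine($"{sw.Elapsed.TotalMilliseconds * 1_000_000 / iterations:F2} ns per call");
    }
}
```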
Asmdiff (NoPGO is on the left): https://www.diffchecker.com/2guAUS5i