runtime: Potential performance regression in OrdinalIgnoreCase string comparision

Originally detected by the bot in https://github.com/DrewScoggins/performance-2/issues/4549 but not reported in dotnet/runtime (cc @DrewScoggins)

System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, OrdinalIgnoreCase))

Result Base Diff Ratio Alloc Delta Modality Operating System Bit Processor Name Base V Diff V
Same 978.46 1223.95 0.80 +0 bimodal Windows 10.0.19043.1165 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores 5.0.921.35908 6.0.21.41701
Slower 1405.76 1696.68 0.83 +0 Windows 10.0.20348 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1390.04 1662.75 0.84 +0 Windows 10.0.20348 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1228.43 1642.76 0.75 +0 Windows 10.0.18363.1621 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Slower 1689.86 2838.86 0.60 +0 Windows 8.1 X64 Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) 5.0.921.35908 6.0.21.45401
Slower 1366.24 1803.72 0.76 +0 Windows 10.0.19042.685 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 5.0.921.35908 6.0.21.41701
Slower 1014.33 1552.26 0.65 +0 Windows 10.0.19043.1165 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake) 5.0.921.35908 6.0.21.41701
Slower 1355.82 1964.26 0.69 +0 Windows 10.0.22454 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) 5.0.921.35908 6.0.21.41701
Slower 858.70 1325.30 0.65 +0 Windows 10.0.22451 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) 5.0.921.35908 6.0.21.41701
Slower 984.06 1414.77 0.70 +0 Windows 10.0.19042.1165 X64 Intel Core i9-9900T CPU 2.10GHz 5.0.921.35908 6.0.21.41701
Slower 4173.22 6321.15 0.66 +0 Windows 7 SP1 X64 Intel Core2 Duo CPU T9600 2.80GHz 5.0.721.25508 6.0.21.41701
Slower 1394.92 1628.06 0.86 +0 centos 8 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1371.22 1630.47 0.84 +0 debian 10 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1402.69 1653.62 0.85 +0 rhel 7 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1401.21 1614.74 0.87 +0 sles 15 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1378.12 1623.93 0.85 +0 opensuse-leap 15.3 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1020.15 1478.16 0.69 +0 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Same 1744.39 2038.20 0.86 +0 bimodal alpine 3.13 X64 Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) 5.0.921.35908 6.0.21.41701
Slower 3021.64 3715.93 0.81 +0 ubuntu 16.04 Arm64 Unknown processor 5.0.421.11614 6.0.21.41701
Slower 1809.60 2855.11 0.63 +0 Windows 10.0.19043.1165 Arm64 Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Slower 1987.02 2863.98 0.69 +0 Windows 10.0.22000 Arm64 Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Slower 989.54 1249.97 0.79 +0 several? Windows 10.0.19043.1165 X86 AMD Ryzen Threadripper PRO 3945WX 12-Cores 5.0.921.35908 6.0.21.41701
Slower 1446.38 2158.57 0.67 +0 Windows 10.0.18363.1621 X86 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Slower 1816.65 2931.48 0.62 +0 Windows 10.0.19043.1165 Arm Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Slower 1306.85 1927.69 0.68 +0 macOS Big Sur 11.5.2 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell) 5.0.921.35908 6.0.21.41701
Slower 1161.13 1668.36 0.70 +0 macOS Big Sur 11.5.2 X64 Intel Core i7-4870HQ CPU 2.50GHz (Haswell) 5.0.921.35908 6.0.21.41701
Slower 1198.18 1747.52 0.69 +0 macOS Big Sur 11.4 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 5.0.921.35908 6.0.21.41701

Repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net5.0 net6.0 --filter System.Globalization.Tests.StringEquality.Compare_Same_Upper

A quick look at the historical data

image

and zooming:

image

It might have been caused by PGO (cc @AndyAyersMS @kunalspathak):

https://github.com/dotnet/runtime/compare/f64246ce08fb7a58221b2b7c8e68f69c02522b0d...25f1800f1444a012217911e6f9522cabeb7672d1

category:performance theme:benchmarks

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

A few more notes – as with the case @EgorBo noted above

Method Job EnvironmentVariables Count Options Mean Error StdDev
Compare_Same_Upper Job-OOQIED DOTNET_JitDisablePgo=1 1024 (en-U(…)Case) [26] 917.7 ns 13.85 ns 15.95 ns
Compare_Same_Upper Job-LRXJIG DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1 1024 (en-U(…)Case) [26] 1,509.1 ns 10.26 ns 10.98 ns
Compare_Same_Upper Job-SQKDSE Empty 1024 (en-U(…)Case) [26] 1,373.2 ns 14.82 ns 15.85 ns

Looking at full pgo, we see it helps the 12345 case but not the other two. So more mysteries here to sort out; even if we see (presumably) high quality PGO data we don’t seem to benefit.

Default

Method value Mean Error StdDev Median Min Max Allocated
FormatterInt64 -9223372036854775808 40.49 ns 0.316 ns 0.280 ns 40.41 ns 40.23 ns 41.20 ns -
FormatterInt64 12345 18.96 ns 0.163 ns 0.152 ns 18.88 ns 18.80 ns 19.26 ns -
FormatterInt64 9223372036854775807 49.86 ns 0.152 ns 0.127 ns 49.82 ns 49.74 ns 50.14 ns -

No PGO

Method value Mean Error StdDev Median Min Max Allocated
FormatterInt64 -9223372036854775808 38.07 ns 0.104 ns 0.082 ns 38.04 ns 37.98 ns 38.22 ns -
FormatterInt64 12345 16.38 ns 0.142 ns 0.126 ns 16.37 ns 16.15 ns 16.59 ns -
FormatterInt64 9223372036854775807 48.11 ns 0.169 ns 0.141 ns 48.11 ns 47.91 ns 48.39 ns -

Full PGO

Method value Mean Error StdDev Median Min Max Allocated
FormatterInt64 -9223372036854775808 42.35 ns 0.523 ns 0.489 ns 42.48 ns 41.27 ns 42.98 ns -
FormatterInt64 12345 13.53 ns 0.723 ns 0.832 ns 13.68 ns 10.58 ns 14.48 ns -
FormatterInt64 9223372036854775807 52.26 ns 1.132 ns 1.059 ns 51.95 ns 50.32 ns 53.92 ns -

I wonder how much of this might be that BDN doesn’t run the Tier0 code sufficient times or something similar…? Playing around with --warmupCount I get different results for Full PGO:

Full PGO + --warmupCount 100

Method value Mean Error StdDev Median Min Max Allocated
FormatterInt64 -9223372036854775808 36.472 ns 0.2137 ns 0.1895 ns 36.347 ns 36.289 ns 36.814 ns -
FormatterInt64 12345 9.882 ns 0.0709 ns 0.0629 ns 9.878 ns 9.766 ns 9.994 ns -
FormatterInt64 9223372036854775807 43.894 ns 0.1425 ns 0.1263 ns 43.893 ns 43.696 ns 44.157 ns -

however that trick doesn’t help the Compare_Same_Upper performance:

Method Job EnvironmentVariables WarmupCount Count Options Mean Error StdDev
Compare_Same_Upper Job-ESPZVK DOTNET_JitDisablePgo=1 Default 1024 (en-U(…)Case) [26] 876.6 ns 9.96 ns 11.47 ns
Compare_Same_Upper Job-MAFOCS DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1 Default 1024 (en-U(…)Case) [26] 1,500.9 ns 5.40 ns 6.00 ns
Compare_Same_Upper Job-TMUCWV DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1 100 1024 (en-U(…)Case) [26] 1,510.0 ns 9.58 ns 11.03 ns
Compare_Same_Upper Job-ZEOHZJ Empty Default 1024 (en-U(…)Case) [26] 1,321.5 ns 18.99 ns 21.11 ns

There seem to be several issues in Compare_Same_Upper.

First is that the PGO training data does not see many cases of case-insensitive comparison. So in the compound expression

https://github.com/dotnet/runtime/blob/1b14c945a9c9ff8c90b8507fe8f0f9cccd20ad73/src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs#L28-L29

The first clause is true with likelihood 0.97. This means that the CSE charA | 0x20 is not deemed profitable by the jit as the second use is considered quite infrequent.

In the benchmark however this code path is taken quite often and so PGO imposes an extra cost. One possible fix here is to broaden the set of inputs we see during PGO training, though it’s also possible the likelihoods we see now are realistic.

We should arguably revisit the CSE costing heuristics as it does not seem like doing this extra CSE would actually cause any issues.

Second is that PGO data leads to poor layout decisions. The jit’s block layout algorithm is locally greedy and this leads to globally sub-optimal layouts. One such example happens early in the method where there is a control flow diamond; with PGO we see this diamond is biased with one block at 0.81 likelihood. Another happens later in the code where a PGO-rare path in a loop is moved and requires several jumps to rejoin the flow. In general the JIT is too aggressive in moving lower-frequency blocks out of line, doing so inhibits opportunities for jump elimination later on.

There is work anticipated here in .NET 7 but no easy fix in the meantime.

Third is that the early block reordering done by the JIT interferes with loop recognition, and the JIT mistakenly thinks there are two loops in the method (instead of one multi-exit loop). While concerning, it doesn’t seem to cause problems here as the loops are not currently optimizable but the fact that the JIT does the reordering so early is indicative of two shortcomings: (1) loop recognition is too pattern sensitive; (2) optimizing block order should generally wait until later in the phase pipeline.

This is also something we’ve seen elsewhere and may try and address in .NET 7.

Compare_DifferentFirstChar hits these same code paths (though just one pass through, not many) and so likely has the same root cause.

Compare_Same_Upper performance now back to where it was long ago, looks like https://github.com/dotnet/runtime/pull/85130 was the change responsible newplot - 2023-04-22T080252 554

Seems like this might not have been spotted by auto filing, going to double-check.

A it was https://github.com/dotnet/perf-autofiling-issues/issues/12928 that we evidently never triaged.

Drilling into Compare_Same_Upper shows the comments above are still relevant (except that we no longer do early block reordering).

The late block reordering scrambles the loop body, interposing non-loop blocks. This sort of reordering is risky unless you have very high confidence in your profile.

This is something I hope we can improve on in .NET 8.

I don’t see anything here we can address at this point in .NET 6, so am going to move this out of the 6.0 milestone.

The most interesting part is that FullPGO is still slower than NoPGO mode cc @AndyAyersMS (I’ve not investigated why yet):


|             Method |              Job |         EnvironmentVariables | Count |              Options |     Mean |    Error |  StdDev |
|------------------- |----------------- |----------------------------- |------ |--------------------- |---------:|---------:|--------:|
| Compare_Same_Upper |          FullPGO | DOTNET_TC_QuickJitForLoops=1 |  1024 | (en-U(...)Case) [26] | 908.5 ns |  2.94 ns | 2.45 ns |
| Compare_Same_Upper |           No PGO |       DOTNET_JitDisablePgo=1 |  1024 | (en-U(...)Case) [26] | 869.2 ns |  9.12 ns | 8.53 ns |
| Compare_Same_Upper |          Default |                        Empty |  1024 | (en-U(...)Case) [26] | 948.0 ns | 10.51 ns | 9.83 ns |

Standalone repro:


// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

namespace System.Globalization.Tests
{
    [Config(typeof(ConfigWithCustomEnvVars))]
    [DisassemblyDiagnoser(maxDepth: 6, exportDiff: true)]
    public class StringEquality
    {
        private class ConfigWithCustomEnvVars : ManualConfig
        {
            private const string JitNoInline = "COMPlus_JitNoInline";

            public ConfigWithCustomEnvVars()
            {
                AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithId("Default"));
                AddJob(Job.Default.WithRuntime(CoreRuntime.Core60)
                    .WithEnvironmentVariables(new EnvironmentVariable("DOTNET_JitDisablePgo", "1"))
                    .WithId("No PGO"));
            }
        }

        private string _value, _same, _sameUpper, _diffAtFirstChar;

        public static IEnumerable<(CultureInfo CultureInfo, CompareOptions CompareOptions)> GetOptions()
        {
            yield return (new CultureInfo("en-US"), CompareOptions.OrdinalIgnoreCase);
        }

        [ParamsSource(nameof(GetOptions))]
        public (CultureInfo CultureInfo, CompareOptions CompareOptions) Options;

        [Params(1024)] // single execution path = single test case
        public int Count;

        [GlobalSetup]
        public void Setup()
        {
            // we are using part of Alice's Adventures in Wonderland text as test data
            char[] characters = File.ReadAllText(@"path\to\alice29.txt").Take(Count).ToArray();
            _value = new string(characters);
            _same = new string(characters);
            _sameUpper = _same.ToUpper();
            char[] copy = characters.ToArray();
            copy[0] = (char)(copy[0] + 1);
            _diffAtFirstChar = new string(copy);
        }

        [Benchmark] // the most work to do for IgnoreCase: every char needs to be compared and uppercased
        public int Compare_Same_Upper() => Options.CultureInfo.CompareInfo.Compare(_value, _sameUpper, Options.CompareOptions);

        public static void Main() => BenchmarkRunner.Run<StringEquality>();
    }
}

Asmdiff (NoPGO is on the left): https://www.diffchecker.com/2guAUS5i