runtime: Potential performance regression in OrdinalIgnoreCase string comparison
Originally detected by the bot in https://github.com/DrewScoggins/performance-2/issues/4549 but not reported in dotnet/runtime (cc @DrewScoggins)
`System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, OrdinalIgnoreCase))`
Result | Base | Diff | Ratio (Base/Diff) | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V
---|---|---|---|---|---|---|---|---|---|---
Same | 978.46 | 1223.95 | 0.80 | +0 | bimodal | Windows 10.0.19043.1165 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701
Slower | 1405.76 | 1696.68 | 0.83 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1390.04 | 1662.75 | 0.84 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1228.43 | 1642.76 | 0.75 | +0 | | Windows 10.0.18363.1621 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 1689.86 | 2838.86 | 0.60 | +0 | | Windows 8.1 | X64 | Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) | 5.0.921.35908 | 6.0.21.45401
Slower | 1366.24 | 1803.72 | 0.76 | +0 | | Windows 10.0.19042.685 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701
Slower | 1014.33 | 1552.26 | 0.65 | +0 | | Windows 10.0.19043.1165 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 5.0.921.35908 | 6.0.21.41701
Slower | 1355.82 | 1964.26 | 0.69 | +0 | | Windows 10.0.22454 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 5.0.921.35908 | 6.0.21.41701
Slower | 858.70 | 1325.30 | 0.65 | +0 | | Windows 10.0.22451 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.921.35908 | 6.0.21.41701
Slower | 984.06 | 1414.77 | 0.70 | +0 | | Windows 10.0.19042.1165 | X64 | Intel Core i9-9900T CPU 2.10GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 4173.22 | 6321.15 | 0.66 | +0 | | Windows 7 SP1 | X64 | Intel Core2 Duo CPU T9600 2.80GHz | 5.0.721.25508 | 6.0.21.41701
Slower | 1394.92 | 1628.06 | 0.86 | +0 | | centos 8 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1371.22 | 1630.47 | 0.84 | +0 | | debian 10 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1402.69 | 1653.62 | 0.85 | +0 | | rhel 7 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1401.21 | 1614.74 | 0.87 | +0 | | sles 15 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1378.12 | 1623.93 | 0.85 | +0 | | opensuse-leap 15.3 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701
Slower | 1020.15 | 1478.16 | 0.69 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701
Same | 1744.39 | 2038.20 | 0.86 | +0 | bimodal | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.921.35908 | 6.0.21.41701
Slower | 3021.64 | 3715.93 | 0.81 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.41701
Slower | 1809.60 | 2855.11 | 0.63 | +0 | | Windows 10.0.19043.1165 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 1987.02 | 2863.98 | 0.69 | +0 | | Windows 10.0.22000 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 989.54 | 1249.97 | 0.79 | +0 | several? | Windows 10.0.19043.1165 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701
Slower | 1446.38 | 2158.57 | 0.67 | +0 | | Windows 10.0.18363.1621 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 1816.65 | 2931.48 | 0.62 | +0 | | Windows 10.0.19043.1165 | Arm | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701
Slower | 1306.85 | 1927.69 | 0.68 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701
Slower | 1161.13 | 1668.36 | 0.70 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701
Slower | 1198.18 | 1747.52 | 0.69 | +0 | | macOS Big Sur 11.4 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701
Repro:
```cmd
git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net5.0 net6.0 --filter System.Globalization.Tests.StringEquality.Compare_Same_Upper
```
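For orientation, the measured operation is approximately the following; this is a sketch of what `Compare_Same_Upper` does based on its parameters above, not the actual benchmark source from dotnet/performance:

```csharp
using System;
using System.Globalization;

// Approximation of the measured operation (not the exact benchmark source):
// compare a 1024-character string against its upper-cased copy using the
// en-US CompareInfo with CompareOptions.OrdinalIgnoreCase.
string value = new string('a', 1024);
string sameUpper = value.ToUpperInvariant();
CompareInfo compareInfo = new CultureInfo("en-US").CompareInfo;

int result = compareInfo.Compare(value, sameUpper, CompareOptions.OrdinalIgnoreCase);
Console.WriteLine(result); // 0: equal when case is ignored
```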
A quick look at the historical data, and zooming in: (charts omitted)

It might have been caused by PGO (cc @AndyAyersMS @kunalspathak).
category:performance theme:benchmarks
A few more notes – as with the case @EgorBo noted above, looking at full PGO we see it helps the `12345` case but not the other two. So more mysteries here to sort out; even if we see (presumably) high-quality PGO data we don’t seem to benefit.

Default: (results omitted)

No PGO: (results omitted)

Full PGO: (results omitted)
I wonder how much of this might be that BDN doesn’t run the Tier0 code sufficient times or something similar…? Playing around with `--warmupCount` I get different results for Full PGO:

Full PGO + `--warmupCount 100`: (results omitted)
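(For reference, the same warmup bump can be applied from inside a benchmark via BenchmarkDotNet’s `WarmupCount` attribute; the class below is a hypothetical stand-in, not the actual `StringEquality` benchmark.)

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical benchmark: raises the warmup iteration count so that tiered
// compilation / PGO instrumentation has settled before measurement, which is
// the in-source equivalent of passing --warmupCount 100 on the command line.
[WarmupCount(100)]
public class IgnoreCaseCompareBenchmark
{
    private readonly string _value = new string('a', 1024);
    private readonly string _upper = new string('A', 1024);

    [Benchmark]
    public bool CompareSameUpper() =>
        string.Equals(_value, _upper, StringComparison.OrdinalIgnoreCase);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<IgnoreCaseCompareBenchmark>();
}
```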
However, that trick doesn’t help the `Compare_Same_Upper` performance: (results omitted)

There seem to be several issues in `Compare_Same_Upper`.

First is that the PGO training data does not see many cases of case-insensitive comparison. So in the compound expression
https://github.com/dotnet/runtime/blob/1b14c945a9c9ff8c90b8507fe8f0f9cccd20ad73/src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs#L28-L29
the first clause is true with likelihood 0.97. This means that the CSE of `charA | 0x20` is not deemed profitable by the JIT, as the second use is considered quite infrequent. In the benchmark, however, this code path is taken quite often, and so PGO imposes an extra cost. One possible fix here is to broaden the set of inputs we see during PGO training, though it’s also possible the likelihoods we see now are realistic.
We should arguably revisit the CSE costing heuristics as it does not seem like doing this extra CSE would actually cause any issues.
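To make the missed CSE concrete, here is a rough paraphrase of the pattern; it is illustrative only, and the linked `Ordinal.cs` lines remain the authoritative source:

```csharp
// Paraphrased shape of the per-character check discussed above; illustrative
// only, not the actual runtime source at the link.
internal static class OrdinalIgnoreCaseSketch
{
    // The `charA | 0x20` subexpression appears twice in the second clause.
    // Hoisting it into a temp is the CSE the JIT declines when PGO says the
    // first clause (`charA == charB`) is taken ~97% of the time.
    public static bool EqualsIgnoreCaseAscii(char charA, char charB)
    {
        return charA == charB
            || ((charA | 0x20) == (charB | 0x20)
                && (uint)((charA | 0x20) - 'a') <= (uint)('z' - 'a'));
    }

    // What performing the CSE amounts to in source form: compute the
    // "lowered" value once and reuse it.
    public static bool EqualsIgnoreCaseAsciiWithCse(char charA, char charB)
    {
        if (charA == charB)
            return true;

        int lowerA = charA | 0x20; // computed once, used twice below
        return lowerA == (charB | 0x20) && (uint)(lowerA - 'a') <= (uint)('z' - 'a');
    }
}
```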
Second is that PGO data leads to poor layout decisions. The JIT’s block layout algorithm is locally greedy, and this leads to globally sub-optimal layouts. One such example happens early in the method where there is a control-flow diamond; with PGO we see this diamond is biased, with one block at 0.81 likelihood. Another happens later in the code where a PGO-rare path in a loop is moved and requires several jumps to rejoin the flow. In general the JIT is too aggressive in moving lower-frequency blocks out of line; doing so inhibits opportunities for jump elimination later on.
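For readers unfamiliar with the term, a control-flow diamond is just a branch whose arms rejoin; a minimal, purely illustrative sketch (unrelated to the actual `Ordinal.cs` code) would be:

```csharp
// Illustrative only: a control-flow diamond, i.e. a branch whose two arms
// rejoin at a common successor. If profile data says one arm is taken ~80%
// of the time, a locally greedy layout pass may move the colder arm out of
// line, which can add jumps on the path that rejoins the flow.
internal static class DiamondSketch
{
    public static int Abs(int x)
    {
        int result;
        if (x >= 0)          // hot arm per (hypothetical) profile: ~0.8 likelihood
            result = x;
        else
            result = -x;     // cold arm, candidate for out-of-line placement
        return result;       // join point of the diamond
    }
}
```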
There is work anticipated here in .NET 7 but no easy fix in the meantime.
Third is that the early block reordering done by the JIT interferes with loop recognition, and the JIT mistakenly thinks there are two loops in the method (instead of one multi-exit loop). While concerning, it doesn’t seem to cause problems here, as the loops are not currently optimizable, but the fact that the JIT does the reordering so early is indicative of two shortcomings: (1) loop recognition is too pattern-sensitive; (2) optimizing block order should generally wait until later in the phase pipeline.
This is also something we’ve seen elsewhere and may try and address in .NET 7.
`Compare_DifferentFirstChar` hits these same code paths (though just one pass through, not many) and so likely has the same root cause.

`Compare_Same_Upper` performance is now back to where it was long ago; looks like https://github.com/dotnet/runtime/pull/85130 was the change responsible. Seems like this might not have been spotted by auto filing, going to double-check.

Ah, it was https://github.com/dotnet/perf-autofiling-issues/issues/12928 that we evidently never triaged.
Drilling into `Compare_Same_Upper` shows the comments above are still relevant (except that we no longer do early block reordering). The late block reordering scrambles the loop body, interposing non-loop blocks. This sort of reordering is risky unless you have very high confidence in your profile.
This is something I hope we can improve on in .NET 8.
I don’t see anything here we can address at this point in .NET 6, so am going to move this out of the 6.0 milestone.
The most interesting part is that FullPGO is still slower than NoPGO mode cc @AndyAyersMS (I’ve not investigated why yet):
Standalone repro:
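The original snippet is not reproduced above; a minimal sketch of the kind of standalone loop one might use (names, sizes, and the timing approach here are assumptions, not the actual repro) is:

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Illustrative standalone loop, not the original repro snippet. Run it once
// with "No PGO" settings and once with "Full PGO" settings (for example by
// setting DOTNET_TieredPGO, DOTNET_TC_QuickJitForLoops and DOTNET_ReadyToRun
// before launching the process) and compare the reported per-call times.
internal static class Program
{
    private static void Main()
    {
        string value = new string('a', 1024);        // Count: 1024, as in the benchmark
        string sameUpper = value.ToUpperInvariant();  // same content, different case
        CompareInfo compareInfo = new CultureInfo("en-US").CompareInfo;

        // Warm up so the method has fully tiered up before measurement.
        for (int i = 0; i < 100_000; i++)
            _ = compareInfo.Compare(value, sameUpper, CompareOptions.OrdinalIgnoreCase);

        const int iterations = 1_000_000;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            _ = compareInfo.Compare(value, sameUpper, CompareOptions.OrdinalIgnoreCase);
        sw.Stop();

        Console.WriteLine($"{sw.Elapsed.TotalMilliseconds * 1_000_000 / iterations:F2} ns per call");
    }
}
```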
Asmdiff (NoPGO is on the left): https://www.diffchecker.com/2guAUS5i