runtime: Performance Regression: UTF8 Span Methods 3x slower on AMD Platforms (NetCore2.1 -> 3.1)
Performance of some UTF8 string methods seems to have regressed in a major way on AMD platforms in .NET Core 3.1. Others may be affected as well.
Test:

using System;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.NetCoreApp21)]
[SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.NetCoreApp31)]
public class Benchmark
{
    const string testString = "hello world";
    private static readonly Encoding Utf8NoBom = new UTF8Encoding(false);

    private readonly byte[] writeBuffer = new byte[1024];
    private readonly byte[] readBuffer = Utf8NoBom.GetBytes(testString);

    [Benchmark]
    public void WriteString()
    {
        // Encode the string into a pre-allocated buffer (span-based overload).
        Utf8NoBom.GetBytes(testString.AsSpan(), this.writeBuffer.AsSpan());
    }

    [Benchmark]
    public void ReadString()
    {
        // Decode the pre-encoded UTF-8 bytes back into a string (span-based overload).
        Utf8NoBom.GetString(this.readBuffer.AsSpan());
    }

    static void Main(string[] args)
    {
        BenchmarkRunner.Run<Benchmark>();
    }
}
AMD system results:
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=3.1.101
[Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
Job-RNGYVQ : .NET Core 2.1.15 (CoreCLR 4.6.28325.01, CoreFX 4.6.28327.02), X64 RyuJIT
Job-ILXJIL : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
| Method | Runtime | Mean | Error | StdDev |
|------------ |-------------- |---------:|---------:|---------:|
| WriteString | .NET Core 2.1 | 24.68 ns | 0.187 ns | 0.175 ns |
| ReadString | .NET Core 2.1 | 23.33 ns | 0.373 ns | 0.349 ns |
| WriteString | .NET Core 3.1 | 70.51 ns | 0.787 ns | 0.736 ns |
| ReadString | .NET Core 3.1 | 86.46 ns | 0.153 ns | 0.143 ns |
Intel system results:
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.101
[Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
Job-POUEZF : .NET Core 2.1.15 (CoreCLR 4.6.28325.01, CoreFX 4.6.28327.02), X64 RyuJIT
Job-JZNDHJ : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
| Method | Runtime | Mean | Error | StdDev |
|------------ |-------------- |---------:|---------:|---------:|
| WriteString | .NET Core 2.1 | 34.38 ns | 0.709 ns | 1.204 ns |
| ReadString | .NET Core 2.1 | 31.24 ns | 0.652 ns | 1.376 ns |
| WriteString | .NET Core 3.1 | 13.61 ns | 0.301 ns | 0.519 ns |
| ReadString | .NET Core 3.1 | 29.86 ns | 0.623 ns | 1.340 ns |
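A quick way to sanity-check the cause (not part of the original report): the discussion below points at BMI2's PDEP/PEXT usage, and CoreCLR appears to honor a `COMPlus_EnableBMI2=0` environment variable that disables that path. The sketch below only confirms what the runtime reports on a given machine before and after setting it:

```csharp
using System;
using System.Runtime.Intrinsics.X86;

public static class IsaReport
{
    // Prints which relevant ISAs the JIT reports as usable in this process.
    // Launching with COMPlus_EnableBMI2=0 set (assumption: honored by CoreCLR 3.x)
    // should flip Bmi2 to False and take the PDEP/PEXT code paths out of play.
    public static void Main()
    {
        Console.WriteLine($"Sse41: {Sse41.IsSupported}");
        Console.WriteLine($"Avx2:  {Avx2.IsSupported}");
        Console.WriteLine($"Bmi2:  {Bmi2.IsSupported}");
    }
}
```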
@jkotas I do understand your point of view, but ultimately, once the intrinsics cat is out of the bag, a world of complexity comes with it. Sure, you could postpone supporting/dealing with this to some extent, but ultimately this `PDEP`/`BMI2` situation is an early/light example of what is to come in the future: `PDEP`, as @tannergooding pointed out, is just one of a wide variety of quirks associated with this world of bleeding-edge instructions, where different models from the same manufacturer exhibit wildly different results when instructed to execute specific instructions.

It seems very arbitrary to draw a line around what sort of complex view developers “get to see”, accepting partial solutions like the `IsSupported` boolean property while rejecting `CPUID` and its likes for the general case.

As an example, let’s assume that at some future point in time, CoreCLR will support AVX512 to some extent, if only to be able to support `VNNI` extensions for ML.NET or any of the 20 or so instruction families that come in future processors, simply because you will not be able to withstand the pressure to provide the associated performance/functionality that comes with it. I assume we’ll all agree (?) that AVX512 support will be exposed through an imaginary `AVX512F.IsSupported` property, much like `AVX2` is exposed today? What would prevent some future developer from detecting such AVX512 support through the `AVX512F.IsSupported` property and using it as a “signal” to provide a different AVX2 code-path, under the assumption that the full set of 32 `ymm0`-`ymm31` registers would be supported by the JIT during code-generation? This is essentially an abuse of support you would have to provide in order to expose code-paths that you never intended to support (e.g. multiple code paths per number of AVX2 registers available on a given CPU), and it is essentially equivalent to exposing the actual data through CPUID in a clear and complete way, allowing developers to query the CPU explicitly for how many registers are supported.

Querying the CPU manufacturer through CPUID, as per #785, to detect an AMD processor, or querying cache latency / line sizes / structure, shouldn’t be considered radically different from getting partial CPUID “bits” leaked out through the `IsSupported` flags.

I guess that my argument is, and perhaps this is also the problem with it, that you either accept your fate and the need to support these exotic instructions and the complexities they come with, because you wish to be able to play that game and compete with the programming languages that do, or you don’t. I personally don’t see much of a middle ground that makes sense, and I can easily see a variety of ways where over-simplifying will ultimately fail.
Then again, I’m not the one tasked to provide and maintain this complexity, so ╮( ˘ 、 ˘ )╭
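For reference, and not part of the comment above, the kind of manufacturer check being debated could be sketched roughly as follows. `CpuVendor.IsAuthenticAMD` is a made-up property, and `X86Base.CpuId` is the API shape that later shipped in .NET 5 rather than anything available in .NET Core 3.1:

```csharp
using System;
using System.Runtime.Intrinsics.X86;
using System.Text;

public static class CpuVendor
{
    // Hypothetical manufacturer check built on CPUID leaf 0. X86Base.CpuId is the
    // shape .NET 5 later exposed; .NET Core 3.1 would need a native helper instead.
    public static bool IsAuthenticAMD => GetVendorString() == "AuthenticAMD";

    private static string GetVendorString()
    {
        if (!X86Base.IsSupported)
        {
            return string.Empty;
        }

        // Leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX (in that order).
        (_, int ebx, int ecx, int edx) = X86Base.CpuId(0, 0);

        byte[] bytes = new byte[12];
        BitConverter.GetBytes(ebx).CopyTo(bytes, 0);
        BitConverter.GetBytes(edx).CopyTo(bytes, 4);
        BitConverter.GetBytes(ecx).CopyTo(bytes, 8);
        return Encoding.ASCII.GetString(bytes);
    }
}
```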
There are exceptions to every rule, and System.Runtime.Intrinsics violates most of the standard simplicity rules the framework has. It is ~1500 hardware-specific APIs (and that is practically doubling with the introduction of the ARM intrinsics). It requires the user to write unsafe code and to code multiple similar algorithms for the various paths, to ensure good codegen is possible on all paths and that a software fallback is available.
There can already be large performance differences for a given instruction depending on what microarchitecture you are comparing; it isn’t just limited to PDEP/PEXT (which happens to be an extreme case). If you go back to pre-AVX machines, you will get severe penalties for using unaligned loads/stores. If you use non-temporal stores (even on modern hardware) for the wrong scenario, you can be looking at ~400-cycle latency rather than 1-cycle latency; but used in the right scenario (such as a large memcpy), you can achieve over a 2x perf increase. If you compare “Broadwell” to the newer “Skylake-X” (https://www.agner.org/optimize/instruction_tables.pdf), there are several instructions that have gotten slower and several that have gotten faster (sometimes significantly). Developers using hardware intrinsics need to consider these scenarios; ourselves included.
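To make the non-temporal store trade-off concrete, here is a minimal sketch (not from the thread; `NonTemporalCopy.Copy` and the 256 KB threshold are illustrative assumptions) of code that reserves `Sse2.StoreAlignedNonTemporal` for large, write-once copies:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class NonTemporalCopy
{
    // Hypothetical helper: copy 'length' bytes, using non-temporal stores only for
    // large buffers whose contents will not be read back soon.
    public static unsafe void Copy(byte* src, byte* dst, int length)
    {
        const int NonTemporalThreshold = 256 * 1024; // assumption: comfortably larger than L2

        if (!Sse2.IsSupported || length < NonTemporalThreshold || ((ulong)dst & 15) != 0)
        {
            // Regular cached path: small copy, no SSE2, or unaligned destination.
            Buffer.MemoryCopy(src, dst, length, length);
            return;
        }

        int i = 0;
        for (; i <= length - 16; i += 16)
        {
            Vector128<byte> chunk = Sse2.LoadVector128(src + i);
            // Bypasses the cache: a win for huge one-shot copies, a large penalty
            // if the destination is read back immediately afterwards.
            Sse2.StoreAlignedNonTemporal(dst + i, chunk);
        }

        // Copy the remaining tail bytes, if any.
        Buffer.MemoryCopy(src + i, dst + i, length - i, length - i);

        // Non-temporal stores are weakly ordered; fence before other threads read 'dst'.
        Sse2.MemoryFence();
    }
}
```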
I believe this is going to set a bad precedent for the hardware intrinsics. Especially if this happens to a larger ISA where a couple of instructions happen to behave poorly. For example, we may eventually add AVX-512 support. That support is relatively new on Intel and doesn’t exist on AMD yet, but may exist there someday. AVX-512 is split into multiple feature ISAs and exposes ranges of instructions. It may be that one of the two (Intel or AMD) has significantly worse performance on an instruction as compared to the other. Someone will likewise log a bug and then this will be looked to as an example of how it should be handled.
In my mind, the correct thing is to just reiterate that this is an extremely low level feature that allows you to fine tune performance, but as part of that, the amount of code you need to write and what considerations you need to take are also greatly expanded (it is effectively writing pseudo-inline assembly after all, just without control over register allocation and spilling).
Yes, which is something developers already need to be concerned about. They already need to consider manufacturer and microarchitecture differences because what works well on “X” may not work well on “Y”. There is generally some pattern that works well on both, but that is not always the case/scenario.
Neither is disabling the functionality for AMD systems. We will still have the consideration of Znver1/Znver2 vs the new ZnverX that has it be faster. CPUID is meant to expose hardware-specific information such as the `manufacturer ID`, `vendor`, and `model` numbers. It can therefore be used to limit code paths to manufacturers (`IsGenuineIntel` vs `IsAuthenticAMD`) or even to specific CPU models (`IsAuthenticAMD && (ExtendedModelId < value)`). The difference is whether it is internalized to the JIT and prevents use of instructions on hardware that supports them, or whether it puts the responsibility on the developer to ensure the code path is efficient on all hardware they plan on targeting (which I believe is more in line with the principles of the feature and what developers already need to consider).

My belief is that it goes against the principles of the hardware intrinsics. They were designed to give developers the ability to access the functionality available in the underlying hardware. While one of the main focuses is performance, there is no possible way for us to “cleanly” expose only things which are always performant (as it greatly depends on the use-case, etc.). Even for PDEP/PEXT (on AMD), there are some cases where it takes as few as 18 cycles and some cases where it takes as many as almost 300.
Developers already have to code support for vastly different CPUs (ARM vs x86) and for different hardware levels (SSE/SSE2 as a baseline vs SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, BMI1, BMI2, LZCNT, POPCNT etc; and mixtures there-in). They also already need to consider microarchitectural differences to an extent (instructions may perform significantly worse on some older hardware; the cost of using unaligned vector loads/stores may be significantly greater or lesser; etc). Allowing them to explicitly check for the manufacturer and model/family isn’t a great jump from there and is something that has already been requested by external customers.
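For illustration only, the multi-path pattern described above might look like the following hypothetical `ExtractBits` helper, which uses `Bmi2.ParallelBitExtract` where supported and a scalar loop as the software fallback:

```csharp
using System.Runtime.Intrinsics.X86;

public static class BitTricks
{
    // Hypothetical helper with PEXT semantics: gather the bits of 'value' selected
    // by 'mask' into the low bits of the result.
    public static uint ExtractBits(uint value, uint mask)
    {
        if (Bmi2.IsSupported)
        {
            // Single pext instruction: fast on current Intel cores, microcoded and
            // far slower on Zen 1/Zen 2.
            return Bmi2.ParallelBitExtract(value, mask);
        }

        // Software fallback: walk the mask bit by bit.
        uint result = 0;
        int outPos = 0;
        for (int i = 0; i < 32; i++)
        {
            uint bit = 1u << i;
            if ((mask & bit) != 0)
            {
                if ((value & bit) != 0)
                {
                    result |= 1u << outPos;
                }
                outPos++;
            }
        }
        return result;
    }
}
```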
I would be strongly against this approach, both in terms of how the hardware intrinsics have been exposed and as someone who has an AMD processor. `Bmi2` exposes 10 APIs, only 4 of which have an issue. We shouldn’t restrict the ability to use the other 6 APIs just because 4 of them have bad perf.

We have provided no guarantees that using a given hardware intrinsic will be “performant”; only that it will emit the instruction a given API is mapped to. It has always been up to the user to utilize them efficiently (which includes pipelining, dealing with micro-architectural differences, etc.). I also don’t think it will be a point-in-time issue. Not all customers upgrade their CPUs every 3-5 years, and we’ll be in the same position for the AMD Zen1/Zen2 CPUs even if Zen3 ends up resolving the issue. Looking at the CPUs that Azure lists (https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/), the majority are “Broadwell”, which is circa 2013; just to give a rough timeline on how long I think we would need to support the workaround if implemented.
I’m not convinced we need a workaround. Looking at the Intel numbers with BMI2 enabled and the AMD numbers with BMI2 disabled, they are actually similar:
We should likely, first and foremost, get some numbers on how much “gain” Intel is getting via PDEP/PEXT here. My initial speculation is that they are not actually providing that great of a benefit and we would likely minimally regress Intel (while significantly boosting AMD) by just dropping their usage in the framework (at least here where it is only used “once” per call; and not in a hot loop).
I don’t believe it will be significantly more complex. We are already testing the various ISA combinations. We will continue to get full coverage of code paths even if we codify a few paths to do `if (Isa.IsSupported && !Cpuid.IsAuthenticAMD) { } else { /* code path where Isa is not supported */ }` (although, as per the above, I don’t believe this is required).

I’m torn on whether the runtime lying about the intrinsic being supported is the correct approach. That seems to put the runtime into the position of making a value judgment. Up until now the intrinsics have been only raw capability checks.
Lose some non-flag-altering instructions (probably not high usage), and `BZHI`, which can be replaced by `src & ((1 << inx) - 1)`; that is the downside and doesn’t seem too high? However, being between 18x and 290x slower for `PDEP` and `PEXT` makes them completely undesirable instructions to use? Can we provide more general-purpose fallback versions in `BitOperations` for .NET 5.0?

The model for the hardware intrinsics is complex already. This would make it even more complex than what it is today.
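As a sketch of the “general-purpose fallback in `BitOperations`” idea, a hypothetical `ZeroHighBits` helper (not an actual `BitOperations` API) could look like this:

```csharp
using System.Runtime.Intrinsics.X86;

public static class BitOperationsSketch
{
    // Hypothetical BitOperations-style helper with BZHI semantics for 0 <= index <= 31:
    // clear all bits of 'value' at position 'index' and above.
    public static uint ZeroHighBits(uint value, int index)
    {
        if (Bmi2.IsSupported)
        {
            return Bmi2.ZeroHighBits(value, (uint)index);
        }

        // Scalar fallback along the lines suggested above: value & ((1 << index) - 1).
        return value & ((1u << index) - 1);
    }
}
```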
My suggestion for how to fix this issue would be to make `Bmi2.IsSupported` return false on AMD64 processors. Once/if the AMD64 processors get this performance bug fixed, we can update the runtime to detect the versions with the fix and start returning true from this method again.