runtime: Performance Regression: UTF8 Span Methods 3x slower on AMD Platforms (NetCore2.1 -> 3.1)

The performance of some UTF8 string methods appears to have regressed significantly on AMD platforms in .NET Core 3.1. Other methods may be affected as well.

Test:

    using System.Text;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.NetCoreApp21)]
    [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.NetCoreApp31)]
    public class Benchmark
    {
        const string testString = "hello world";
        private static readonly Encoding Utf8NoBom = new UTF8Encoding(false);

        private readonly byte[] writeBuffer = new byte[1024];
        private readonly byte[] readBuffer = Utf8NoBom.GetBytes(testString);

        [Benchmark]
        public void WriteString()
        {
            Utf8NoBom.GetBytes(testString.AsSpan(), this.writeBuffer.AsSpan());
        }

        [Benchmark]
        public void ReadString()
        {
            Utf8NoBom.GetString(this.readBuffer.AsSpan());
        }

        static void Main(string[] args)
        {
            BenchmarkRunner.Run<Benchmark>();
        }
    }

AMD system results:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=3.1.101
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-RNGYVQ : .NET Core 2.1.15 (CoreCLR 4.6.28325.01, CoreFX 4.6.28327.02), X64 RyuJIT
  Job-ILXJIL : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT


|      Method |       Runtime |     Mean |    Error |   StdDev |
|------------ |-------------- |---------:|---------:|---------:|
| WriteString | .NET Core 2.1 | 24.68 ns | 0.187 ns | 0.175 ns |
|  ReadString | .NET Core 2.1 | 23.33 ns | 0.373 ns | 0.349 ns |
| WriteString | .NET Core 3.1 | 70.51 ns | 0.787 ns | 0.736 ns |
|  ReadString | .NET Core 3.1 | 86.46 ns | 0.153 ns | 0.143 ns |

Intel system results:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.101
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-POUEZF : .NET Core 2.1.15 (CoreCLR 4.6.28325.01, CoreFX 4.6.28327.02), X64 RyuJIT
  Job-JZNDHJ : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT


|      Method |       Runtime |     Mean |    Error |   StdDev |
|------------ |-------------- |---------:|---------:|---------:|
| WriteString | .NET Core 2.1 | 34.38 ns | 0.709 ns | 1.204 ns |
|  ReadString | .NET Core 2.1 | 31.24 ns | 0.652 ns | 1.376 ns |
| WriteString | .NET Core 3.1 | 13.61 ns | 0.301 ns | 0.519 ns |
|  ReadString | .NET Core 3.1 | 29.86 ns | 0.623 ns | 1.340 ns |

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 37 (35 by maintainers)

Most upvoted comments

@jkotas I do understand your point of view, but ultimately, once the intrinsics cat is out of the bag, a world of complexity comes with it. Sure, you could postpone supporting/dealing with this to some extent, but ultimately this PDEP/BMI2 issue is an early, comparatively light example of what is to come in the future:

  • Cache lines will cease being homogeneous, as they have been since 2003 (it seems 128-byte cache lines are coming down the pipe).
  • If/when AVX512 is supported, developers will want to detect it and provide code paths that take advantage of the 16 extra architectural registers the processor supports.
  • Beyond PDEP, as @tannergooding pointed out, there is a wide variety of quirks associated with this world of bleeding-edge instructions, and different models from the same manufacturer exhibit wildly different results when executing specific instructions.

It seems very arbitrary to draw a line around what sort of complex view developers “get to see”: accepting partial solutions like the IsSupported boolean property while rejecting CPUID and the like for the general case.

As an example, let’s assume that at some future point in time CoreCLR will support AVX512 to some extent, if only to be able to support the VNNI extensions for ML.NET or any of the 20 or so instruction families that come in future processors, simply because you will not be able to withstand the pressure to provide the associated performance/functionality that comes with it.

I assume we’ll all agree (?) that AVX512 support would be exposed through an imaginary AVX512F.IsSupported property, much like AVX2 is exposed today. What would prevent some future developer from detecting AVX512 support through that property and using it as a “signal” to provide a different AVX2 code-path, under the assumption that the full set of 32 ymm0-31 registers would be supported by the JIT during code-generation (sketched below)? This is essentially an abuse of support you would have to provide, in order to expose code-paths that you never intended to support (e.g. multiple code paths per number of AVX2 registers available on a given CPU). It is effectively equivalent to exposing the actual data through CPUID in a clear and complete way, allowing developers to ask the CPU explicitly how many registers are supported.
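
To make the scenario concrete, here is a hedged sketch of that kind of misuse. Avx512F does not exist in .NET Core 3.1, and the kernel helpers are hypothetical names invented for illustration:

    using System;
    using System.Runtime.Intrinsics.X86;

    static class RegisterAwareDispatch
    {
        public static void Process(Span<float> data)
        {
            // Hypothetical: Avx512F.IsSupported is used purely as a proxy signal for
            // "the JIT can allocate 32 vector registers", not to emit AVX-512 code.
            if (Avx512F.IsSupported)
                Avx2KernelTunedFor32Registers(data);   // hypothetical helper
            else if (Avx2.IsSupported)
                Avx2KernelTunedFor16Registers(data);   // hypothetical helper
            else
                ScalarFallback(data);
        }

        private static void Avx2KernelTunedFor32Registers(Span<float> data) { /* ... */ }
        private static void Avx2KernelTunedFor16Registers(Span<float> data) { /* ... */ }
        private static void ScalarFallback(Span<float> data) { /* ... */ }
    }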

Querying the CPU manufacturer through CPUID, as per #785, to detect an AMD processor, or querying cache latency / line sizes / structure, shouldn’t be considered radically different from getting partial CPUID “bits” leaked out through the IsSupported flags.

I guess my argument is, and perhaps this is also the problem with it, that you either accept your fate and the need to support these exotic instructions with the complexity they bring, because you wish to play that game and compete with the programming languages that do, or you don’t. I personally don’t see much of a middle ground that makes sense, and I can easily see a variety of ways in which over-simplifying will ultimately fail.

Then again, I’m not the one tasked to provide and maintain this complexity, so ╮( ˘ 、 ˘ )╭

One of the key values of .NET platform is simplicity. Among other things, we hide number of OS bugs, quirks and performance issues by including workarounds for them in the platform, so that our users do not have to worry about them. These workarounds may stay in the runtime code for years, but they make the .NET platform better in the long run.

There are exceptions to every rule, and System.Runtime.Intrinsics violates most of the standard simplicity rules the framework has. It is ~1500 hardware-specific APIs (and that is practically doubling with the introduction of the ARM intrinsics). It requires the user to write unsafe code and to implement multiple similar algorithms for the various paths, to ensure good codegen is possible on every path and that a software fallback is available.
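
A minimal sketch of that multi-path pattern (the type and method names are illustrative, and the vector bodies are stubbed out here; real implementations would each use their own ISA-specific loads and reductions, which is exactly the duplication being described):

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class SpanSum
    {
        // Dispatch to the widest available ISA; always keep a portable scalar
        // fallback so the code still runs where neither vector path is supported.
        public static int Sum(ReadOnlySpan<int> values)
        {
            if (Avx2.IsSupported && values.Length >= Vector256<int>.Count)
                return SumAvx2(values);
            if (Sse2.IsSupported && values.Length >= Vector128<int>.Count)
                return SumSse2(values);
            return SumScalar(values);
        }

        private static int SumScalar(ReadOnlySpan<int> values)
        {
            int sum = 0;
            foreach (int v in values)
                sum += v;
            return sum;
        }

        // Stubbed out for brevity; each would normally be a separate vectorized loop.
        private static int SumAvx2(ReadOnlySpan<int> values) => SumScalar(values);
        private static int SumSse2(ReadOnlySpan<int> values) => SumScalar(values);
    }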

There can already be large performance differences for a given instruction depending on what microarchitecture you are comparing; it isn’t just limited to PDEP/PEXT (which happens to be an extreme case). If you go back to pre-AVX machines; you will get severe penalties for using unaligned loads/stores. If you use non-temporal stores (even on modern hardware) for the wrong scenario, you can be looking at ~400 cycle latency rather than 1 cycle latency; but used in the right scenario (such as large memcpy), you can achieve over a 2x perf increase. If you compare “Broadwell” to the newer “Skylake-X” (https://www.agner.org/optimize/instruction_tables.pdf), there are several instructions that have gotten slower and several that have gotten faster (sometimes significantly). Developers using hardware intrinsics need to consider these scenarios; us included.

The other instructions do not seem to be that valuable, so we are not losing much.

I believe this is going to set a bad precedent for the hardware intrinsics. Especially if this happens to a larger ISA where a couple of instructions happen to behave poorly. For example, we may eventually add AVX-512 support. That support is relatively new on Intel and doesn’t exist on AMD yet, but may exist there someday. AVX-512 is split into multiple feature ISAs and exposes ranges of instructions. It may be that one of the two (Intel or AMD) has significantly worse performance on an instruction as compared to the other. Someone will likewise log a bug and then this will be looked to as an example of how it should be handled.

In my mind, the correct thing is to just reiterate that this is an extremely low level feature that allows you to fine tune performance, but as part of that, the amount of code you need to write and what considerations you need to take are also greatly expanded (it is effectively writing pseudo-inline assembly after all, just without control over register allocation and spilling).

The effect of this is nearly identical to returning false from Bmi2.IsSupported, except it is spread to more places and everybody using these intrinsics needs to worry about it.

Yes, which is something developers already need to be concerned about. They already need to consider manufacturer and microarchitecture differences because what works well on “X” may not work well on “Y”. There is generally some pattern that works well on both, but that is not always the case/scenario.

It is not future proof. It will not work well once AMD fixes this performance bug.

Neither is disabling the functionality for AMD systems. We will still have the consideration of Znver1/Znver2 vs some new ZnverX where the instructions are fast. CPUID is meant to expose hardware-specific information such as the manufacturer ID, vendor, and model numbers. It can therefore be used to limit code paths to manufacturers (IsGenuineIntel vs IsAuthenticAMD) or even specific CPU models (IsAuthenticAMD && (ExtendedModelId < value)). The difference is whether this is internalized in the JIT, preventing use of instructions on hardware that supports them, or whether it puts the responsibility on the developer to ensure the code path is efficient on all hardware they plan on targeting (which I believe is more in line with the principles of the feature and what developers already need to consider).
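
A sketch of what that developer-side gating could look like. The Cpuid type below stands in for the API proposed in #785 and does not exist in .NET Core 3.1:

    using System.Runtime.Intrinsics.X86;

    static class PextDispatch
    {
        public static uint ExtractBits(uint value, uint mask)
        {
            // Cpuid.IsAuthenticAMD is hypothetical (per the #785 proposal); the
            // developer, not the JIT, decides which hardware takes the PEXT path.
            if (Bmi2.IsSupported && !Cpuid.IsAuthenticAMD)
                return Bmi2.ParallelBitExtract(value, mask);

            // Portable per-bit fallback for hardware where PEXT is slow or missing.
            uint result = 0;
            for (int i = 0; mask != 0; mask &= mask - 1, i++)
            {
                uint bit = mask & (~mask + 1);   // lowest set bit of the mask
                if ((value & bit) != 0)
                    result |= 1u << i;
            }
            return result;
        }
    }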

My belief is that it goes against the principles of the hardware intrinsics. They were designed to give developers the ability to access the functionality available in the underlying hardware. While one of the main focuses is performance, there is no possible way for us to “cleanly” expose only things which are always performant (as it greatly depends on the use-case etc). Even for PDEP/PEXT (on AMD); there are some cases where it takes as few as 18 cycles and some cases where it takes as many as almost 300.

Developers already have to code support for vastly different CPUs (ARM vs x86) and for different hardware levels (SSE/SSE2 as a baseline vs SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, BMI1, BMI2, LZCNT, POPCNT etc; and mixtures there-in). They also already need to consider microarchitectural differences to an extent (instructions may perform significantly worse on some older hardware; the cost of using unaligned vector loads/stores may be significantly greater or lesser; etc). Allowing them to explicitly check for the manufacturer and model/family isn’t a great jump from there and is something that has already been requested by external customers.

My suggestion for how to fix this issue would be to make Bmi2.IsSupported false on AMD64 processors

I would be strongly against this approach both in terms of how the hardware intrinsics have been exposed and as someone who has an AMD processor. Bmi2 exposes 10 APIs, only 4 of which have an issue. We shouldn’t restrict the ability to use the other 6 APIs just because 4 of them have bad perf.

We have provided no guarantees that using a given hardware intrinsic will be “performant”; only that it will emit the instruction a given API is mapped to. It has always been up to the user to utilize them efficiently (which includes pipelining, dealing with micro-architectural differences, etc). I also don’t think it will be a point in time issue. Not all customers upgrade their CPUs every 3-5 years and we’ll be in the same position for the AMD Zen1/2 CPUs even if Zen3 ends up resolving the issue. Looking at the CPUs that Azure lists (https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/), the majority are “Broadwell” which is circa 2013; just to give a rough timeline on how long I think we would need to support the workaround if implemented.

The reason why I have proposed implementing the workaround in the runtime

I’m not convinced we need a workaround. Looking at the Intel numbers with BMI2 enabled and the AMD numbers with BMI2 disabled; they are actually similar:

Intel Write: 34.38 ns to 13.61 ns
Intel Read:  31.24 ns to 29.86 ns
AMD Write:   24.48 ns to  9.83 ns
AMD Read:    24.00 ns to 23.65 ns

We should likely, first and foremost, get some numbers on how much “gain” Intel is getting via PDEP/PEXT here. My initial speculation is that they are not actually providing that great of a benefit and we would likely minimally regress Intel (while significantly boosting AMD) by just dropping their usage in the framework (at least here where it is only used “once” per call; and not in a hot loop).
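
One plausible shape for such a measurement is below: widening four ASCII bytes to four UTF-16 code units with PDEP versus plain shifts and masks. This is only a sketch for isolating the instruction on a given CPU, not the framework's actual code:

    using BenchmarkDotNet.Attributes;
    using System.Runtime.Intrinsics.X86;

    public class PdepWidenBenchmark
    {
        private uint packed = 0x64_6C_72_6F; // four arbitrary ASCII bytes

        // Assumes Bmi2.X64.IsSupported; deposits each byte into the low byte of
        // one of the four resulting 16-bit code units.
        [Benchmark]
        public ulong WidenWithPdep() =>
            Bmi2.X64.ParallelBitDeposit(packed, 0x00FF00FF_00FF00FFul);

        // Equivalent widening with plain shifts and masks.
        [Benchmark(Baseline = true)]
        public ulong WidenWithShifts() =>
              (ulong)(packed & 0xFF)
            | ((ulong)(packed & 0xFF00) << 8)
            | ((ulong)(packed & 0xFF0000) << 16)
            | ((ulong)(packed & 0xFF000000) << 24);
    }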

The model for the hardware intrinsic is complex already. This would make it even more complex than what it is today.

I don’t believe it will be significantly more complex. We are already testing the various ISA combinations. We will continue to get full coverage of code paths even if we codify a few paths to do if (Isa.IsSupported && !Cpuid.IsAuthenticAMD) { } else { /* code path where Isa is not supported */ } (although, as per the above; I don’t believe this is required).

I’m torn on whether the runtime lying about the intrinsic being supported is the correct approach. That seems to put the runtime into the position of making a value judgment. Up until now the intrinsics have been only raw capability checks.

That seems to put the runtime into the position of making a value judgment.

Lose some non-flag-altering instructions (probably not high usage), and BZHI, which can be replaced by src & ((1 << inx) - 1); as downsides go, that doesn’t seem too high?

However, being between 18x and 290x slower for PDEP and PEXT makes them completely undesirable instructions to use?

Can we provide more general-purpose fallback versions in BitOperations for .NET 5.0?
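
A hedged sketch of what such fallbacks could look like; the type and method names are illustrative only and are not existing BitOperations APIs:

    static class ParallelBitFallbacks
    {
        // Software PDEP: deposits the low-order bits of value at the positions of
        // the set bits in mask, from least significant to most significant.
        public static ulong ParallelBitDeposit(ulong value, ulong mask)
        {
            ulong result = 0;
            for (int i = 0; mask != 0; mask &= mask - 1, i++)
            {
                ulong bit = mask & (~mask + 1);   // lowest set bit of the mask
                if ((value & (1UL << i)) != 0)    // next low-order bit of value
                    result |= bit;                // deposited at the mask position
            }
            return result;
        }

        // Software PEXT: extracts the bits of value selected by mask and packs
        // them into the low-order bits of the result.
        public static ulong ParallelBitExtract(ulong value, ulong mask)
        {
            ulong result = 0;
            for (int i = 0; mask != 0; mask &= mask - 1, i++)
            {
                ulong bit = mask & (~mask + 1);   // lowest set bit of the mask
                if ((value & bit) != 0)           // bit selected by the mask
                    result |= 1UL << i;           // packed into the low bits
            }
            return result;
        }
    }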

CPUID and willingness to use specific code paths depending on CPUID

The model for the hardware intrinsic is complex already. This would make it even more complex than what it is today.

My suggestion for how to fix this issue would be to make Bmi2.IsSupported false on AMD64 processors. Once/if the AMD64 processors get this performance bug fixed, we can update the runtime to detect the versions with the fix and start returning true from this method again.