runtime: AVX-512 support in System.Runtime.Intrinsics.X86

I presume supporting AVX-512 intrinsics is in plan somewhere, but couldn’t find an existing issue tracking their addition. There seem to be two parts to this.

  1. Support for EVEX encoding and use of zmm registers. I’m not entirely clear on the compiler versus JIT split, but perhaps this would allow the JIT to re-encode existing 128- and 256-bit-wide code using the Sse*, Avx*, or other System.Runtime.Intrinsics.X86 classes as EVEX.
  2. Addition of Avx512 classes with the new instructions at 128, 256, and 512 bit widths.

There is some interface complexity with the (as of this writing) 17 AVX-512 subsets since Knights Landing/Mill, Skylake, Cannon Lake, Cascade Lake, Cooper Lake, and Ice/Tiger Lake all support different variations. To me, it seems most natural to deprioritize support for the Knights (they’re no longer in production, so presumably nearly all code targeting them has already been written) and implement something in the direction of

class Avx512FCD : Avx2 // minimum common set across all Intel CPUs with AVX-512
class Avx512VLDQBW : Avx512FCD // common set for enabled Skylake μarch cores and Sunny Cove

plus non-inheriting classes for BITALG, IFMA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT (the remaining four subsets, 4FMAPS, 4VNNIW, ER, and PF, are specific to Knights). This is similar to the existing model for Bmi1, Bmi2, and Lzcnt and aligns to current hardware in a way which composes with existing inheritance and IsSupported properties. It also helps with incremental rollout.
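A sketch of how such a hierarchy might be consumed. Avx512FCD and Avx512VLDQBW are the hypothetical class names proposed above; none of these types exist in System.Runtime.Intrinsics.X86 today, so this is illustrative only:

```csharp
// Hypothetical: Avx512FCD and Avx512VLDQBW are the proposed names from
// this issue, not real System.Runtime.Intrinsics.X86 types.
if (Avx512VLDQBW.IsSupported)
{
    // Enabled Skylake μarch cores and Sunny Cove: F, CD, VL, DQ, and BW all usable.
}
else if (Avx512FCD.IsSupported)
{
    // Minimum common AVX-512 across Intel CPUs: F and CD only.
}
else if (Avx2.IsSupported)
{
    // Fallback via the proposed inheritance from Avx2.
}
```

Because the proposed classes inherit from Avx2, the usual pattern of checking the most capable ISA first and falling through still composes.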

Finding naming for code readability that’s still clear as to which instructions are available where seems somewhat tricky. Personally, I’d be content with idioms like

using Avx512 = System.Runtime.Intrinsics.X86.Avx512VLDQBW; // loose terminology

but hopefully others will have better ideas.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 21
  • Comments: 55 (30 by maintainers)

Most upvoted comments

We’re already working on adding AVX-512 support in .NET 8, a few foundational PRs have already been merged 😉

Yes, we’ll need to continue waiting for more details. AVX-512, per specification, requires at least AVX-512F, which includes the EVEX encoding, the additional 16 registers, 512-bit register support, and the kmask register support.

The “ideal” scenario is that this also includes AVX-512VL, making the EVEX encoding, the additional 16 registers, and the masking support available to all 128-bit and 256-bit instructions. This would allow better optimizations for existing code paths and use of the new instructions, including full-width permute and vptern, and wouldn’t only light up for large inputs and HPC scenarios.

However, having official confirmation that the ISA (even if only AVX-512F) is now going to be cross-vendor does help justify work done towards supporting this.

Now that Zen4 is officially out, I can confirm the AVX-512 supported ISAs:

This is AVX512:

  • F
  • CD
  • BW
  • DQ
  • IFMA
  • VL
  • VBMI
  • VBMI2
  • GFNI
  • VAES
  • VNNI
  • BITALG
  • VPCLMULQDQ
  • VPOPCNTDQ
  • BF16

That is everything except VP2INTERSECT and FP16 (the latter of which hasn’t shipped officially supported anywhere yet).

  • This does not include ER, PF, 4FMAPS, and 4VNNIW, which were only in Knights Landing/Mill (descendants of what was originally known as IMCI) and have been effectively deprecated

The work required here can effectively be broken down into a few categories:

The first step is to update the VM to query CPUID and track the available ISAs. Then the basis of any additional work is adding support for EVEX encoded instructions but limiting it only to AVX-512VL with no masking support and no support for XMM16-XMM31. This would allow access to new 128-bit and 256-bit instructions but not access to any of the more complex functionality. It would be akin to exposing some new AVX3 ISA in complexity.

Then there are three more complex work items that could be done in any order.

  1. Extend the register support to the additional 16 registers AVX-512 makes available. These are XMM16-XMM31 and would require work in the thread, callee, and caller save/restore contexts, work in the debugger, and some minimal work in the register allocator to indicate they are available but only on 64-bit and only when AVX-512 is supported.
  2. Extend the register support to the upper 256 bits of the registers. This involves exposing and integrating a TYP_SIMD64 throughout the JIT, as well as work in the thread, callee, and caller save/restore contexts and work in the debugger.
  3. Extend the register support to the KMASK registers. This would require more significant work in the register allocator and potentially the JIT to support the “entirely new” concept and registers.

There is then library work required to expose and support Vector512<T> and the various AVX-512 ISAs. This work could be done incrementally alongside the other work.


Conceivably the VM and basic EVEX encoding work are “any time”. It would require review from the JIT team but is not complex enough that it would be impractical to consider incrementally. The same goes for any library work exposed around this, with the caveat that API review would likely want to see more concrete numbers on the number of APIs exposed, where they are distributed, etc.

The latter three work items would touch larger amounts of JIT code, however, and could only be worked on if the JIT team knows they have the time and resources to review and ensure everything is working as expected. For some of these more complex work items it may even be desirable to have a small design doc laying out how it’s expected to work, particularly for KMASK registers.

I think “small inputs” are not very common for vectorized operation use cases.

It really depends on your scenarios but, in general, I tend to disagree with this statement. Intel processors, in particular, tend to impose transition latencies and downclock on wider vectors. There are a lot of detailed cases here depending on the exact hardware, which vector widths you’re transitioning between, and so on. But, broadly speaking, quite a few of the compute kernels I’ve written don’t gain enough at 256 bits to offset downclocking on processors where that occurs. They therefore run faster in their 128-bit-wide version. This is particularly likely for loops that run hundreds of thousands to millions of iterations yet complete within the few milliseconds a vector width transition takes while the CPU requests its power supply to increase voltage. And, since upclocking after downclocking is similarly sticky, even if your loop runs faster it may not be a net win when narrower code follows.

One such case I often see is a light compute kernel maxing out DRAM bandwidth at 128 bits. Going wider (or multithreaded) on these just hits memory access harder and frequently profiles a few percent slower.

If you have a set of compute-dense kernels working on enough data that they can pound full vector width for, say, a couple seconds without getting hung up on bottlenecks like AVX lane swaps or cache misses then, yes, there’s a decent chance it’s beneficial to maximize vector width. However, it’s not uncommon to see narrower workloads run faster due to more concurrent ALU port utilization even when downclocking isn’t an issue. I also have workloads where data-to-data dependencies are such that 128 bits accelerates nicely but 256-bit versions of the same loop run more slowly due to dependencies between different parts of the longer vectors, even without downclocking. Most of those kernels don’t express any better if opened up to 512.

These are some of the reasons why my exchanges with Tanner earlier on this issue focused on AVX512VL and not so much on going 512 bits wide. From a development standpoint, typically what I do is enable hot scalar loops for 128-bit dispatch and profile that. If it’s not fast enough, then I look at 256-bit and so on. Often what I see is that going to VEX-encoded 128-bit captures most of the benefit, frequently because ymm availability avoids xmm register spilling, and there’s not enough gain from going wider to justify supporting and testing the additional code paths. One thing which prompted me to open this issue, in fact, was reviewing VEX128 disassemblies and recognizing that having zmm access via AVX512VL would avoid all the ymm spilling that was bogging those particular loops down.
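The tiered dispatch described above can be sketched as a ladder of IsSupported checks. The method name and returned strings here are illustrative stubs, not real APIs; a real kernel would branch into width-specific implementations instead:

```csharp
using System;
using System.Runtime.Intrinsics.X86;

// Illustrative dispatch ladder: each IsSupported check is evaluated at
// runtime (and folded to a constant by the JIT), so unsupported paths
// are dropped entirely on a given machine.
static string SelectKernel()
{
    if (Avx2.IsSupported) return "256-bit (VEX)";
    if (Sse2.IsSupported) return "128-bit";
    return "scalar";
}

Console.WriteLine(SelectKernel());
```

In practice you would profile each tier before enabling the wider one, per the workflow described above; the widest supported path is not automatically the fastest.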

TL;DR, what Tanner said. 🙂

Vectors get used internally in the framework in many locations, including things like IndexOf for strings/spans.

You can certainly check on Vector<T>.Count, but automatically using 64-byte vectors will change the perf semantics of existing code. That is, code that was vectorized for 16-64 bytes or 32-64 bytes will no longer be vectorized once Vector<T> is implicitly 512-bits.

While it’s true that you get the most benefit from vectors on large inputs, there are also many cases where they are beneficial for smaller inputs and where they can accelerate common inputs. Take, for example, names, where you will commonly have 10-32 characters (20-64 bytes). AVX-512 will not commonly help here, but 128-bit vectors can (as you can drastically reduce the number of comparisons needed).

There are many more examples where this comes up in real world code and so how 512-bit support is exposed needs to be considered, including how users either opt-in or opt-out of getting implicit vectors of that size.
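The Vector&lt;T&gt;.Count guard mentioned above looks roughly like this. This is a minimal sketch using the real System.Numerics.Vector&lt;T&gt; API; the point is that the vector path only fires when the input is at least one vector wide, so implicitly widening Vector&lt;T&gt; to 512 bits would silently push more inputs onto the scalar tail:

```csharp
using System;
using System.Numerics;

// Vector<T>.Count is 4 ints on SSE2, 8 on AVX2, and would become 16 if
// Vector<T> were implicitly widened to 512 bits on AVX-512 hardware.
static int Sum(ReadOnlySpan<int> values)
{
    int total = 0;
    int i = 0;
    if (Vector.IsHardwareAccelerated && values.Length >= Vector<int>.Count)
    {
        var acc = Vector<int>.Zero;
        for (; i <= values.Length - Vector<int>.Count; i += Vector<int>.Count)
            acc += new Vector<int>(values.Slice(i));
        total = Vector.Dot(acc, Vector<int>.One); // horizontal sum of the accumulator
    }
    for (; i < values.Length; i++) // scalar tail (or entire input, if too short)
        total += values[i];
    return total;
}

Console.WriteLine(Sum(new int[] { 1, 2, 3, 4, 5 })); // prints 15
```

The result is identical at any vector width, but which inputs actually take the vectorized path shifts with Vector&lt;int&gt;.Count, which is exactly the perf-semantics concern raised above.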

Vector<T> support would also be very nice.

Variable size for Vector<T> already results in unpredictable performance between AVX2 and non-AVX2 hardware due to the cost of cross-lane operations and the larger minimum vector size being useful in fewer places. Auto-extending Vector<T> to 64 bytes on AVX-512 hardware would aggravate the situation.

However, the common API defined for cross-platform vector helpers (#49397) plus Static Abstracts in Interfaces would allow the best of both worlds: shared logic where vector size doesn’t matter, plus ISA- or size-specific logic where it does.
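One possible shape for that combination, sketched with an invented interface. ISimdVector and WidthGeneric below are illustrative names in the spirit of #49397 plus static abstracts in interfaces, not the API the BCL actually exposes:

```csharp
using System;

// Invented for illustration: a width-generic vector abstraction.
interface ISimdVector<TSelf, T> where TSelf : ISimdVector<TSelf, T>
{
    static abstract int Count { get; }
    static abstract TSelf Load(ReadOnlySpan<T> source);
    static abstract TSelf Add(TSelf left, TSelf right);
}

static class WidthGeneric
{
    // Written once; the caller instantiates it with a 128-, 256-, or
    // 512-bit TVec where size matters and shares the logic where it doesn't.
    public static TVec AddBlock<TVec, T>(ReadOnlySpan<T> a, ReadOnlySpan<T> b)
        where TVec : ISimdVector<TVec, T>
        => TVec.Add(TVec.Load(a), TVec.Load(b));
}
```

This keeps the explicit-width choice in the caller’s hands, avoiding the “implicitly 512-bit” problem described above.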

The gist of it is that @HighPerfDotNet was right. AMD Zen 4 has full AVX-512 support (full in the sense that it is competitive with the best Intel offerings).

I submit to you that this makes supporting AVX-512 much more compelling.

Zen4’s AVX512 flavor set is all of Ice Lake’s plus AVX512-BF16.

So Zen4 has: AVX512-F, AVX512-CD, AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-IFMA, AVX512-VBMI, AVX512-VNNI, AVX512-BF16, AVX512-VPOPCNTDQ, AVX512-VBMI2, AVX512-VPCLMULQDQ, AVX512-BITALG, AVX512-GFNI, and AVX512-VAES.

The ones it is missing are: the Xeon Phi ones (AVX512-PF, AVX512-ER, AVX512-4FMAPS, AVX512-4VNNIW), AVX512-VP2INTERSECT (from Tiger Lake), and AVX512-FP16 (from Sapphire Rapids and AVX512-enabled Alder Lake).

Source: https://www.mersenneforum.org/showthread.php?p=614191

The raw slides aren’t available, but its covered by the recorded and publicly available webcast: https://ir.amd.com/news-events/financial-analyst-day

Skip to 45:35 for the relevant portion and slide.

Edit: Raw slides are under the Technology Leadership link.

CC: @tannergooding

AMD confirmed that consumer level Zen 4 (due out in fall) will support AVX-512, source:

https://videocardz.com/newz/amd-confirms-ryzen-7000-is-up-to-16-cores-and-170w-tdp-rdna2-integrated-gpu-a-standard-ai-acceleration-based-on-avx512

So that means even consumer-level chips will support it, meaning it will also be in the server Genoa chips due Q4 this year. This also means Intel will have to enable AVX-512 in their consumer chips too.

Perhaps implementing a C-style asm keyword in C# could be an alternative to supporting specific intrinsics…

Minor status bump: Intel’s never been particularly active on their instruction set extensions forum but they’ve recently stopped responding entirely. So no update from Intel on the questions about the arch manual and intrinsics guide that were raised here a month ago.

the documented check is that they are distinct ISAs and an implementation is allowed to provide F without CD

Interesting. The arch manual states software must also check for F when checking for CD (and strongly recommends checking F before CD). You’ve more privileged access to what Intel really meant and context on how to resolve conflicts between the arch manual and intrinsics guide than most of us. Thanks for sharing.

Dependencies exist only on F and VL and seem unlikely to be concerns on Intel hardware

It is a bit more in depth than this…

ANDPD, for example, depends on AVX512DQ for the 512-bit variant. The 128- and 256-bit variants depend on both AVX512DQ and AVX512VL. Since this has two dependencies, it can’t be modeled using the existing inheritance hierarchy.

Now, given how VL works, it might be feasible to expose it as the following:

public abstract class AVX512F : ??
{
    public abstract class VL
    {
    }
}

public abstract class AVX512DQ : AVX512F
{
    public abstract class VL : AVX512F.VL
    {
    }
}

This would key off the existing model we have used for 64-bit extensions (e.g. Sse41.X64) and still maintains the rule that AVX512F.VL.IsSupported means AVX512F.IsSupported, etc. There are also other considerations that need to be taken into account, such as what AVX512F depends on (iirc, it is more than just AVX2 and also includes FMA, which needs to be appropriately exposed).
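A consumer of that nested-class model might look like the following. This is a hypothetical sketch against the shape proposed above; none of these types exist yet:

```csharp
// Hypothetical: under the nested-class model, VL support implies the
// parent ISA, mirroring how Sse41.X64.IsSupported implies Sse41.IsSupported.
if (Avx512DQ.VL.IsSupported)
{
    // 128/256-bit EVEX forms of DQ instructions are available here;
    // Avx512DQ.IsSupported and Avx512F.VL.IsSupported are both
    // guaranteed true by the inheritance rules described above.
}
else if (Avx512DQ.IsSupported)
{
    // Only the 512-bit DQ forms are guaranteed.
}
```

The appeal of this shape is that the IsSupported implication chain stays checkable by the JIT the same way the existing X64 nested classes are.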

Actually, if I had to pick just one width for initial EVEX and new intrinsic support it’d be 128.

I think this is a non-starter. The 128-bit EVEX support is not baseline, it (and the 256-bit support) is part of the AVX512VL extension and so the AVX512F class would need to be vetted first.

No, the rounding instructions convert floats to integrals, while the rounding mode impacts the returned result for x + y (for example).

IEEE 754 floating-point arithmetic is performed by taking the inputs as given, computing the “infinitely precise result”, and then rounding to the nearest representable result. When the “infinitely precise result” is equally close to two representable values, you need a tie breaker to determine which to choose. The default tie breaker is “ToEven”, but you can (not in .NET, but in other languages or in hardware) set the rounding mode to do something like AwayFromZero, ToZero, ToPositiveInfinity, or ToNegativeInfinity instead. EVEX supports doing this on a per-operation basis without having to use explicit instructions to modify and restore the floating-point control state.
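A scalar analogue of those modes is visible in .NET’s Math.Round overloads, which can make the behavior concrete even though they operate on decimal digits rather than on every arithmetic operation. The EVEX feature goes further: it lets an individual instruction carry its own static rounding mode instead of toggling the MXCSR control register around it:

```csharp
using System;

// Tie-breaking and directed rounding on the midpoint value 2.5.
Console.WriteLine(Math.Round(2.5, MidpointRounding.ToEven));        // 2 (default IEEE tie breaker)
Console.WriteLine(Math.Round(2.5, MidpointRounding.AwayFromZero));  // 3
Console.WriteLine(Math.Round(2.5, MidpointRounding.ToZero));        // 2
Console.WriteLine(Math.Round(-2.5, MidpointRounding.ToNegativeInfinity)); // -3
```

With EVEX-encoded instructions the analogous choice would be attached per operation in the emitted encoding, which is what avoids the save/modify/restore dance on the floating-point control state described above.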