wasmtime: Vector register endianness causing ABI incompatibility
Currently, s390x is the only big-endian platform supported by cranelift. This has already caused problems with memory accesses in the past, since wasmtime defines all data in its memory to be in little-endian byte order.
I had assumed that register contents are not affected by byte order, and while this is true for scalar types, it turns out this is not actually the case for vector registers. This is because of a combination of two issues:
- There are IR instructions that directly refer to vector register lanes by number; and
- There are IR instructions to re-interpret a vector register as any other vector type, including one with different lane numbers.
The combination of these two makes ISA byte order properties of vector registers visible to IR code, just like byte order in memory is visible via a combination of using memory addresses to access (parts of) a value in memory in a different type than it was stored.
Specifically, in the Intel (or ARM) little-endian ISA, if you load e.g. an I32X4 into a vector register, use raw_bitcast to re-interpret that register as I8X16, and retrieve lane 0 of the resulting value via extractlane, the result will be the least-significant byte of the I32 in lane 0 of the original value. And in fact, wasmtime requires this to be the case.
On the other hand, on s390x the content of a vector register in I32X4 mode will be a sequence of four I32, each in big-endian byte order (or else the arithmetic operations on the register will give wrong results). Therefore, the same sequence of operations as above will return the most-significant byte of the I32 in lane 0. This actually caused many wasmtime tests to fail.
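To make the difference concrete, here is a minimal sketch in plain Rust (not Cranelift IR, and using an arbitrary example value) of the byte that lane 0 of the reinterpreted vector contains under each lane layout:

```rust
fn main() {
    // Example I32 sitting in lane 0 of an I32X4 register.
    let lane0: u32 = 0x1122_3344;

    // Little-endian lane layout (x86/aarch64): the I32 is stored LSB-first,
    // so byte lane 0 of the value reinterpreted as I8X16 is the
    // least-significant byte, which is what wasmtime expects.
    assert_eq!(lane0.to_le_bytes()[0], 0x44);

    // Big-endian lane layout (s390x): the I32 is stored MSB-first,
    // so the same byte lane 0 holds the most-significant byte instead.
    assert_eq!(lane0.to_be_bytes()[0], 0x11);
}
```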
To fix this, my current implementation of SIMD instructions on s390x uses a trick combining two different aspects:
- When loading the I32X4 from little-endian memory into a vector register, I’m byte-reversing all 16 bytes of the loaded value. This not only fixes each I32 value to be in the correct big-endian order in the register so that subsequent arithmetic will work, it also implicitly swaps the order of the elements, i.e. the element in slot 0 in memory will end up in what the ISA considers slot 3 of the register, etc.
- The implementation of all IR instructions that use explicit lane numbers is aware of this renumbering and implicitly reverts it to get back to the lanes the code intends to access, so e.g. using extractlane for lane 0 of an I32X4 will actually, at the ISA level, extract lane 3 of the register.
The combination of these two aspects makes accessing SIMD registers work correctly for wasmtime. For example, in the above case, accessing lane 0 of an I8X16 is converted to lane 15 of the register, which holds the least-significant byte of the I32 in lane 3 of the register, which was loaded from lane 0 in memory – so in the end we return the least-significant byte of the I32 in lane 0 of the original value, as expected by wasmtime.
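A plain-Rust model of the two aspects combined, assuming an I32X4 whose lane 0 was loaded from little-endian memory ("physical" lane numbers are those of the big-endian ISA; this only models the intended semantics, not the actual lowering):

```rust
fn main() {
    // Little-endian memory image of an I32X4 whose lane 0 is 0x11223344
    // and whose lanes 1..3 are zero.
    let mut mem = [0u8; 16];
    mem[..4].copy_from_slice(&0x1122_3344u32.to_le_bytes());

    // Step 1: byte-reverse all 16 bytes when loading into the register.
    let mut reg = mem;
    reg.reverse();

    // Each I32 lane is now in big-endian order, but the element in memory
    // slot 0 sits in physical slot 3 of the register.
    let phys_lane3 = u32::from_be_bytes(reg[12..16].try_into().unwrap());
    assert_eq!(phys_lane3, 0x1122_3344);

    // Step 2: lane-indexed instructions reverse the lane number again, so
    // extractlane 0 of an I8X16 becomes physical byte lane 15.
    let logical_byte_lane = 0;
    let physical_byte_lane = 16 - 1 - logical_byte_lane;
    assert_eq!(reg[physical_byte_lane], 0x44); // LSB of I32 lane 0, as wasmtime expects
}
```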
However, in implementing support for rustc_codegen_cranelift, I noticed that the current implementation actually breaks when handling big-endian vectors in memory - this will be the case for rustc, since rustc uses native platform byte order everywhere. Specifically, when loading a big-endian vector from memory, I’m just loading the bytes unchanged. This means that e.g. lane 0 of an I32X4 in memory ends up in its original byte order (which is OK since this is already big-endian) in lane 0 of the register - but the latter is a problem if subsequent code wants to extract that lane, since an extractlane 0 will actually access lane 3 in the register, as described above!
To work around that problem, I’ve implemented a patch that will perform big-endian vector loads by swapping the order of the elements, but not the byte order within any single element. This will cause lane 0 from memory to end up in lane 3 in the register, and makes extractlane work as expected again.
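The element-reversing load for big-endian memory can be sketched the same way (again a plain-Rust model rather than the actual s390x code):

```rust
// Model of a big-endian I32X4 load under the current lane numbering:
// each element keeps its (big-endian) byte order, but memory slot i
// ends up in physical register slot 3 - i.
fn load_be_i32x4(mem: &[u8; 16]) -> [u32; 4] {
    let mut reg = [0u32; 4];
    for i in 0..4 {
        let elem = u32::from_be_bytes(mem[4 * i..4 * i + 4].try_into().unwrap());
        reg[3 - i] = elem;
    }
    reg
}
```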
With that fix, now both wasmtime and rustc_codegen_cranelift pass all their existing SIMD tests. Unfortunately, I think this is still not quite a complete solution.
Specifically, we can now run into two additional problems with big-endian code, which apparently are just never triggered by the existing rustc_codegen_cranelift tests.
First, I guess it would be possible to re-interpret the contents of a vector register as another type even in rustc. Now, as opposed to wasmtime, rustc uses native endianness, presumably also w.r.t. vector contents. Therefore, the semantics of such a re-interpretation would be platform-defined and differ between big- and little-endian platforms (which is probably why it’s not frequently used). However, users would expect this platform-specific behavior to be the same between the LLVM and cranelift back ends to rustc - which in the current implementation it would not be.
Even more problematic, carrying vector elements in reverse order in vector registers actually affects the ABI, as vector types are passed in vector registers. Code compiled by rustc using the LLVM back end would expect those to be in the “normal” platform order, while code compiled by rustc using the cranelift back end would expect them to be in the “reverse” order.
One option I’m thinking of would be to actually implement both methods in the cranelift back end. Specifically, the back end could support both a “vector big-endian” and “vector little-endian” mode, where the “big-endian” mode would use lane numbers directly as defined by our ISA, while the “little-endian” mode would use the reverse ordering implemented by the current back end code.
There’s a slight complication in that we might have to support both big- and little-endian vector modes in load and store operations accessing any combination of big- and little-endian memory locations. But that should be possible:
| vector mode | memory byte order | load/store operation |
|---|---|---|
| big-endian | big-endian | direct |
| big-endian | little-endian | byte-reverse each element |
| little-endian | big-endian | reverse order of elements |
| little-endian | little-endian | byte-reverse 16-byte input |
(Starting with z15 we actually can implement each of these operations using a single instruction, so that should also be efficient.)
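For reference, the four table rows can be modeled over a 16-byte register image as follows (plain Rust, with illustrative names; the register image is written in ISA, i.e. big-endian, byte layout):

```rust
#[derive(Clone, Copy, PartialEq)]
enum ByteOrder {
    Big,
    Little,
}

/// Model of loading an I32X4 from `mem` (stored with byte order
/// `memory_order`) into a register using the given vector lane mode.
fn load_i32x4(mem: &[u8; 16], memory_order: ByteOrder, vector_mode: ByteOrder) -> [u8; 16] {
    let mut reg = [0u8; 16];
    match (vector_mode, memory_order) {
        // big-endian mode, big-endian memory: direct copy
        (ByteOrder::Big, ByteOrder::Big) => reg.copy_from_slice(mem),
        // big-endian mode, little-endian memory: byte-reverse each element
        (ByteOrder::Big, ByteOrder::Little) => {
            for i in 0..4 {
                for j in 0..4 {
                    reg[4 * i + j] = mem[4 * i + (3 - j)];
                }
            }
        }
        // little-endian mode, big-endian memory: reverse order of elements
        (ByteOrder::Little, ByteOrder::Big) => {
            for i in 0..4 {
                reg[4 * i..4 * i + 4].copy_from_slice(&mem[4 * (3 - i)..4 * (3 - i) + 4]);
            }
        }
        // little-endian mode, little-endian memory: byte-reverse the 16-byte input
        (ByteOrder::Little, ByteOrder::Little) => {
            for k in 0..16 {
                reg[k] = mem[15 - k];
            }
        }
    }
    reg
}
```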
The one remaining question is how to select between the “vector big-endian” and “vector little-endian” modes. There are no attributes (like MemFlags) on the extractlane etc. operations, and that wouldn’t even make sense: this is a global property; if you used big-endian mode to load the vector, you must also use big-endian mode in all subsequent operations.
So I guess this would have to be some sort of global flag, which wasmtime would set to always little-endian, and rustc would leave at native byte order. Of course, this flag is actually ABI changing, so the same setting must be used for code to remain link-compatible. But I think given the above use cases, that should actually be fine.
FYI @cfallin - this is another problem I ran into while enabling rustc_codegen_cranelift - I’d appreciate any comments or suggestions!
About this issue
- State: closed
- Created 2 years ago
- Comments: 18 (18 by maintainers)
Commits related to this issue
- Add two items to Cranelift agenda for Aug 8. - Discussion of vector endianness (bytecodealliance/wasmtime#4566) - Notification of plans to factor out part of Cranelift's base layer to share with Wi... — committed to cfallin/meetings by cfallin 2 years ago
- s390x: Support both big- and little-endian vector lane order This implements the s390x back-end portion of the solution for https://github.com/bytecodealliance/wasmtime/issues/4566 We now support bo... — committed to uweigand/wasmtime by uweigand 2 years ago
- s390x: Support both big- and little-endian vector lane order (#4682) This implements the s390x back-end portion of the solution for https://github.com/bytecodealliance/wasmtime/issues/4566 We now... — committed to bytecodealliance/wasmtime by uweigand 2 years ago
- Merge raw_bitcast and bitcast - Allow bitcast for vectors with differing lane widths - Remove raw_bitcast IR instruction - Change all users of raw_bitcast to bitcast - Implement support for no-op bit... — committed to uweigand/wasmtime by uweigand 2 years ago
- Merge raw_bitcast and bitcast (#5175) - Allow bitcast for vectors with differing lane widths - Remove raw_bitcast IR instruction - Change all users of raw_bitcast to bitcast - Implement support fo... — committed to bytecodealliance/wasmtime by uweigand 2 years ago
- Support big- and little-endian lane order with bitcast Add a MemFlags operand to the bitcast instruction, where only the `big` and `little` flags are accepted. These define the lane order to be used... — committed to uweigand/wasmtime by uweigand 2 years ago
Weighing in a bit late here, sorry (and thank you @uweigand for the detailed analysis as always!) – I think I’m converging toward the following thoughts:
Indeed, booleans are best addressed by removing them altogether. The `BNxM` vector-bool types and `Bn` wide-bool types for `n > 1` never sat well with me: the intent is clearly to have some sort of mask value, but these types are n-bit containers where all but two of the `2^n` possible values are illegal. As we explored in #3205 (an issue which just celebrated its first birthday – clearly this stuff is not easy), that implies either some sort of validation – on load? on conversion into a wide bool type? – or undefined behavior in some cases.

The bitmask-like cases for wide bools can IMHO be satisfied by integer-typed SIMD values, and the cases where we need to communicate a value from an `icmp` to a “bool” consumer, like `select`, can pattern-match the `(select (icmp ...))` as they do today and work from a reified zero/nonzero value otherwise (any nonzero -> true).

I also suspect that our support for bool types is patchy at best – there are lowerings in some places tested by tests, but bools have never been as well-exercised as ints. Simplifying our set of types gets us closer to full coverage in one big step.
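As a rough illustration of the reified-mask approach, assuming the compare produces an all-ones/all-zeros integer lane (plain Rust, not CLIF):

```rust
// Per-lane compare producing an integer mask lane (all ones or all zeros).
fn icmp_eq_lane(a: u32, b: u32) -> u32 {
    if a == b { u32::MAX } else { 0 }
}

// Select driven purely by the integer mask value; no separate bool type needed.
fn select_lane(mask: u32, if_true: u32, if_false: u32) -> u32 {
    (if_true & mask) | (if_false & !mask)
}
```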
If we remove boolean types altogether, then we no longer need the distinction between the two bitcasts. At that point I favor calling it `bitcast`; the `raw_` is somewhat confusing (isn’t a bit-level pun operation at some level “raw” no matter what?).

Adding a `MemFlags` to the `bitcast` instruction is a pretty novel idea, and almost follows from the “store-then-load” equivalent sequence, I think, except that the underlying store and load could have different flags. Similarly, some of the flags wouldn’t matter (e.g., `notrap` and `aligned` – the operation would be defined in terms of a hypothetical memory location that is always valid and aligned), and I generally prefer not to have representable variants that don’t have semantic meaning. Instead, could we make endianness part of the cast? This I think subsumes @uweigand’s proposed ops above, with the LE / BE variants. The endianness is irrelevant for scalar casts in practice (and let’s say by definition if we want: I don’t think there are machines that use one endianness for `I64` and another for `F64`?) but matters for vector lane order (see rest of this issue!).

If this seems reasonable, then I think the order of operations is (i) remove bools, (ii) remove `raw_bitcast` and move vector use-cases over to `bitcast`, merging the lowerings for the different cases in the backends, then (iii) add the notion of endianness to `bitcast`.

Thoughts?
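One way to read an endianness-qualified vector bitcast is through that “store-then-load” picture; a minimal plain-Rust model, assuming the flag selects the byte order of the hypothetical memory location (names are illustrative only):

```rust
/// Model of an endianness-qualified bitcast from I32X4 to 16 byte lanes:
/// the result is what a load of byte lanes would observe if the vector had
/// been stored to memory with the given byte order and then re-loaded.
fn bitcast_i32x4_to_i8x16(v: [u32; 4], little: bool) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, lane) in v.iter().enumerate() {
        let bytes = if little { lane.to_le_bytes() } else { lane.to_be_bytes() };
        out[4 * i..4 * i + 4].copy_from_slice(&bytes);
    }
    out
}
```

With `little = true`, byte lane 0 of the result is the least-significant byte of I32 lane 0 (the wasmtime expectation); with `little = false`, it is the most-significant byte.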
Hi @cfallin, for reference here’s the set of charts I shared in today’s meetings: Vector-Endian.pdf
Thanks for looking at this, @cfallin! Happy to have a discussion in the meeting - I’d be available next Monday (Aug 8).