stdarch: Performance regressions porting Jetscii from inline assembly to intrinsics
I ported Jetscii to use stdsimd in the belief that it will be stabilized sooner.
There's a stdsimd branch in case you are interested in following along at home.
The initial port is roughly 60% of the original speed:
| name | inline-asm ns/iter | intrinsics ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| bench::space_asciichars | 1,023,795 (5121 MB/s) | 1,643,905 (3189 MB/s) | 620,110 | 60.57% | x 0.62 |
| bench::space_asciichars_as_pattern | 1,044,517 (5019 MB/s) | 1,716,374 (3054 MB/s) | 671,857 | 64.32% | x 0.61 |
| bench::space_asciichars_macro | 993,105 (5279 MB/s) | 1,658,466 (3161 MB/s) | 665,361 | 67.00% | x 0.60 |
| bench::space_find_byte | 3,610,758 (1452 MB/s) | 3,526,808 (1486 MB/s) | -83,950 | -2.32% | x 1.02 |
| bench::space_find_char | 633,608 (8274 MB/s) | 636,607 (8235 MB/s) | 2,999 | 0.47% | x 1.00 |
| bench::space_find_char_set | 10,600,525 (494 MB/s) | 10,561,106 (496 MB/s) | -39,419 | -0.37% | x 1.00 |
| bench::space_find_closure | 10,156,759 (516 MB/s) | 10,072,882 (520 MB/s) | -83,877 | -0.83% | x 1.01 |
| bench::space_find_string | 7,506,830 (698 MB/s) | 7,507,111 (698 MB/s) | 281 | 0.00% | x 1.00 |
| bench::substring_as_pattern | 1,082,652 (4842 MB/s) | 1,496,699 (3502 MB/s) | 414,047 | 38.24% | x 0.72 |
| bench::substring_find | 1,670,638 (3138 MB/s) | 1,687,034 (3107 MB/s) | 16,396 | 0.98% | x 0.99 |
| bench::substring_with_cached_searcher | 997,570 (5255 MB/s) | 1,520,424 (3448 MB/s) | 522,854 | 52.41% | x 0.66 |
| bench::substring_with_created_searcher | 1,007,291 (5204 MB/s) | 1,533,745 (3418 MB/s) | 526,454 | 52.26% | x 0.66 |
| bench::xml_delim_3_asciichars | 1,014,110 (5169 MB/s) | 1,637,181 (3202 MB/s) | 623,071 | 61.44% | x 0.62 |
| bench::xml_delim_3_asciichars_as_pattern | 984,594 (5324 MB/s) | 1,628,740 (3218 MB/s) | 644,146 | 65.42% | x 0.60 |
| bench::xml_delim_3_asciichars_macro | 1,023,173 (5124 MB/s) | 1,623,991 (3228 MB/s) | 600,818 | 58.72% | x 0.63 |
| bench::xml_delim_3_find_byte_closure | 2,237,287 (2343 MB/s) | 2,211,426 (2370 MB/s) | -25,861 | -1.16% | x 1.01 |
| bench::xml_delim_3_find_char_closure | 14,359,362 (365 MB/s) | 14,204,971 (369 MB/s) | -154,391 | -1.08% | x 1.01 |
| bench::xml_delim_3_find_char_set | 17,588,694 (298 MB/s) | 17,769,736 (295 MB/s) | 181,042 | 1.03% | x 0.99 |
| bench::xml_delim_5_asciichars | 1,032,586 (5077 MB/s) | 1,790,343 (2928 MB/s) | 757,757 | 73.38% | x 0.58 |
| bench::xml_delim_5_asciichars_as_pattern | 1,034,084 (5070 MB/s) | 1,612,350 (3251 MB/s) | 578,266 | 55.92% | x 0.64 |
| bench::xml_delim_5_asciichars_macro | 986,644 (5313 MB/s) | 1,666,725 (3145 MB/s) | 680,081 | 68.93% | x 0.59 |
| bench::xml_delim_5_find_byte_closure | 2,257,573 (2322 MB/s) | 2,408,606 (2176 MB/s) | 151,033 | 6.69% | x 0.94 |
| bench::xml_delim_5_find_char_closure | 8,009,474 (654 MB/s) | 7,453,402 (703 MB/s) | -556,072 | -6.94% | x 1.07 |
| bench::xml_delim_5_find_char_set | 23,184,513 (226 MB/s) | 23,272,996 (225 MB/s) | 88,483 | 0.38% | x 1.00 |
Takeaways
- Make sure to use `#[target_feature]` (and/or `-C target-feature`)?
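To make that takeaway concrete, here is a minimal sketch of the pattern, written against today's stabilized `std::arch` and `is_x86_feature_detected!` API (the issue predates stabilization, when this lived in the `stdsimd` crate). The function names and bodies are my own illustration, not Jetscii's code:

```rust
// Hypothetical illustration: an intrinsic-using function tagged with
// `#[target_feature]` so that LLVM compiles it (and anything inlined
// into it) with SSSE3 enabled.
#[cfg(target_arch = "x86_64")]
mod sketch {
    use std::arch::x86_64::*;

    // Calling this is only sound when the CPU actually supports SSSE3,
    // hence the `unsafe fn`.
    #[target_feature(enable = "ssse3")]
    pub unsafe fn bytes_eq16(a: &[u8; 16], b: &[u8; 16]) -> bool {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        // Compare all 16 lanes at once and collapse to a 16-bit mask.
        let eq = _mm_cmpeq_epi8(va, vb);
        _mm_movemask_epi8(eq) == 0xFFFF
    }

    // Safe wrapper: runtime detection guards the unsafe call.
    pub fn bytes_eq16_safe(a: &[u8; 16], b: &[u8; 16]) -> Option<bool> {
        if is_x86_feature_detected!("ssse3") {
            Some(unsafe { bytes_eq16(a, b) })
        } else {
            None
        }
    }
}
```

The compile-time alternative, `-C target-feature=+ssse3`, lets the compiler assume SSSE3 everywhere, at the cost of producing a binary that cannot run on CPUs without it.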
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 30 (30 by maintainers)
To be clear, I'm not complaining about any extra code I have to write. Please note that Jetscii was written 3 years ago using inline assembly, and the entire reason that I wrote the Cupid crate was to work towards having runtime detection; I just… never got around to it. I am only attempting to provide feedback based on porting the algorithms expressed within to stdsimd.
@BurntSushi
This seemed very worrisome to me at first (bad LLVM!) but looking more into this I think that's ok. I'd guess that what's happening here is that LLVM is inlining methods without `ssse3` into the `ssse3` method very aggressively. At that point it calls a bunch of the intrinsics tagged with `ssse3` and it's safely inlined. I suspect that if you didn't tag the top method with `ssse3` (or if LLVM didn't inline in a loop) then none of the intrinsic calls would get inlined.

This actually seems like it could be a really cool trick one day… Somehow you could probably assert to the compiler that the `enable` is safe, and that `SSSE3VectorBuilder` alone is proof that the `unsafe` on the function isn't needed. That way, with a trick like this, you could rely on LLVM's inliner and only rarely use `#[target_feature(enable = ...)]`.
@gnzlbg
I'd probably shy away from it for now, but I could see it being a possibility!
In a magical world, I'd love it if `#[cfg(target_feature = "X")]` and `if is_*_feature_detected!("X")` automatically made calling a `target_feature` function safe. `unsafe` could be "The Big Hammer" when that's not enough.

Oh yes, indeed! I experimented with that when I was writing the code to check my understanding, and that was indeed the case. Performance regressed and none of the intrinsics were inlined (as expected).
@gnzlbg I basically agree with you about intent. I just happen to come down on the side of "I'd rather obscure intent a little in favor of isolating `unsafe`." And yes, I would definitely appreciate a compiler error if a function failed to inline.

Basically all these methods are inside the same crate and LLVM is just doing inlining as usual. Inlining functions into other functions that extend their target feature set is ok, so inlining non-`ssse3` functions into an `ssse3` function is perfectly fine.
Problems can only arise when LLVM does not inline an intermediary function that does not have the `ssse3` target feature attribute. In that case, `ssse3` functions won't be inlineable into it either.
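The hazard described above can be sketched like this (function names and bodies are my own illustration, not code from any of the crates discussed). If LLVM kept `middle` as a real call, the `ssse3`-tagged `inner` could not be inlined into it and every intrinsic would become a function call; here `#[inline(always)]` nudges LLVM to collapse the intermediary into its `ssse3` caller instead:

```rust
#[cfg(target_arch = "x86_64")]
mod inlining_sketch {
    use std::arch::x86_64::*;

    #[target_feature(enable = "ssse3")]
    unsafe fn inner(v: __m128i) -> i32 {
        // Collapse the 16 byte lanes into a 16-bit mask.
        _mm_movemask_epi8(v)
    }

    // Intermediary with NO `target_feature` attribute: the situation
    // the comment above warns about. The inline hint keeps it from
    // becoming an inlining barrier.
    #[inline(always)]
    unsafe fn middle(v: __m128i) -> i32 {
        inner(v)
    }

    #[target_feature(enable = "ssse3")]
    unsafe fn top() -> i32 {
        // All-ones vector, so every lane sets its mask bit.
        middle(_mm_set1_epi8(-1))
    }

    pub fn run() -> Option<i32> {
        if is_x86_feature_detected!("ssse3") {
            Some(unsafe { top() })
        } else {
            None
        }
    }
}
```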
I personally think that code should express intent. In this case, your intent is clearly for all that code to be compiled with `ssse3` enabled, independently of how good or bad LLVM is at inlining. The way to express that in Rust is to mark those functions with the `#[target_feature]` attribute, therefore I think that those functions should have it. This requires you to make these functions `unsafe fn`, which is unnecessary in this case, because it is also your intent to only call most of those functions from other `ssse3` functions, which is safe.

Currently there is no way to express this intent in Rust, but I think this is a major ergonomic issue with the current `std::arch` intrinsics that's worth solving. Also, you probably would prefer a compiler error here instead of a performance bug because inlining somewhere failed.

The key to how I do things is this type.
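The code block from the original comment did not survive; what follows is a hedged reconstruction of the *shape* of the type being described (a zero-sized "witness" whose existence proves SSSE3 support), using the modern `is_x86_feature_detected!` API. The type name mirrors the comment; the body is an illustrative guess, not the actual source:

```rust
#[cfg(target_arch = "x86_64")]
mod witness_sketch {
    #[derive(Clone, Copy, Debug)]
    pub struct SSSE3VectorBuilder(()); // private field: no outside construction

    impl SSSE3VectorBuilder {
        // The sole constructor: returns `Some` only when the running CPU
        // supports SSSE3, so merely holding a value is proof of support.
        pub fn new() -> Option<SSSE3VectorBuilder> {
            if is_x86_feature_detected!("ssse3") {
                Some(SSSE3VectorBuilder(()))
            } else {
                None
            }
        }
    }
}
```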
The only way for a consumer outside of this module to get a value with type `SSSE3VectorBuilder` is to call its constructor. And in turn, the only way for this constructor to return a non-`None` value is if the `ssse3` target feature is enabled. What this means is that this type acts as a "receipt" of sorts: if you have it in hand, then you know for a fact that the right CPU target features are enabled.

In that same module, I defined my own vector type (using a macro so that things work on Rust 1.12).
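That code block is also missing from the scrape. As a hedged sketch, the vector type might look like the following, with the macro expanded by hand for one instantiation; the only public way to create a `u8x16` goes through the builder, preserving the "receipt" property (again, illustrative names and bodies, not the actual source):

```rust
#[cfg(target_arch = "x86_64")]
mod vector_sketch {
    use std::arch::x86_64::*;

    #[derive(Clone, Copy)]
    pub struct SSSE3VectorBuilder(());

    impl SSSE3VectorBuilder {
        pub fn new() -> Option<SSSE3VectorBuilder> {
            if is_x86_feature_detected!("ssse3") {
                Some(SSSE3VectorBuilder(()))
            } else {
                None
            }
        }

        // Constructing a `u8x16` requires a builder, i.e. proof of SSSE3.
        pub fn splat(self, byte: u8) -> u8x16 {
            // Sound because `self` proves the CPU supports SSSE3.
            unsafe { u8x16(_mm_set1_epi8(byte as i8)) }
        }
    }

    #[allow(non_camel_case_types)]
    #[derive(Clone, Copy)]
    pub struct u8x16(__m128i);

    impl u8x16 {
        // A safe accessor: a `u8x16` can only exist post-detection.
        pub fn bytes(self) -> [u8; 16] {
            let mut out = [0u8; 16];
            unsafe { _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, self.0) };
            out
        }
    }
}
```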
In particular, the only way for a consumer to get a `u8x16` is by first constructing a `SSSE3VectorBuilder`. So the implication above holds: if you have a `u8x16` from this module, then you know that SSSE3 is enabled. This in turn lets me define safe methods on `u8x16` for use in higher level code.

This might seem like a lot of ceremony, but these vectors are used in a fairly complex SIMD algorithm called Teddy, and this was the only way I could figure out how to "isolate" the `unsafe`. The key here is to realize that your only responsibility in determining whether an intrinsic is safe to call or not is whether the underlying CPU is capable of executing it or not. Therefore, we use the type system to represent this state. The alternative here would be that the entirety of Teddy would be `unsafe`. Despite @gnzlbg's insistence that the amount of `unsafe` doesn't matter much, I actually think it does, quite a bit, and I really wanted to minimize the amount of `unsafe`.

I don't mean to suggest you should adopt this approach for jetscii unless you feel like it will work well for you, but rather, to plant the seed of using the type system in some way to control `unsafe`.

One potential downside to all of this is that it can be easy to introduce performance bugs, and to some extent, it's not clear whether I'm relying on LLVM bugs to get performance correct here. In particular, I only use `#[target_feature(enable = "ssse3")]` on a single top-level method in the Teddy implementation. All of the intrinsics are defined using that as well, but I have a bunch of intermediate methods that don't have a `target_feature` attribute. Even so, LLVM seems happy to inline all of them, even when using AVX vectors.

(To be clear, compile time flags aren't really an option to me. IMO, people should be focusing on using runtime detection as much as possible, to make their optimizations apply with the lowest possible friction.)
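To make the "safe methods on `u8x16`" idea concrete, here is one more hedged sketch (my own names and bodies, not the actual Teddy code): the method wraps intrinsics in `unsafe` internally but presents a safe signature, which is sound because a `u8x16` can only exist when the required feature was verified at construction.

```rust
#[cfg(target_arch = "x86_64")]
mod safe_methods_sketch {
    use std::arch::x86_64::*;

    #[allow(non_camel_case_types)]
    #[derive(Clone, Copy)]
    pub struct u8x16(__m128i);

    impl u8x16 {
        // In the full pattern this would come from the builder type;
        // for a self-contained sketch, guard it with runtime detection.
        pub fn splat(byte: u8) -> Option<u8x16> {
            if is_x86_feature_detected!("ssse3") {
                Some(unsafe { u8x16(_mm_set1_epi8(byte as i8)) })
            } else {
                None
            }
        }

        // Safe method: compare lanes for equality and return the 16-bit
        // mask of matching byte positions.
        pub fn eq_mask(self, other: u8x16) -> u32 {
            unsafe { _mm_movemask_epi8(_mm_cmpeq_epi8(self.0, other.0)) as u32 }
        }
    }
}
```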