stdarch: Performance regressions porting Jetscii from inline assembly to intrinsics
I ported Jetscii to use stdsimd in the belief that it will be stabilized sooner.
There's a stdsimd branch in case you are interested in following along at home.
The initial port is roughly 60% of the original speed:
| name | inline-asm ns/iter | intrinsics ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| bench::space_asciichars | 1,023,795 (5121 MB/s) | 1,643,905 (3189 MB/s) | 620,110 | 60.57% | x 0.62 |
| bench::space_asciichars_as_pattern | 1,044,517 (5019 MB/s) | 1,716,374 (3054 MB/s) | 671,857 | 64.32% | x 0.61 |
| bench::space_asciichars_macro | 993,105 (5279 MB/s) | 1,658,466 (3161 MB/s) | 665,361 | 67.00% | x 0.60 |
| bench::space_find_byte | 3,610,758 (1452 MB/s) | 3,526,808 (1486 MB/s) | -83,950 | -2.32% | x 1.02 |
| bench::space_find_char | 633,608 (8274 MB/s) | 636,607 (8235 MB/s) | 2,999 | 0.47% | x 1.00 |
| bench::space_find_char_set | 10,600,525 (494 MB/s) | 10,561,106 (496 MB/s) | -39,419 | -0.37% | x 1.00 |
| bench::space_find_closure | 10,156,759 (516 MB/s) | 10,072,882 (520 MB/s) | -83,877 | -0.83% | x 1.01 |
| bench::space_find_string | 7,506,830 (698 MB/s) | 7,507,111 (698 MB/s) | 281 | 0.00% | x 1.00 |
| bench::substring_as_pattern | 1,082,652 (4842 MB/s) | 1,496,699 (3502 MB/s) | 414,047 | 38.24% | x 0.72 |
| bench::substring_find | 1,670,638 (3138 MB/s) | 1,687,034 (3107 MB/s) | 16,396 | 0.98% | x 0.99 |
| bench::substring_with_cached_searcher | 997,570 (5255 MB/s) | 1,520,424 (3448 MB/s) | 522,854 | 52.41% | x 0.66 |
| bench::substring_with_created_searcher | 1,007,291 (5204 MB/s) | 1,533,745 (3418 MB/s) | 526,454 | 52.26% | x 0.66 |
| bench::xml_delim_3_asciichars | 1,014,110 (5169 MB/s) | 1,637,181 (3202 MB/s) | 623,071 | 61.44% | x 0.62 |
| bench::xml_delim_3_asciichars_as_pattern | 984,594 (5324 MB/s) | 1,628,740 (3218 MB/s) | 644,146 | 65.42% | x 0.60 |
| bench::xml_delim_3_asciichars_macro | 1,023,173 (5124 MB/s) | 1,623,991 (3228 MB/s) | 600,818 | 58.72% | x 0.63 |
| bench::xml_delim_3_find_byte_closure | 2,237,287 (2343 MB/s) | 2,211,426 (2370 MB/s) | -25,861 | -1.16% | x 1.01 |
| bench::xml_delim_3_find_char_closure | 14,359,362 (365 MB/s) | 14,204,971 (369 MB/s) | -154,391 | -1.08% | x 1.01 |
| bench::xml_delim_3_find_char_set | 17,588,694 (298 MB/s) | 17,769,736 (295 MB/s) | 181,042 | 1.03% | x 0.99 |
| bench::xml_delim_5_asciichars | 1,032,586 (5077 MB/s) | 1,790,343 (2928 MB/s) | 757,757 | 73.38% | x 0.58 |
| bench::xml_delim_5_asciichars_as_pattern | 1,034,084 (5070 MB/s) | 1,612,350 (3251 MB/s) | 578,266 | 55.92% | x 0.64 |
| bench::xml_delim_5_asciichars_macro | 986,644 (5313 MB/s) | 1,666,725 (3145 MB/s) | 680,081 | 68.93% | x 0.59 |
| bench::xml_delim_5_find_byte_closure | 2,257,573 (2322 MB/s) | 2,408,606 (2176 MB/s) | 151,033 | 6.69% | x 0.94 |
| bench::xml_delim_5_find_char_closure | 8,009,474 (654 MB/s) | 7,453,402 (703 MB/s) | -556,072 | -6.94% | x 1.07 |
| bench::xml_delim_5_find_char_set | 23,184,513 (226 MB/s) | 23,272,996 (225 MB/s) | 88,483 | 0.38% | x 1.00 |
Takeaways
- Make sure to use `#[target_feature]` (and/or `-C target-feature`)?
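To make that takeaway concrete, here is a minimal sketch of the pattern, written against today's stabilized `std::arch` and `is_x86_feature_detected!` API (the issue predates stabilization, when this lived in the `stdsimd` crate). The function names and bodies are my own illustration, not Jetscii's code:

```rust
// Hypothetical illustration: an intrinsic-using function tagged with
// `#[target_feature]` so that LLVM compiles it (and anything inlined
// into it) with SSSE3 enabled.
#[cfg(target_arch = "x86_64")]
mod sketch {
    use std::arch::x86_64::*;

    // Calling this is only sound when the CPU actually supports SSSE3,
    // hence the `unsafe fn`.
    #[target_feature(enable = "ssse3")]
    pub unsafe fn bytes_eq16(a: &[u8; 16], b: &[u8; 16]) -> bool {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        // Compare all 16 lanes at once and collapse to a 16-bit mask.
        let eq = _mm_cmpeq_epi8(va, vb);
        _mm_movemask_epi8(eq) == 0xFFFF
    }

    // Safe wrapper: runtime detection guards the unsafe call.
    pub fn bytes_eq16_safe(a: &[u8; 16], b: &[u8; 16]) -> Option<bool> {
        if is_x86_feature_detected!("ssse3") {
            Some(unsafe { bytes_eq16(a, b) })
        } else {
            None
        }
    }
}
```

The compile-time alternative, `-C target-feature=+ssse3`, lets the compiler assume SSSE3 everywhere, at the cost of producing a binary that cannot run on CPUs without it.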
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 30 (30 by maintainers)
To be clear, I'm not complaining about any extra code I have to write. Please note that Jetscii was written 3 years ago using inline assembly, and the entire reason that I wrote the Cupid crate was to work towards having runtime detection; I just… never got around to it. I am only attempting to provide feedback based on porting the algorithms expressed within to stdsimd.
@BurntSushi
This seemed very worrisome to me at first (bad LLVM!) but looking more into this I think that's ok. I'd guess that what's happening here is that LLVM is inlining methods without `ssse3` into the `ssse3` method very aggressively. At that point it calls a bunch of the intrinsics tagged with `ssse3` and it's safely inlined. I suspect that if you didn't tag the top method with `ssse3` (or if LLVM didn't inline in a loop) then none of the intrinsic calls would get inlined.

This actually seems like it could be a really cool trick one day… Somehow you could probably assert to the compiler that the `enable` is safe, and that `SSSE3VectorBuilder` alone is proof that the `unsafe` on the function isn't needed. That way, with a trick like this, you could rely on LLVM's inliner and only rarely use `#[target_feature(enable = ...)]`.
@gnzlbg
I'd probably shy away from it for now, but I could see it being a possibility!
In a magical world, I'd love it if `#[cfg(target_feature = "X")]` and `if is_*_feature_detected!("X")` automatically made calling a `target_feature` function safe. `unsafe` could be "The Big Hammer" when that's not enough.

Oh yes, indeed! I experimented with that when I was writing the code to check my understanding, and that was indeed the case. Performance regressed and none of the intrinsics were inlined (as expected).
@gnzlbg I basically agree with you about intent. I just happen to come down on the side of "I'd rather obscure intent a little in favor of isolating `unsafe`." And yes, I would definitely appreciate a compiler error if a function failed to inline.

Basically all these methods are inside the same crate and LLVM is just doing inlining as usual. Inlining functions into other functions that extend their target feature set is ok, so inlining non-`ssse3` functions into an `ssse3` function is perfectly fine.
Problems can only arise when LLVM does not inline an intermediary function that does not have the `ssse3` target feature attribute. In that case, `ssse3` functions won't be inlineable into it either.
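The hazard described above can be sketched like this (function names and bodies are my own illustration, not code from any of the crates discussed). If LLVM kept `middle` as a real call, the `ssse3`-tagged `inner` could not be inlined into it and every intrinsic would become a function call; here `#[inline(always)]` nudges LLVM to collapse the intermediary into its `ssse3` caller instead:

```rust
#[cfg(target_arch = "x86_64")]
mod inlining_sketch {
    use std::arch::x86_64::*;

    #[target_feature(enable = "ssse3")]
    unsafe fn inner(v: __m128i) -> i32 {
        // Collapse the 16 byte lanes into a 16-bit mask.
        _mm_movemask_epi8(v)
    }

    // Intermediary with NO `target_feature` attribute: the situation
    // the comment above warns about. The inline hint keeps it from
    // becoming an inlining barrier.
    #[inline(always)]
    unsafe fn middle(v: __m128i) -> i32 {
        inner(v)
    }

    #[target_feature(enable = "ssse3")]
    unsafe fn top() -> i32 {
        // All-ones vector, so every lane sets its mask bit.
        middle(_mm_set1_epi8(-1))
    }

    pub fn run() -> Option<i32> {
        if is_x86_feature_detected!("ssse3") {
            Some(unsafe { top() })
        } else {
            None
        }
    }
}
```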
I personally think that code should express intent. In this case, your intent is clearly for all that code to be compiled with `ssse3` enabled, independently of how good or bad LLVM is at inlining. The way to express that in Rust is to mark those functions with the `#[target_feature]` attribute, therefore I think that those functions should have it. This requires you to make these functions `unsafe fn`, which is unnecessary in this case, because it is also your intent to only call most of those functions from other `ssse3` functions, which is safe.

Currently there is no way to express this intent in Rust, but I think this is a major ergonomic issue with the current `std::arch` intrinsics that's worth solving. Also, you probably would prefer a compiler error here instead of a performance bug because inlining somewhere failed.

The key to how I do things is this type.
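The code block from the original comment did not survive; what follows is a hedged reconstruction of the *shape* of the type being described (a zero-sized "witness" whose existence proves SSSE3 support), using the modern `is_x86_feature_detected!` API. The type name mirrors the comment; the body is an illustrative guess, not the actual source:

```rust
#[cfg(target_arch = "x86_64")]
mod witness_sketch {
    #[derive(Clone, Copy, Debug)]
    pub struct SSSE3VectorBuilder(()); // private field: no outside construction

    impl SSSE3VectorBuilder {
        // The sole constructor: returns `Some` only when the running CPU
        // supports SSSE3, so merely holding a value is proof of support.
        pub fn new() -> Option<SSSE3VectorBuilder> {
            if is_x86_feature_detected!("ssse3") {
                Some(SSSE3VectorBuilder(()))
            } else {
                None
            }
        }
    }
}
```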
The only way for a consumer outside of this module to get a value with type `SSSE3VectorBuilder` is to call its constructor. And in turn, the only way for this constructor to return a non-`None` value is if the `ssse3` target feature is enabled. What this means is that this type acts as a "receipt" of sorts: if you have it in hand, then you know for a fact that the right CPU target features are enabled.

In that same module, I defined my own vector type (using a macro so that things work on Rust 1.12).
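That code block is also missing from the scrape. As a hedged sketch, the vector type might look like the following, with the macro expanded by hand for one instantiation; the only public way to create a `u8x16` goes through the builder, preserving the "receipt" property (again, illustrative names and bodies, not the actual source):

```rust
#[cfg(target_arch = "x86_64")]
mod vector_sketch {
    use std::arch::x86_64::*;

    #[derive(Clone, Copy)]
    pub struct SSSE3VectorBuilder(());

    impl SSSE3VectorBuilder {
        pub fn new() -> Option<SSSE3VectorBuilder> {
            if is_x86_feature_detected!("ssse3") {
                Some(SSSE3VectorBuilder(()))
            } else {
                None
            }
        }

        // Constructing a `u8x16` requires a builder, i.e. proof of SSSE3.
        pub fn splat(self, byte: u8) -> u8x16 {
            // Sound because `self` proves the CPU supports SSSE3.
            unsafe { u8x16(_mm_set1_epi8(byte as i8)) }
        }
    }

    #[allow(non_camel_case_types)]
    #[derive(Clone, Copy)]
    pub struct u8x16(__m128i);

    impl u8x16 {
        // A safe accessor: a `u8x16` can only exist post-detection.
        pub fn bytes(self) -> [u8; 16] {
            let mut out = [0u8; 16];
            unsafe { _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, self.0) };
            out
        }
    }
}
```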
In particular, the only way for a consumer to get a `u8x16` is by first constructing a `SSSE3VectorBuilder`. So the implication above holds: if you have a `u8x16` from this module, then you know that SSSE3 is enabled. This in turn lets me define safe methods on `u8x16` for use in higher level code.

This might seem like a lot of ceremony, but these vectors are used in a fairly complex SIMD algorithm called Teddy, and this was the only way I could figure out how to "isolate" the `unsafe`. The key here is to realize that your only responsibility in determining whether an intrinsic is safe to call or not is whether the underlying CPU is capable of executing it or not. Therefore, we use the type system to represent this state. The alternative here would be that the entirety of Teddy would be `unsafe`. Despite @gnzlbg's insistence that the amount of `unsafe` doesn't matter much, I actually think it does, quite a bit, and I really wanted to minimize the amount of `unsafe`.

I don't mean to suggest you should adopt this approach for jetscii unless you feel like it will work well for you, but rather, to plant the seed of using the type system in some way to control `unsafe`.

One potential downside to all of this is that it can be easy to introduce performance bugs, and to some extent, it's not clear whether I'm relying on LLVM bugs to get performance correct here. In particular, I only use `#[target_feature(enable = "ssse3")]` on a single top-level method in the Teddy implementation. All of the intrinsics are defined using that as well, but I have a bunch of intermediate methods that don't have a `target_feature` attribute. Even so, LLVM seems happy to inline all of them, even when using AVX vectors.

(To be clear, compile time flags aren't really an option to me. IMO, people should be focusing on using runtime detection as much as possible, to make their optimizations apply with the lowest possible friction.)
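To make the "safe methods on `u8x16`" idea concrete, here is one more hedged sketch (my own names and bodies, not the actual Teddy code): the method wraps intrinsics in `unsafe` internally but presents a safe signature, which is sound because a `u8x16` can only exist when the required feature was verified at construction.

```rust
#[cfg(target_arch = "x86_64")]
mod safe_methods_sketch {
    use std::arch::x86_64::*;

    #[allow(non_camel_case_types)]
    #[derive(Clone, Copy)]
    pub struct u8x16(__m128i);

    impl u8x16 {
        // In the full pattern this would come from the builder type;
        // for a self-contained sketch, guard it with runtime detection.
        pub fn splat(byte: u8) -> Option<u8x16> {
            if is_x86_feature_detected!("ssse3") {
                Some(unsafe { u8x16(_mm_set1_epi8(byte as i8)) })
            } else {
                None
            }
        }

        // Safe method: compare lanes for equality and return the 16-bit
        // mask of matching byte positions.
        pub fn eq_mask(self, other: u8x16) -> u32 {
            unsafe { _mm_movemask_epi8(_mm_cmpeq_epi8(self.0, other.0)) as u32 }
        }
    }
}
```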