memchr: `x86` performance regression `2.5.0` -> `2.6.0`
Unfortunately, I’m not able to provide many details, but when upgrading from 2.5 to 2.6, my library, which parses large files via many find calls, saw a small performance regression. The flamegraph isn’t entirely helpful because the amount of inlining appears to differ and some of the functions have changed. If the core x86 `Finder::find` has not changed, then this could just be down to compilation differences.
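For context, the workload is essentially one prebuilt finder driving many `find` calls over a large buffer. A minimal sketch of that shape (the needle, delimiter handling, and record layout here are illustrative assumptions, not the actual parser):

```rust
use memchr::memmem::Finder;

// Illustrative sketch only: the "\r\n" needle and record layout are
// assumptions, not the actual parser. The point is the shape of the
// workload: one prebuilt Finder, many find() calls walking a big buffer.
fn count_records(buf: &[u8]) -> usize {
    let finder = Finder::new(b"\r\n");
    let (mut pos, mut count) = (0, 0);
    while let Some(i) = finder.find(&buf[pos..]) {
        count += 1;
        pos += i + 2; // step past the two-byte delimiter
    }
    count
}
```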
Speed measurements here are relative to the 2.5 version, which is why the error bar is centered at zero. For scale, the general processing speed is between 300 and 500 megabytes per second. Measurements were taken on a single core at a time (no multi-threading), with three measurements per run.
Here is one example of a perf diff:
```
❯ perf diff 25.data 26.data
           +7.11%  parser        [.] memchr::arch::x86_64::avx2::memchr::One::find_raw_avx2
           +1.27%  parser        [.] memchr::arch::x86_64::memchr::memchr_raw::find_avx2
           +1.18%  parser        [.] memchr::memmem::searcher::searcher_kind_one_byte
           +1.06%  parser        [.] memchr::arch::x86_64::avx2::memchr::One::find_raw
    8.36%  +0.82%  libc-2.26.so  [.] __memcmp_sse4_1
           +0.52%  parser        [.] memchr::arch::x86_64::avx2::packedpair::Finder::find_impl
```
and another for the large gap labeled G above:
```
❯ perf diff 25.data 26.data
           +78.62%  parser        [.] memchr::arch::x86_64::avx2::packedpair::Finder::find_impl
            +2.65%  parser        [.] memchr::arch::x86_64::avx2::memchr::One::find_raw_avx2
   13.27%   -0.29%  libc-2.26.so  [.] __memcpy_ssse3
            +0.11%  parser        [.] memchr::memmem::searcher::searcher_kind_avx2
   78.07%           parser        [.] memchr::memmem::x86::avx::std::Forward::find_impl
    2.83%           parser        [.] memchr::memchr::x86::avx::memchr
    0.24%           parser        [.] memchr::memmem::Finder::find
```
The scenario G is in is searching a huge number of bytes and never finding the needle in the haystack, which is why the performance regression is so much worse for this test case. Note that this also uses a 5-byte `Finder`/needle.
Any advice on tracking this down or recreating it with a benchmark? I know this report is somewhat unhelpful for diagnosing the actual issue, but I felt it better to at least post what information I can so that it’s on the radar.
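For what it’s worth, here is a minimal sketch of a benchmark that should exercise scenario G: a 5-byte needle that never occurs, so every call scans the entire haystack. The haystack size, byte values, and iteration count are all illustrative assumptions, not taken from the original workload.

```rust
use std::time::Instant;

use memchr::memmem::Finder;

fn main() {
    // Scenario G: scan a huge haystack for a 5-byte needle that never
    // occurs, so every call walks the whole buffer without a match.
    // The 1 GiB size and byte values are illustrative assumptions.
    let haystack = vec![b'a'; 1 << 30];
    let finder = Finder::new(b"bbbbb");

    let start = Instant::now();
    let mut matches = 0;
    for _ in 0..10 {
        matches += finder.find(&haystack).is_some() as usize;
    }
    let secs = start.elapsed().as_secs_f64();
    let mbps = (10 * haystack.len()) as f64 / secs / 1_000_000.0;
    println!("matches={matches} throughput={mbps:.0} MB/s");
}
```

Comparing the reported throughput of this loop with the crate pinned to 2.5.0 versus 2.6.0 should show whether the regression reproduces outside the parser.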
System info
EC2 c6a.12xlarge
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
```
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 16 (9 by maintainers)
Commits related to this issue
- arch: simplify and improve is_equal_raw This came from @jhorstmann in #139[1]. It simplifies `is_equal_raw` and also in turn simplifies its codegen. Since this routine gets inlined into others, this ... — committed to BurntSushi/memchr by BurntSushi 6 months ago
The following disgusting patch gets rid of the extra `mov` for me. Perhaps it can inspire a non-disgusting patch. The idea is rather than keeping `{cur, index1, index2}` all live at once, pre-compute `(cur1, cur2) = (cur.add(index1), cur.add(index2))` and then you can forget about the indices.

For anyone who might be willing to help, I’ve created a bit more of a refined reproduction of the issue that shows the codegen problem here: https://github.com/BurntSushi/memchr-2.6-mov-regression
Any help would be appreciated!
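To make the shape of that pre-compute idea concrete, here is a hedged scalar sketch (not the actual patch, and not the real SIMD loop; the function name and byte-comparison body are illustrative): the indices are folded into pointers before the loop, so fewer values stay live in the hot path.

```rust
// Illustrative scalar analogue of the idea above, not the actual patch.
// Before: the loop keeps `cur`, `index1`, and `index2` live together,
// which can force a spill under register pressure. After: fold the
// indices into two pointers up front; the indices are dead before the
// loop begins. Assumes index1 <= index2 <= haystack.len().
unsafe fn pair_hits(haystack: &[u8], index1: usize, index2: usize, b1: u8, b2: u8) -> usize {
    let start = haystack.as_ptr();
    let end = start.add(haystack.len());
    // (cur1, cur2) = (cur.add(index1), cur.add(index2)), computed once.
    let mut cur1 = start.add(index1);
    let mut cur2 = start.add(index2);
    let mut hits = 0;
    while cur2 < end {
        hits += (*cur1 == b1 && *cur2 == b2) as usize;
        cur1 = cur1.add(1);
        cur2 = cur2.add(1);
    }
    hits
}
```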
This seems to be a register pressure issue. For me, LLVM has hoisted this load of `index2`, but spills it to the stack as `0x10(%rsp)` instead of keeping it in a register: https://github.com/BurntSushi/memchr/blob/2.6.4/src/arch/generic/packedpair.rs#L237

Thanks! I’ll take a look later.
For assembly, the best way in my experience is to use `perf` on Linux. It should be pretty clear where the hotspot is in functions like `memchr::arch::x86_64::avx2::packedpair::Finder::find_impl`. Look for the SIMD instructions starting with `v`; that should be where most of the time is being spent. Then compare those with the instructions used in memchr 2.5.0.

But I’ll take a look now that I have a repro. Just not sure when. Thank you!