memchr: `x86` performance regression `2.5.0` -> `2.6.0`

Unfortunately, I’m not able to provide many details, but when upgrading from 2.5 to 2.6, my library, which does large file parsing via many find calls, saw a small performance regression. The flamegraph isn’t entirely helpful because there appear to be different amounts of inlining, and some of the functions have changed. If the core x86 Finder::find has not changed, then this could just be down to compilation differences.

Speed measurements here are relative to the 2.5 version, which is why the error bars are centered at zero. For scale, the general processing speed is between 300 and 500 megabytes per second. Measurements were taken using one core at a time, so no multi-threading, and three measurements were taken for each run.

[Two benchmark plots: processing speed relative to 2.5.0 across test scenarios, with error bars; the largest regression is labeled G.]

Here is one example of a perf diff:

❯ perf diff 25.data 26.data

               +7.11%  parser                [.] memchr::arch::x86_64::avx2::memchr::One::find_raw_avx2
               +1.27%  parser                [.] memchr::arch::x86_64::memchr::memchr_raw::find_avx2
               +1.18%  parser                [.] memchr::memmem::searcher::searcher_kind_one_byte
               +1.06%  parser                [.] memchr::arch::x86_64::avx2::memchr::One::find_raw
     8.36%     +0.82%  libc-2.26.so        [.] __memcmp_sse4_1
               +0.52%  parser                [.] memchr::arch::x86_64::avx2::packedpair::Finder::find_impl

and another for the large gap labeled G above:

❯ perf diff 25.data 26.data
              +78.62%  parser                [.] memchr::arch::x86_64::avx2::packedpair::Finder::find_impl
               +2.65%  parser                [.] memchr::arch::x86_64::avx2::memchr::One::find_raw_avx2
    13.27%     -0.29%  libc-2.26.so        [.] __memcpy_ssse3
               +0.11%  parser                [.] memchr::memmem::searcher::searcher_kind_avx2
    78.07%             parser                [.] memchr::memmem::x86::avx::std::Forward::find_impl
     2.83%             parser                [.] memchr::memchr::x86::avx::memchr
     0.24%             parser                [.] memchr::memmem::Finder::find

The scenario behind G is a search over a huge number of bytes where the needle is never found in the haystack, which is why the performance regression is so much worse for this test case. Note that this test also uses a 5-byte Finder/needle.
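
Something like the following minimal sketch captures that scenario (the specific needle, haystack contents, and sizes are illustrative assumptions, not my actual workload; memchr::memmem::Finder is the crate's public API):

use std::time::Instant;

fn main() {
    // A hypothetical 5-byte needle that never occurs in the haystack.
    let needle = b"zzzzz";
    // 256 MiB of a single repeated byte, so every search scans the whole buffer.
    let haystack = vec![b'a'; 1 << 28];
    let finder = memchr::memmem::Finder::new(needle);

    let start = Instant::now();
    for _ in 0..10 {
        // The needle is never found, mirroring the scenario behind G.
        assert!(finder.find(&haystack).is_none());
    }
    let secs = start.elapsed().as_secs_f64();
    let megabytes = 10.0 * haystack.len() as f64 / 1e6;
    println!("{:.0} MB/s", megabytes / secs);
}

Running this with memchr pinned to 2.5 and then 2.6 in Cargo.toml should show whether the regression reproduces outside my parser.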

Any advice on tracking down or recreating this with a benchmark? I know this report is somewhat unhelpful for diagnosing the actual issue, but I felt it better to at least post what information I can so that it’s on the radar.

System info

EC2 c6a.12xlarge

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

The following disgusting patch gets rid of the extra mov for me. Perhaps it can inspire a non-disgusting patch. The idea is that rather than keeping {cur, index1, index2} all live at once, you pre-compute (cur1, cur2) = (cur.add(index1), cur.add(index2)) and can then forget about the indices. (A plain, simplified sketch of the same idea follows the diff.)

diff --git a/src/arch/generic/packedpair.rs b/src/arch/generic/packedpair.rs
index 8d97cf2..2667cc8 100644
--- a/src/arch/generic/packedpair.rs
+++ b/src/arch/generic/packedpair.rs
@@ -93,7 +93,6 @@ impl<V: Vector> Finder<V> {
         let start = haystack.as_ptr();
         let end = start.add(haystack.len());
         let max = end.sub(self.min_haystack_len);
-        let mut cur = start;
 
         // N.B. I did experiment with unrolling the loop to deal with size(V)
         // bytes at a time and 2*size(V) bytes at a time. The double unroll
@@ -101,14 +100,21 @@ impl<V: Vector> Finder<V> {
         // slower. In the end, I decided the complexity from unrolling wasn't
         // worth it. I used the memmem/krate/prebuilt/huge-en/ benchmarks to
         // compare.
-        while cur <= max {
-            if let Some(chunki) = self.find_in_chunk(needle, cur, end, all) {
-                return Some(matched(start, cur, chunki));
+        let index1 = usize::from(self.pair.index1());
+        let index2 = usize::from(self.pair.index2());
+        let mut cur1 = start.add(index1);
+        let mut cur2 = start.add(index2);
+        let end1 = end.add(index1);
+        let max1 = max.add(index1);
+        while cur1 <= max1 {
+            if let Some(chunki) = self.find_in_chunk(needle, cur1, cur2, end1, all) {
+                return Some(matched(start, cur1.sub(index1), chunki));
             }
-            cur = cur.add(V::BYTES);
+            cur1 = cur1.add(V::BYTES);
+            cur2 = cur2.add(V::BYTES);
         }
-        if cur < end {
-            let remaining = end.distance(cur);
+        if cur1 < end1 {
+            let remaining = end1.distance(cur1);
             debug_assert!(
                 remaining < self.min_haystack_len,
                 "remaining bytes should be smaller than the minimum haystack \
@@ -120,10 +126,10 @@ impl<V: Vector> Finder<V> {
                 return None;
             }
             debug_assert!(
-                max < cur,
+                max1 < cur1,
                 "after main loop, cur should have exceeded max",
             );
-            let overlap = cur.distance(max);
+            let overlap = cur1.distance(max1);
             debug_assert!(
                 overlap > 0,
                 "overlap ({}) must always be non-zero",
@@ -140,10 +146,11 @@ impl<V: Vector> Finder<V> {
             // occur in find_in_chunk within the overlap are automatically
             // ignored.
             let mask = V::Mask::all_zeros_except_least_significant(overlap);
-            cur = max;
-            let m = self.find_in_chunk(needle, cur, end, mask);
+            cur1 = max1;
+            cur2 = max1.sub(index1).add(index2);
+            let m = self.find_in_chunk(needle, cur1, cur2, end1, mask);
             if let Some(chunki) = m {
-                return Some(matched(start, cur, chunki));
+                return Some(matched(start, cur1.sub(index1), chunki));
             }
         }
         None
@@ -229,25 +236,24 @@ impl<V: Vector> Finder<V> {
     unsafe fn find_in_chunk(
         &self,
         needle: &[u8],
-        cur: *const u8,
-        end: *const u8,
+        cur1: *const u8,
+        cur2: *const u8,
+        end1: *const u8,
         mask: V::Mask,
     ) -> Option<usize> {
-        let index1 = usize::from(self.pair.index1());
-        let index2 = usize::from(self.pair.index2());
-        let chunk1 = V::load_unaligned(cur.add(index1));
-        let chunk2 = V::load_unaligned(cur.add(index2));
+        let chunk1 = V::load_unaligned(cur1);
+        let chunk2 = V::load_unaligned(cur2);
         let eq1 = chunk1.cmpeq(self.v1);
         let eq2 = chunk2.cmpeq(self.v2);
 
         let mut offsets = eq1.and(eq2).movemask().and(mask);
         while offsets.has_non_zero() {
             let offset = offsets.first_offset();
-            let cur = cur.add(offset);
-            if end.sub(needle.len()) < cur {
+            let cur1 = cur1.add(offset);
+            if end1.sub(needle.len()) < cur1 {
                 return None;
             }
-            if is_equal_raw(needle.as_ptr(), cur, needle.len()) {
+            if is_equal_raw(needle.as_ptr(), cur1.sub(self.pair.index1() as usize), needle.len()) {
                 return Some(offset);
             }
             offsets = offsets.clear_least_significant_bit();

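In plain (non-diff) form, the core idea is sketched below. This is a simplified, self-contained illustration rather than the real memchr internals: the SIMD chunk test is replaced by a plain byte comparison, and it assumes index1 <= index2 and a haystack longer than index2:

// Sketch: advance two pre-offset cursors in lockstep so that index1 and
// index2 are no longer live (and competing for registers) in the hot loop.
unsafe fn find_pair(
    haystack: &[u8],
    byte1: u8,
    byte2: u8,
    index1: usize,
    index2: usize,
) -> Option<usize> {
    let start = haystack.as_ptr();
    let end = start.add(haystack.len());
    let mut cur1 = start.add(index1);
    let mut cur2 = start.add(index2);
    // Since index1 <= index2, bounding cur2 also keeps cur1 in range.
    while cur2 < end {
        if *cur1 == byte1 && *cur2 == byte2 {
            // Recover the candidate position relative to the haystack start.
            return Some(cur1.sub(index1).offset_from(start) as usize);
        }
        cur1 = cur1.add(1);
        cur2 = cur2.add(1);
    }
    None
}

fn main() {
    // The pair (b'a', b'b') sits at offsets (0, 1) of a hypothetical needle.
    let hay = b"xxabyy";
    assert_eq!(unsafe { find_pair(hay, b'a', b'b', 0, 1) }, Some(2));
}
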
For anyone willing to help, I’ve created a more refined reproduction of the issue that shows the codegen problem here: https://github.com/BurntSushi/memchr-2.6-mov-regression

Any help would be appreciated!

This seems to be a register pressure issue. For me, LLVM has hoisted this load of index2, but it spills the value to the stack slot 0x10(%rsp) instead of keeping it in a register: https://github.com/BurntSushi/memchr/blob/2.6.4/src/arch/generic/packedpair.rs#L237

Thanks! I’ll take a look later.

For assembly, the best way in my experience is to use perf on Linux. It should be pretty clear where the hotspot is in functions like memchr::arch::x86_64::avx2::packedpair::Finder::find_impl. Look for the SIMD instructions starting with v; that should be where most of the time is being spent. Then compare those with the instructions used in memchr 2.5.0.
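
For example, a typical workflow looks something like this (the binary and input names are placeholders for whatever your parser runs; perf record and perf annotate are the standard subcommands):

❯ perf record -o 26.data ./parser big-input.bin
❯ perf annotate -i 26.data memchr::arch::x86_64::avx2::packedpair::Finder::find_impl

perf annotate shows per-instruction sample counts, which makes an extra mov or a stack reload inside the hot loop easy to spot.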

But I’ll take a look now that I have a repro. Just not sure when. Thank you!