logos: `static LUT: [bool; 256]` compression planned?

Would it be beneficial perf-wise to turn `static LUT: [bool; 256]` into a `static LUT: [usize; 256 / 8 / size_of::<usize>()]` instead and then index the packed bits? At least it’s more space efficient, which may be attractive for people with more complicated token rules than I have.
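
For reference, a minimal sketch of what that packed layout could look like (`WORDS`, `PACKED_LUT`, `packed_lookup` and the example accepted bytes are made up for illustration, not what logos generates):

use core::mem::size_of;

// One bit per byte value instead of one bool: 256 bits packed into usize words.
const WORDS: usize = 256 / 8 / size_of::<usize>();
const WORD_BITS: usize = 8 * size_of::<usize>();

static PACKED_LUT: [usize; WORDS] = {
    let mut lut = [0usize; WORDS];
    // Example accepted bytes; a real table would be generated by the derive macro.
    let accepted = [b'_', b'a', b'z'];
    let mut i = 0;
    while i < accepted.len() {
        let byte = accepted[i] as usize;
        lut[byte / WORD_BITS] |= 1 << (byte % WORD_BITS);
        i += 1;
    }
    lut
};

#[inline]
fn packed_lookup(byte: u8) -> bool {
    let byte = byte as usize;
    PACKED_LUT[byte / WORD_BITS] & (1 << (byte % WORD_BITS)) != 0
}

On a 64-bit target that is `[usize; 4]`, i.e. 32 bytes per table instead of 256.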

Doing a quick Ctrl+F through my generated code, I have 179 such tables. That’d be a reduction from about 44KiB worth of lookup tables to about 5KiB.

On a similar note, I have 57 `const LUT: [u8; 256]`s. I dunno whether they are optimised away or kept, but if kept: most of them seem to only contain numbers < 16, most are less than ¼ filled, and in most of them the non-zero entries are clustered together. There is potential for selectively compressing some LUTs into `[u8; 128]`s for 4-bit look-ups, and even for sparse table allocations. But those space optimisations may have too much of a runtime impact. Dunno.
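
To illustrate the 4-bit idea (the `NIBBLE_LUT`/`nibble_lookup` names and the all-zero placeholder table are made up): two entries share one byte, selected by the low bit of the index.

// Placeholder table: a real [u8; 128] would be generated with two 4-bit
// entries packed per byte, assuming every entry is < 16.
static NIBBLE_LUT: [u8; 128] = [0; 128];

#[inline]
fn nibble_lookup(byte: u8) -> u8 {
    let packed = NIBBLE_LUT[(byte >> 1) as usize];
    // Even bytes live in the low nibble, odd bytes in the high nibble.
    if byte & 1 == 0 {
        packed & 0x0F
    } else {
        packed >> 4
    }
}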

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

Implemented this just now, got a tad of a performance boost across benchmarks, while the code that’s being spit out is smaller, so win-win!

Actual generated code (look at those 255s!):

        static COMPACT_TABLE_0: [u8; 256] = [
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 127, 127, 127, 127, 127,
            127, 127, 127, 127, 127, 0, 0, 0, 0, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255,
            255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
            255, 0, 0, 0, 0, 255, 0, 223, 255, 255, 255, 239, 255, 255, 255, 253, 255, 255, 255,
            255, 255, 255, 255, 255, 255, 191, 251, 255, 247, 255, 255, 255, 255, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        ];

        // ...

        #[inline]
        fn pattern0(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 1 > 0
        }

        // ...

        #[inline]
        fn pattern4(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 2 > 0
        }
        #[inline]
        fn pattern5(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 4 > 0
        }
        #[inline]
        fn pattern6(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 8 > 0
        }
        #[inline]
        fn pattern7(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 16 > 0
        }

        // ...

        #[inline]
        fn pattern8(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 32 > 0
        }
        #[inline]
        fn pattern9(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 64 > 0
        }

        // ...

        #[inline]
        fn pattern10(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 128 > 0
        }

8 patterns, 1 table 🎉.

Ok, I’ve read the ASM, I’ve run preliminary benchmarks.

Turns out there is basically no difference in performance between a table of bools matched with `LUT[byte as usize]`, and a table of u8s matched with `LUT[byte as usize] & BITMASK > 0`, where `BITMASK` is 1 shifted left by 0–7 (actually the latter seems a tiny bit faster for me right now 🤷‍♂️).

hot loop for `[bool; 256]`:

.LBB0_2:
        movzx   edx, byte ptr [rdi + rax]
        cmp     byte ptr [rdx + rcx], 0
        je      .LBB0_4
        add     rax, 1
        cmp     rsi, rax
        jne     .LBB0_2

hot loop for `[u8; 256]` using 4 as a mask:

.LBB1_2:
        movzx   edx, byte ptr [rdi + rax]
        test    byte ptr [rdx + rcx], 4
        je      .LBB1_4
        add     rax, 1
        cmp     rsi, rax
        jne     .LBB1_2

This is so obvious now that I think about it. Instead of trying to compress the tables, we can just stack 8 tables into a single `[u8; 256]`. Given that code gen works recursively through the state machine graph, this should also play really nicely with cache locality, since the odds of reusing the same table with a different mask will be pretty darn high (heck, with inlining, for monstrous patterns like going through `\w` it might even keep the pointer to the same table in a register).
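
A rough sketch of that stacking step (the `stack_tables` helper and the `fn(u8) -> bool` stand-ins are hypothetical, not the actual codegen):

/// Build one [u8; 256] answering up to 8 boolean patterns at once:
/// bit `i` of `table[byte]` says whether `patterns[i]` accepts `byte`.
fn stack_tables(patterns: &[fn(u8) -> bool]) -> [u8; 256] {
    assert!(patterns.len() <= 8, "a u8 entry only holds 8 patterns");

    let mut table = [0u8; 256];
    for byte in 0..=255u8 {
        for (bit, pattern) in patterns.iter().enumerate() {
            if pattern(byte) {
                table[byte as usize] |= 1 << bit;
            }
        }
    }
    table
}

A lookup for pattern `i` is then `table[byte as usize] & (1 << i) > 0`, which is exactly the shape of the generated `pattern4`…`pattern10` functions above.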

I figured out something that might be useful: if the pattern matches only bytes within a range of 64 of one another (that is, `highest_byte - lowest_byte < 64`), then instead of using a table we can do the lookup on a `u64` bitset. Playground:

#[no_mangle]
fn packed_low64(bytes: &[u8]) -> usize {
    // Bits 30, 32 and 34 set, mirroring the lut[30]/lut[32]/lut[34] entries below.
    const MASK: u64 = (1 << 34) | (1 << 32) | (1 << 30);

    bytes
        .iter()
        // `% 64` keeps the shift amount in range, so `checked_shl` never returns
        // `None` here and bytes >= 64 simply wrap onto lower bit positions (this
        // is the "highest byte value in the pattern is 63" assumption noted below).
        .take_while(|byte| match 1u64.checked_shl((**byte as u32) % 64) {
            Some(shift) => shift & MASK != 0,
            None => false,
        })
        .count()
}

#[no_mangle]
fn lut(bytes: &[u8]) -> usize {
    const LUT: [bool; 256] = {
        let mut lut = [false; 256];
        lut[34] = true;
        lut[32] = true;
        lut[30] = true;
        lut
    };

    bytes.iter().take_while(|byte| LUT[**byte as usize]).count()
}

ASM:


packed_low64:
        xor     eax, eax
        test    rsi, rsi
        je      .LBB0_4
        movabs  rcx, 22548578304
.LBB0_2:
        movzx   edx, byte ptr [rdi + rax]
        bt      rcx, rdx
        jae     .LBB0_4
        add     rax, 1
        cmp     rsi, rax
        jne     .LBB0_2
.LBB0_4:
        ret

lut:
        xor     eax, eax
        test    rsi, rsi
        je      .LBB1_4
        lea     rcx, [rip + .L__unnamed_1]
.LBB1_2:
        movzx   edx, byte ptr [rdi + rax]
        cmp     byte ptr [rdx + rcx], 0
        je      .LBB1_4
        add     rax, 1
        cmp     rsi, rax
        jne     .LBB1_2
.LBB1_4:
        ret

.L__unnamed_1:
        .asciz  "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\001\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"

This optimizes much more nicely than I thought it would: it effectively replaces the `cmp` against the table with a `bt` on a constant held in a register. This is kind of the best case scenario, since it assumes the highest byte value in the pattern is 63, but adding a single subtraction shouldn’t cost much (and it should otherwise be faster, since it doesn’t have to load the table).
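
For patterns that don’t start near 0, that single subtraction could look something like this (a sketch with made-up pattern bytes, not generated code):

#[no_mangle]
fn packed_shifted(bytes: &[u8]) -> usize {
    // Pattern bytes b'x', b'y', b'z' all sit within 64 of the lowest one,
    // so shift them down by LOW and test a u64 bitset.
    const LOW: u8 = b'x';
    const MASK: u64 = (1 << (b'x' - LOW)) | (1 << (b'y' - LOW)) | (1 << (b'z' - LOW));

    bytes
        .iter()
        .take_while(|byte| {
            // wrapping_sub sends bytes below LOW far outside 0..64, and
            // checked_shl rejects anything outside the 64-bit window, so no
            // byte outside the pattern range can alias onto the mask.
            match 1u64.checked_shl((**byte).wrapping_sub(LOW) as u32) {
                Some(bit) => bit & MASK != 0,
                None => false,
            }
        })
        .count()
}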

Edit: funnily enough, a naive implementation that just has `match **byte { 34 | 32 | 30 => true, _ => false }` in the hot loop produces almost identical ASM; the compiler figures out the same constant, but it also adds a range check ahead of the `bt`:


naive:
        xor     eax, eax
        test    rsi, rsi
        je      .LBB2_5
        movabs  rcx, 22548578304
.LBB2_2:
        movzx   edx, byte ptr [rdi + rax]
        cmp     rdx, 34
        ja      .LBB2_5
        bt      rcx, rdx
        jae     .LBB2_5
        add     rax, 1
        cmp     rsi, rax
        jne     .LBB2_2
.LBB2_5:
        ret