logos: `static LUT: [bool; 256]` compression planned?
Would it be beneficial perf-wise to turn `static LUT: [bool; 256]` into a `static LUT: [usize; 256 / 8 / size_of::<usize>()]` instead and then index into the packed bits? At least it’s more space efficient, which may be attractive for people with more complicated token rules than I have.
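Roughly what I have in mind, as a sketch (the table contents here are placeholders and `lut_contains` is just a made-up name; the real bits would be emitted by codegen):

```rust
use std::mem::size_of;

// Placeholder contents; the real bits would be emitted by codegen.
static LUT: [usize; 256 / 8 / size_of::<usize>()] = [0; 256 / 8 / size_of::<usize>()];

#[inline]
fn lut_contains(byte: u8) -> bool {
    let bits = usize::BITS as usize;
    // Pick the word holding this byte's bit, then test that bit.
    LUT[byte as usize / bits] & (1 << (byte as usize % bits)) != 0
}
```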
Doing a quick Ctrl+F through my generated code, I have 179 such tables. That’d be a reduction from about 44KiB worth of lookup tables to about 5KiB.
On a similar note, I have 57 `const LUT: [u8; 256]`s. I don’t know whether they are optimised away or kept, but if kept, most of them seem to only contain numbers < 16, most of them are less than ¼ filled, and most of them cluster their non-zero entries together. There is potential for selectively compressing some LUTs into `[u8; 128]`s for 4-bit look-ups, and even sparse table allocations, but those space optimisations may have too much of a runtime impact.
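For the 4-bit variant, something along these lines (again just a sketch with placeholder contents, packing two entries per table byte):

```rust
// Placeholder contents; two 4-bit entries per byte, so every stored value must be < 16.
static LUT4: [u8; 128] = [0; 128];

#[inline]
fn lut4_get(byte: u8) -> u8 {
    let packed = LUT4[(byte >> 1) as usize];
    // Even input bytes use the low nibble, odd ones the high nibble.
    if byte & 1 == 0 { packed & 0x0F } else { packed >> 4 }
}
```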
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (8 by maintainers)
Implemented this just now, got a tad of a performance boost across benchmarks, while the code that’s being spit out is smaller, so win-win!
Actual generated code (look at those 255s!):

8 patterns, 1 table 🎉.
Ok, I’ve read the ASM, I’ve run preliminary benchmarks.
Turns out there is basically no difference in performance between a table of `bool`s matching `LUT[byte as usize]`, and a table of `u8`s matching `LUT[byte as usize] & BITMASK > 0` where `BITMASK` is 1 shifted left by 0-7 (actually the latter seems a tiny bit faster for me right now 🤷‍♂️).

Hot loop for `[bool; 256]`:

Hot loop for `[u8; 256]` using `4` as a mask:
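For reference, the two lookup shapes whose hot loops were compared look roughly like this (hand-written sketch with placeholder table contents):

```rust
static BOOL_LUT: [bool; 256] = [false; 256]; // placeholder contents
static BYTE_LUT: [u8; 256] = [0u8; 256];     // placeholder contents

#[inline]
fn hit_bool(byte: u8) -> bool {
    BOOL_LUT[byte as usize]
}

#[inline]
fn hit_masked(byte: u8) -> bool {
    // `4` is `1 << 2`, i.e. this pattern owns bit 2 of every entry.
    BYTE_LUT[byte as usize] & 4 > 0
}
```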
This is so obvious now that I think about it. Instead of trying to compress the tables, we can just stack 8 tables into a single `[u8; 256]`. Given that code gen works recursively through the state machine graph, this should also play really nicely with cache locality, since your odds of reusing the same table with a different mask will be pretty darn high (heck, with inlining for monstrous patterns like going through `\w` it might keep the pointer to the same table in a register).
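To illustrate the stacking (hand-written example, not actual logos codegen output): each pattern owns one bit of every entry, so up to eight single-purpose `[bool; 256]` tables collapse into one shared `[u8; 256]`:

```rust
// Hand-written illustration, not actual codegen output: three patterns
// sharing one table, each pattern owning one bit of every entry.
const MASK_DIGIT: u8 = 1 << 0;
const MASK_IDENT: u8 = 1 << 1;
const MASK_WS: u8 = 1 << 2;

const fn build_lut() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        let b = i as u8;
        if b >= b'0' && b <= b'9' {
            table[i] |= MASK_DIGIT;
        }
        if (b >= b'a' && b <= b'z') || (b >= b'A' && b <= b'Z') || b == b'_' {
            table[i] |= MASK_IDENT;
        }
        if b == b' ' || b == b'\t' || b == b'\n' {
            table[i] |= MASK_WS;
        }
        i += 1;
    }
    table
}

static LUT: [u8; 256] = build_lut();

#[inline]
fn is_digit(byte: u8) -> bool {
    // Same lookup shape as before, just with a per-pattern mask.
    LUT[byte as usize] & MASK_DIGIT != 0
}
```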
I figured out something that might be useful: if the pattern match is for bytes within a range of 64 from one another (that is, `highest_byte - lowest_byte < 64`), instead of using a table we can do a lookup on a `u64` bitset. Playground:

ASM:

This optimizes much nicer than I thought it would; it effectively replaces `cmp` with `bt`. This is kind of the best-case scenario here since it assumes the highest byte value in the pattern is 63, but adding a single subtraction shouldn’t cost much (and should be faster otherwise since it doesn’t have to load the table).
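The shape of it, as a hand-written sketch (using bytes 30, 32 and 34 as the example pattern; not the exact playground code):

```rust
// One bit per matched byte value, all within a 64-wide window starting at 0.
const BITSET: u64 = 1 << 30 | 1 << 32 | 1 << 34;

#[inline]
fn matches(byte: u8) -> bool {
    // For a window that doesn't start at 0, subtract `lowest_byte` first.
    byte < 64 && BITSET >> byte & 1 != 0
}
```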
Edit: funnily enough, a naive implementation of just having `match **byte { 34 | 32 | 30 => true, _ => false }` in the hot loop produces almost identical ASM; it figures out how to produce the same constant, but it also adds a branch ahead of `bt`: