logos: `static LUT: [bool; 256]` compression planned?
Would it be beneficial perf-wise to turn `static LUT: [bool; 256]` into a `static LUT: [usize; 256 / 8 / size_of::<usize>()]` instead and then index packed bits? At least it’s more space efficient, which may be attractive for people with more complicated token rules than I have.
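As a rough illustration, here’s a minimal sketch of what the packed table and its bit-indexing could look like (assuming a 64-bit target; the table contents, the digit pattern, and the `lut_contains` helper are made up for the example, not the actual generated code):

```rust
use std::mem::size_of;

// Packed variant of a `[bool; 256]` table: one flag per bit, so the whole
// table fits in 256 / 8 / size_of::<usize>() words (4 words on 64-bit).
static LUT: [usize; 256 / 8 / size_of::<usize>()] = [
    // Example contents: flags set for the ASCII digits b'0'..=b'9' (bits 48..=57).
    0x03FF_0000_0000_0000,
    0,
    0,
    0,
];

const WORD_BITS: usize = size_of::<usize>() * 8;

#[inline]
fn lut_contains(byte: u8) -> bool {
    let i = byte as usize;
    (LUT[i / WORD_BITS] >> (i % WORD_BITS)) & 1 != 0
}

fn main() {
    assert!(lut_contains(b'7'));
    assert!(!lut_contains(b'a'));
}
```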
Doing a quick Ctrl+F through my generated code, I have 179 such tables. That’d be a reduction from about 44 KiB worth of lookup tables (179 × 256 B) to about 5 KiB (179 × 32 B).
On a similar note, I have 57 `const LUT: [u8; 256]`s. I dunno whether they are optimised away or kept, but if kept, most of them seem to only contain numbers < 16, most of them are less than ¼ filled, and most of them cluster their non-zero entries. There is potential for selectively compressing some `LUT`s to `[u8; 128]`s for 4-bit look-ups, and even for sparse table allocations. But those space optimisations may have too much of a runtime impact. Dunno.
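For the 4-bit case, a sketch of how two entries could share one byte (here the packing happens at runtime purely to show the indexing; in generated code the packed table would be emitted as a `const`, and the example values are invented):

```rust
/// Pack a full 256-entry table (all values assumed to be < 16) into 128 bytes,
/// storing even indices in the low nibble and odd indices in the high nibble.
fn pack(full: &[u8; 256]) -> [u8; 128] {
    let mut packed = [0u8; 128];
    for i in 0..256 {
        packed[i / 2] |= (full[i] & 0x0F) << ((i & 1) * 4);
    }
    packed
}

/// Read back entry `byte` from the packed table.
fn lookup(packed: &[u8; 128], byte: u8) -> u8 {
    (packed[(byte / 2) as usize] >> ((byte & 1) * 4)) & 0x0F
}

fn main() {
    let mut full = [0u8; 256];
    full[b'a' as usize] = 3;
    full[b'b' as usize] = 5;
    let packed = pack(&full);
    assert_eq!(lookup(&packed, b'a'), 3);
    assert_eq!(lookup(&packed, b'b'), 5);
    assert_eq!(lookup(&packed, b'z'), 0);
}
```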
Implemented this just now, got a tad of a performance boost across benchmarks, while the code that’s being spit out is smaller, so win-win!
Actual generated code (look at those `255`s!):

8 patterns, 1 table 🎉.
Ok, I’ve read the ASM, I’ve run preliminary benchmarks.

Turns out there is basically no difference in performance between a table of `bool`s matching `LUT[byte as usize]`, and a table of `u8`s matching `LUT[byte as usize] & BITMASK > 0` where `BITMASK` is 1 shifted left by 0–7 (actually the latter seems a tiny bit faster for me right now 🤷‍♂️).
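For reference, a rough sketch of the two lookup shapes in Rust (the tables, the digit pattern, and the counting loops are stand-ins for illustration, not the actual generated hot loop):

```rust
// Variant 1: a dedicated `[bool; 256]` table per pattern.
fn count_matches_bool(lut: &[bool; 256], bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| lut[b as usize]).count()
}

// Variant 2: a shared `[u8; 256]` table, one bit per pattern, selected by mask.
fn count_matches_masked(lut: &[u8; 256], mask: u8, bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| lut[b as usize] & mask > 0).count()
}

fn main() {
    let mut bool_lut = [false; 256];
    let mut byte_lut = [0u8; 256];
    for b in b'0'..=b'9' {
        bool_lut[b as usize] = true;
        byte_lut[b as usize] |= 1 << 2; // this pattern owns bit 2, i.e. mask 4
    }
    let input = b"abc123";
    assert_eq!(count_matches_bool(&bool_lut, input), 3);
    assert_eq!(count_matches_masked(&byte_lut, 4, input), 3);
}
```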
Hot loop for `[bool; 256]`:

Hot loop for `[u8; 256]` using `4` as a mask:

This is so obvious now that I think about this. Instead of trying to compress the tables, we can just stack 8 tables into a single `[u8; 256]`. Given that code gen works recursively through the state machine graph, this should also play really nicely with cache locality, since your odds of reusing the same table with a different mask will be pretty darn high (heck, with inlining for monstrous patterns like going through `\w` it might keep the pointer to the same table in a register).
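A hedged sketch of the stacking idea (the two pattern classes here, digits and a `\w`-like class, are invented for the example; in practice the derive macro would emit the merged table directly):

```rust
// Up to 8 per-pattern `[bool; 256]` tables collapse into one shared
// `[u8; 256]`, where pattern k owns bit k of every entry.
fn stacked_table() -> [u8; 256] {
    let mut table = [0u8; 256];
    for b in 0..=255u8 {
        if b.is_ascii_digit() {
            table[b as usize] |= 1 << 0; // bit 0: digit pattern
        }
        if b.is_ascii_alphanumeric() || b == b'_' {
            table[b as usize] |= 1 << 1; // bit 1: \w-like pattern
        }
    }
    table
}

fn main() {
    let table = stacked_table();
    // Different states query the same table with different masks.
    assert!(table[b'7' as usize] & (1 << 0) > 0); // '7' matches the digit pattern
    assert!(table[b'x' as usize] & (1 << 1) > 0); // 'x' matches the \w-like pattern
    assert_eq!(table[b' ' as usize], 0);          // ' ' matches neither
}
```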
I figured out something that might be useful: if the pattern match is for bytes within a range of 64 from one another (that is, `highest_byte - lowest_byte < 64`), instead of using a table we can do a lookup on a `u64` bitset.

Playground:

ASM:
This optimizes much nicer than I thought it would, effectively replacing `cmp` with `bt`. This is kind of the best case scenario here since it assumes the highest byte value in the pattern is 63, but adding a single subtraction shouldn’t cost much (and it should be faster otherwise since it doesn’t have to load the table).
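To make it concrete, here’s a rough sketch of both the best-case check and the offset variant (the byte values 30, 32 and 34 mirror the example pattern in the edit below; the function names and structure are mine, not the actual codegen):

```rust
// Best case: every byte in the pattern is < 64, so the bitset can be
// indexed by the byte value directly.
const BITSET: u64 = (1 << 30) | (1 << 32) | (1 << 34);

#[inline]
fn matches(byte: u8) -> bool {
    // Shifting by >= 64 would be out of range, so gate on the value first;
    // ideally this compiles down to a compare plus a bit test.
    byte < 64 && BITSET & (1u64 << byte) != 0
}

// Offset variant: subtract the lowest byte so any span where
// `highest_byte - lowest_byte < 64` fits into a single u64.
const LOW: u8 = 30;
const OFFSET_BITSET: u64 = (1 << (30 - LOW)) | (1 << (32 - LOW)) | (1 << (34 - LOW));

#[inline]
fn matches_offset(byte: u8) -> bool {
    // Bytes below LOW wrap around to large values and fail the range check.
    let shifted = byte.wrapping_sub(LOW);
    shifted < 64 && OFFSET_BITSET & (1u64 << shifted) != 0
}

fn main() {
    for b in 0..=255u8 {
        let expected = matches!(b, 30 | 32 | 34);
        assert_eq!(matches(b), expected);
        assert_eq!(matches_offset(b), expected);
    }
}
```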
Edit: funnily enough, a naive implementation of just having a `match **byte { 34 | 32 | 30 => true, _ => false }` in the hot loop produces almost identical ASM; it figures out to produce the same constant, but it also adds a branch ahead of `bt`: