entropy: Inaccuracies in LZ complexity estimates
Hi Raphael!
I’m a big fan of your work and I am a regular user of your packages entropy, pingouin as well as yasa. These are great contributions, and I thank you for them 😃
I’ve been relying on your Numba implementation of lempel_ziv_complexity for various analyses I’m working on. I found myself in need for a faster implementation and comments in your code pointed me to Naereen’s Cython implementation.
I briefly compared the timings and accuracy of your Numba version with both the pure Python and Cython versions by Naereen. With string inputs, somehow the Numba version is slower than pure Python, both of which are slower than the Cython version. But more importantly, the estimates are different. The pure Python and Cython versions return the exact same estimates, but the Numba version returns different estimates, which are usually smaller.
I tried all three methods on 1000 randomly generated strings each for lengths [10, 100, 1000, 10000] Here’s a quick visual summary of the timings, raw estimates and differences between the 3. The difference in estimates grows with string length. 😦

I tried out a few examples by hand too. For example:
- the string
rpfqvnradvis decomposed by LZ76 to['r', 'p', 'f', 'q', 'v', 'n', 'ra', 'd']or8unique substrings- Both the Cython and pure Python versions return
8 - The Numba version returns
9
- Both the Cython and pure Python versions return
- the string
msduldaefdis decomposed by LZ76 to['m', 's', 'd', 'u', 'l', 'da', 'e', 'f']or8unique substrings- Both the Cython and pure Python versions return
8 - The Numba version returns
9
- Both the Cython and pure Python versions return
I haven’t spent any time troubleshooting the Numba code beyond this. If I get time to go through your implementation, I’ll see if there’s anything I can do to help or address this issue.
Here are versions of packages I’ve used on Python 3.8.5:
- Numba - 0.51.2
- Numpy - 1.19.2
- EntroPy - 0.1.2
- Cython - 0.29.21
System: 10th Gen Intel Core i5-1035G4 running PopOS 20.10, kernel 5.8; gcc 10.2.0; llvmlite 0.34.0
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 19 (14 by maintainers)
Hey @raphaelvallat
The Numba implementation is just one line away from working with strings! Any input string can be converted to its numeric ASCII representation very quickly, and this integer array can now directly be passed into the Numba function.
For example:
Since ASCII is a one-to-one mapping, the representation is lossless 😃
Alternatively, you could also look at memory views on character arrays from strings. (Especially if you’re working with unicode characters)
Numba vs Cython
No problem @Naereen, it’s alright Your code has been a major contribution and has inspired several other implementations! Thank you 😃
Hi @Naereen and @raphaelvallat,
Thanks for addressing these concerns 😃
I ran some benchmarks for LZ complexity estimation based on 5 different implementations:
5537546that altered estimates)I sampled randomly generated binary sequences - 1000 each for lengths 100, 1000 and 10000 in MATLAB and ran the above 5 methods. I found that implementations 3, 4 and 5 gave identical estimates for all 3000 sequences. 1 and 2 give the same estimates but these were always different from 3/4/5. I’ve attached the LZC estimates as well as the raw sequences just in case.
I think @Naereen’s new implementation lead to some potentially breaking changes? The previous version - which appears to have been used by Raphael, and is also in the notebooks - appears to be correct.
Hope this helps. Once we agree on correctness, I want to see if something can be done about performance 😃
Best, Pranay
Yes I would be happy to submit PRs to both packages 😃 I’ll probably get around to it by this weekend, hope that’s okay
@raphaelvallat you’d prefer the Numba one - correct? (with string-to-int array conversion) I ask this because Cython on Windows can be non-trivial to setup (except WSL) and you are currently not using it in your package.
@Naereen would you prefer the array or string cython versions? (or both?)