entropy: Inaccuracies in LZ complexity estimates

Hi Raphael!

I’m a big fan of your work and I am a regular user of your packages entropy, pingouin as well as yasa. These are great contributions, and I thank you for them 😃

I’ve been relying on your Numba implementation of lempel_ziv_complexity for various analyses I’m working on. I found myself in need of a faster implementation, and the comments in your code pointed me to Naereen’s Cython implementation.

I briefly compared the timings and accuracy of your Numba version against both the pure Python and Cython versions by Naereen. With string inputs, the Numba version is somehow slower than pure Python, and both are slower than the Cython version. More importantly, though, the estimates differ: the pure Python and Cython versions return exactly the same estimates, while the Numba version returns different estimates, which are usually smaller.

I tried all three methods on 1000 randomly generated strings each, for lengths [10, 100, 1000, 10000]. Here’s a quick visual summary of the timings, raw estimates, and differences between the three. The difference in estimates grows with string length. 😦

[Image: timing, estimate, and estimate-difference comparison plots for the three implementations]
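For reproducibility, the sampling step I used was along these lines (the lowercase alphabet, seed, and RNG shown here are illustrative assumptions, not my exact script):

```python
import random
import string

rng = random.Random(42)  # arbitrary seed, for reproducibility only
lengths = [10, 100, 1000, 10000]

# 1000 random lowercase strings per length, mirroring the benchmark setup
samples = {
    n: [''.join(rng.choices(string.ascii_lowercase, k=n)) for _ in range(1000)]
    for n in lengths
}
```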

I tried out a few examples by hand too. For example:

  • the string rpfqvnradv is decomposed by LZ76 to ['r', 'p', 'f', 'q', 'v', 'n', 'ra', 'd'] or 8 unique substrings
    • Both the Cython and pure Python versions return 8
    • The Numba version returns 9
  • the string msduldaefd is decomposed by LZ76 to ['m', 's', 'd', 'u', 'l', 'da', 'e', 'f'] or 8 unique substrings
    • Both the Cython and pure Python versions return 8
    • The Numba version returns 9
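For concreteness, the greedy set-based parsing that the pure Python and Cython versions perform can be sketched as follows (my own re-implementation for illustration, not the packages’ exact code):

```python
def lz_complexity(sequence):
    """Count distinct substrings produced by a greedy left-to-right parse."""
    substrings = set()
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        candidate = sequence[ind:ind + inc]
        if candidate in substrings:
            inc += 1                   # phrase seen before: extend it
        else:
            substrings.add(candidate)  # new phrase found: start over after it
            ind += inc
            inc = 1
    return len(substrings)

print(lz_complexity('rpfqvnradv'))   # 8 -> {'r','p','f','q','v','n','ra','d'}
print(lz_complexity('msduldaefd'))   # 8
```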

I haven’t spent any time troubleshooting the Numba code beyond this. If I get time to go through your implementation, I’ll see if there’s anything I can do to help or address this issue.


Here are the versions of the packages I’ve used on Python 3.8.5:

  • Numba - 0.51.2
  • Numpy - 1.19.2
  • EntroPy - 0.1.2
  • Cython - 0.29.21

System: 10th Gen Intel Core i5-1035G4 running PopOS 20.10, kernel 5.8; gcc 10.2.0; llvmlite 0.34.0

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 19 (14 by maintainers)


Most upvoted comments

Hey @raphaelvallat

The Numba implementation is just one line away from working with strings! Any input string can be converted to its numeric ASCII representation very quickly, and the resulting integer array can then be passed directly into the Numba function.

For example:

```python
import numpy as np

string = 'aardvark'

# Fast NumPy route for converting an iterable of characters to an integer array
np.fromiter(map(ord, string), count=len(string), dtype='uint64')
# array([ 97,  97, 114, 100, 118,  97, 114, 107], dtype=uint64)  <- works with Numba
```

Since the mapping from characters to code points is one-to-one, the representation is lossless 😃

Alternatively, you could also look at memory views over character arrays built from strings, especially if you’re working with Unicode characters.
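As a quick sketch of the buffer-based route (my own example, not code from either package): `np.frombuffer` gives a zero-copy uint8 view of the encoded bytes, and the mapping inverts cleanly:

```python
import numpy as np

string = 'aardvark'

# Zero-copy (read-only) view over the ASCII-encoded bytes
codes = np.frombuffer(string.encode('ascii'), dtype=np.uint8)
print(codes)  # [ 97  97 114 100 118  97 114 107]

# The mapping is invertible, so no information is lost
assert codes.tobytes().decode('ascii') == string
```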


Numba vs Cython

  • For arrays of size ~10,000, the difference is barely noticeable
  • For larger arrays, the Cython version can be up to ~1.4-1.5x faster. I’ve implemented Cython functions for both strings and arrays; the latter is generally faster: gist_timings_cython

No problem @Naereen, it’s alright. Your code has been a major contribution and has inspired several other implementations! Thank you 😃

Hi @Naereen and @raphaelvallat,

Thanks for addressing these concerns 😃

I ran some benchmarks for LZ complexity estimation based on 5 different implementations:

  1. Naereen’s current Cython implementation
  2. Naereen’s current Python implementation
  3. Naereen’s previous Python implementation (prior to the patch for issue 5 in commit 5537546, which altered the estimates)
  4. Raphael’s Numba implementation (based on 3 above)
  5. Quang Thai’s MATLAB implementation

I sampled randomly generated binary sequences in MATLAB (1000 each for lengths 100, 1000, and 10000) and ran the five methods above. I found that implementations 3, 4, and 5 gave identical estimates for all 3000 sequences. Implementations 1 and 2 gave the same estimates as each other, but these always differed from 3/4/5. I’ve attached the LZC estimates as well as the raw sequences, just in case.
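For what it’s worth, a textbook Kaspar–Schuster LZ76 parser, which also counts the trailing phrase even when the sequence ends before it can be completed, reproduces the Numba estimates reported above for the two example strings. This is my own sketch under that assumption, not any of the five implementations:

```python
def lz76_kaspar_schuster(s):
    """LZ76 complexity following the Kaspar & Schuster (1987) scheme.

    The final phrase is counted even if the sequence ends mid-phrase,
    which gives 9 (not 8) on strings such as 'rpfqvnradv'.
    """
    n = len(s)
    i, k, k_max = 0, 1, 1
    c, l = 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:        # reached the end mid-phrase: count it
                c += 1
                break
        else:
            k_max = max(k, k_max)
            i += 1
            if i == l:           # no earlier match extends further: new phrase
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

print(lz76_kaspar_schuster('rpfqvnradv'))   # 9
print(lz76_kaspar_schuster('msduldaefd'))   # 9
```

If this is indeed the convention behind implementations 3/4/5, the discrepancy with 1/2 would come down to how the last (possibly incomplete) phrase is counted.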

I think @Naereen’s new implementation led to some potentially breaking changes? The previous version, which Raphael seems to have used and which is also in the notebooks, appears to be correct.

Hope this helps. Once we agree on correctness, I want to see if something can be done about performance 😃

Best, Pranay

Yes, I would be happy to submit PRs to both packages 😃 I’ll probably get around to it by this weekend; hope that’s okay.

@raphaelvallat, you’d prefer the Numba one (with string-to-int array conversion), correct? I ask because Cython can be non-trivial to set up on Windows (outside WSL) and you are not currently using it in your package.

@Naereen, would you prefer the array or string Cython versions? (Or both?)