algorithm-archive: Edge case fail in all Huffman Encoding implementations.
Whenever any (or most) of the Huffman Encoding implementations receive a string with only one character multiple times (think 'aaaaaaaaaaaaa'
, or 'bbbb'
). The functions fail and give either errors, or weird empty output. Here are some examples of Python, Java and Lua respectively. I assume this has to be fixed because it nowhere states in the chapter that it shouldn’t work on one-character-strings.
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 26 (22 by maintainers)
I mean, Kroppeb is right here. In actual implementations, it makes sense that we keep track of the length of the data… This should be mentioned in the chapter (and we should also explicitly link to deflate), but I feel that it is better to just return the 0’s for now because it’s easier to reason about for now (it’s also what I was taught in data compression, so I think it is the more common approach for math folks).
I have been wanting to talk about the Vitter algorithm for a while as an extension to this one and think that is a good place to introduce the more technical aspects.
I think returning an empty sequence makes more sense, because the algorithm is all about compression. and
(Leaf 'a' 100, [])
is a heck of a lot more compressed than(Leaf 'a' 100, "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000")
I took a quick look and this issue is also there in the Haskell implementation. The
encoded
procedure returns an empty string and thedecoded
procedure returns an infinite list ofb
’s.My guess is that this is also due to the same reason as mine: leaves and non-leaf nodes are treated differently. So it might be possible to fix in one of the two ways I suggested above.
https://stackoverflow.com/questions/22429854/huffman-code-for-a-single-character