algorithm-archive: Edge case fail in all Huffman Encoding implementations.

Whenever any (or most) of the Huffman Encoding implementations receive a string with only one character multiple times (think 'aaaaaaaaaaaaa', or 'bbbb'). The functions fail and give either errors, or weird empty output. Here are some examples of Python, Java and Lua respectively. I assume this has to be fixed because it nowhere states in the chapter that it shouldn’t work on one-character-strings.

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 26 (22 by maintainers)

Most upvoted comments

The way I see it, while making an encoding edge case where the codebook indicates “this string has only the letter A” and the bitstring instead encodes the amount of characters would be the optimal thing to do, the algorithmically correct implementation would be instead for the codebook to say “the character A corresponds to a 0” and then the bitstring has as many zeroes as the original string has A’s, as that would be internally consistent with how this algorithm handles all other cases.

That’s actually not correct. Most implementations (eg DEFLATE) don’t have a marker for the end of the bitstring, and instead, store the length of the uncompressed data. Because we are storing the bits into a string we had an implicit end marker (end of the string). IMHO the correct encoding result is the empty string, but the decode function should somehow be told what the target length is. This can be done by example by making the encoder return an Encoded object that stores the length and bits.

I mean, Kroppeb is right here. In actual implementations, it makes sense that we keep track of the length of the data… This should be mentioned in the chapter (and we should also explicitly link to deflate), but I feel that it is better to just return the 0’s for now because it’s easier to reason about for now (it’s also what I was taught in data compression, so I think it is the more common approach for math folks).

I have been wanting to talk about the Vitter algorithm for a while as an extension to this one and think that is a good place to introduce the more technical aspects.

I think returning an empty sequence makes more sense, because the algorithm is all about compression. and (Leaf 'a' 100, []) is a heck of a lot more compressed than (Leaf 'a' 100, "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000")

I took a quick look and this issue is also there in the Haskell implementation. The encoded procedure returns an empty string and the decoded procedure returns an infinite list of b’s.

My guess is that this is also due to the same reason as mine: leaves and non-leaf nodes are treated differently. So it might be possible to fix in one of the two ways I suggested above.