Originally Posted by Sebastiani
Huffman encoding isn't exactly an algorithm, per se, though. It really just describes a method for setting up a direct mapping of fixed-length symbols to variable-length codes, in such a way that the variable-length codes can be decoded unambiguously. That's basically it. It doesn't even address the format to be used for storing the compression model and statistics (of course, in some cases they aren't even necessary, say, when the frequency of each symbol is known a priori, or at least a close approximation of it, but that's pretty rare). As such, there are many different data structures in use for the model, many of them quite inefficient - I've even seen one that requires as much as 3 kilobytes to store the compression model. That doesn't help much in maintaining respectable compression ratios!
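Just to make the model-storage point concrete, here's a rough sketch (my own, assuming byte symbols - not taken from any particular implementation) of the usual canonical-Huffman trick: build code lengths from the frequencies, then assign codes canonically, so the decompressor only ever needs the 256 code lengths stored, not a serialized tree.

[code]
// Sketch only: build Huffman code lengths for byte symbols, then assign
// canonical codes so that the stored "model" is just 256 length bytes.
#include <cstdint>
#include <vector>
#include <queue>
#include <algorithm>

struct Node { uint64_t freq; int sym; int left, right; };

// Returns the code length (in bits) for each of the 256 byte values.
std::vector<int> huffman_lengths(const std::vector<uint64_t>& freq)
{
    std::vector<Node> pool;
    auto cmp = [&](int a, int b) { return pool[a].freq > pool[b].freq; };
    std::priority_queue<int, std::vector<int>, decltype(cmp)> heap(cmp);

    for (int s = 0; s < 256; ++s)
        if (freq[s]) {
            pool.push_back({freq[s], s, -1, -1});
            heap.push((int)pool.size() - 1);
        }
    if (heap.empty()) return std::vector<int>(256, 0);

    while (heap.size() > 1) {                 // standard Huffman merge loop
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        pool.push_back({pool[a].freq + pool[b].freq, -1, a, b});
        heap.push((int)pool.size() - 1);
    }

    // Walk the tree: a leaf's depth is its code length.
    std::vector<int> len(256, 0);
    struct Item { int node, depth; };
    std::vector<Item> stack{{heap.top(), 0}};
    while (!stack.empty()) {
        Item it = stack.back(); stack.pop_back();
        const Node& n = pool[it.node];
        if (n.sym >= 0) len[n.sym] = std::max(it.depth, 1);
        else {
            stack.push_back({n.left,  it.depth + 1});
            stack.push_back({n.right, it.depth + 1});
        }
    }
    return len;
}

// Canonical assignment: symbols sorted by (length, value) get consecutive
// codes, so the lengths alone fully determine the codebook on both ends.
std::vector<uint32_t> canonical_codes(const std::vector<int>& len)
{
    std::vector<int> order;
    for (int s = 0; s < 256; ++s) if (len[s]) order.push_back(s);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return len[a] != len[b] ? len[a] < len[b] : a < b; });

    std::vector<uint32_t> code(256, 0);
    uint32_t next = 0;
    int prev_len = 0;
    for (int s : order) {
        next <<= (len[s] - prev_len);   // widen the running code as lengths grow
        code[s] = next++;
        prev_len = len[s];
    }
    return code;
}
[/code]

With something like that the model is a flat 256 bytes (less if you run-length it), which is a long way from the 3K case I mentioned.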
The second problem is that, strange as it may sound, many implementations don't actually select the shortest codeword for a given symbol. The reasoning is that by doing so you can guarantee that the longest codeword doesn't exceed some maximum number of bits, such as 8 (compared with a possible maximum of 255), which also allows you to make all sorts of optimizations as well. Nothing wrong with that approach, I guess, if you can get it to perform well, but from my experience this is not generally the case. I'd be interested to find out whether this is just because it isn't being implemented correctly, is simply an accepted trade-off, or reflects some sort of flaw in the theory itself. Not sure.
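For what it's worth, here's a sketch of the kind of optimization the length cap buys you (identifiers are mine, and it assumes canonical codes like the ones in the previous sketch): if no codeword is longer than, say, 8 bits, the decoder can resolve a symbol with a single lookup into a 2^8-entry table instead of walking a tree bit by bit.

[code]
// Sketch only: with codeword lengths capped at MAX_BITS, build a flat
// lookup table so decoding is one table read per symbol. Every MAX_BITS-wide
// bit pattern whose prefix matches a codeword maps to that symbol + length;
// the decoder peeks MAX_BITS from the stream, reads the entry, and then
// consumes only `length` bits (bit-reader itself not shown here).
#include <cstdint>
#include <vector>

constexpr int MAX_BITS = 8;

struct DecodeEntry { uint8_t symbol; uint8_t length; };

std::vector<DecodeEntry> build_decode_table(const std::vector<int>& len,
                                            const std::vector<uint32_t>& code)
{
    std::vector<DecodeEntry> table(1u << MAX_BITS, DecodeEntry{0, 0});
    for (int s = 0; s < 256; ++s) {
        if (len[s] == 0 || len[s] > MAX_BITS) continue;   // unused or over the cap
        uint32_t base  = code[s] << (MAX_BITS - len[s]);  // left-align the codeword
        uint32_t count = 1u << (MAX_BITS - len[s]);       // all possible suffixes
        for (uint32_t i = 0; i < count; ++i)
            table[base + i] = { (uint8_t)s, (uint8_t)len[s] };
    }
    return table;
}
[/code]

So the speed argument for capping lengths is real enough - the open question, to me, is whether the compression loss from not always using the optimal code is inherent or just a symptom of sloppy implementations.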
So the whole point of this exercise is to explore the fundamental limits of the method, and see just how far it can be improved, really.