Thread: A Method for the Construction of Minimum-Redundancy Codes

  1. #1
    Sebastiani
    Quote Originally Posted by abachler
    Define non-Huffman. If 99% of the compression takes place using non-Huffman methods and I then use Huffman to preprocess or post-process the data, does that count as a Huffman method, or are we limited to only those methods specifically discussed in his paper?
    The output basically consists of two things: (1) the compression model and (2) the actual compressed data. The former doesn't matter at all - you can use whatever format you want. The latter, however, must encode the symbols using Huffman coding only; you can't, for example, use arithmetic coding to compress the data. And just to be clear, Huffman coding doesn't dictate which particular bit pattern gets assigned to which symbol - that assignment is arbitrary. All that matters is that the code is unambiguous (prefix-free) and 'minimally redundant'.
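
    Just to illustrate (a rough sketch only - the layout and field names here are purely hypothetical, not any required format):

        #include <cstdint>
        #include <vector>

        // Hypothetical container layout.  The model section is free-form; the
        // payload must contain nothing but Huffman codewords, bit-packed, and
        // padded to a byte boundary only at the very end.
        struct CompressedFile {
            std::vector<std::uint8_t> model;       // (1) compression model, any format you like
            std::vector<std::uint8_t> payload;     // (2) Huffman-encoded symbols only
            std::uint64_t             symbolCount; // how many codewords the decoder should read
        };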

  2. #2
    abachler
    Quote Originally Posted by Sebastiani
    So we can use any method we want to preprocess the data, as long as the final compression pass is done using Huffman?

    In that case we will actually need to know the nature of the input data, because the choice of preprocessor depends on what kind of data is being compressed.

  3. #3
    Sebastiani
    Quote Originally Posted by abachler
    No. The input data must not be preprocessed in any way (unless by 'preprocess' you simply mean analyzing it, which is perfectly fine). In other words, if you want to pass the data through an LZW compressor for the sole purpose of gathering statistics, go for it - as long as nothing other than the Huffman encoder has compressed the data by the time it ends up in the output file, rule #8 hasn't been violated.

    EDIT: And just to be clear, this means you can't even apply something as simple as RLE to the data, since that is itself a form of compression. The two areas you should be concentrating on, then, are:

    - Identifying the most efficient encoding for each symbol. Many implementations don't get this right and end up with larger-than-necessary codewords.
    - Choosing an efficient format for the compression model/statistics. This is another area that is often neglected.
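
    For instance, something along these lines stays within the rules - just a sketch, but the point is that the data is only read and counted, never transformed:

        #include <array>
        #include <cstdint>
        #include <cstdio>

        // Statistics-only pass: the input is read and byte frequencies are counted,
        // but nothing is transformed or written, so no compression has happened yet.
        std::array<std::uint64_t, 256> countFrequencies(std::FILE* in)
        {
            std::array<std::uint64_t, 256> freq{};
            int c;
            while ((c = std::fgetc(in)) != EOF)
                ++freq[static_cast<unsigned char>(c)];
            return freq;
        }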

  4. #4
    abachler
    So as long as it's not compressed in the preprocessing stage, we can do anything we want, like convert it from one format to another?

    Because if we are limited to performing only Huffman, I don't really see what the point of the contest is, since Huffman is a fixed algorithm.

  5. #5
    Sebastiani
    Quote Originally Posted by abachler
    Huffman encoding isn't exactly an algorithm, per se, though. It really just describes a method for setting up a mapping from fixed-length symbols to variable-length codes, in such a way that the variable-length codes can be decoded unambiguously. That's basically it. It doesn't even address the format to be used for storing the compression model and statistics (of course, in some cases they aren't even necessary - say, when the frequency of each symbol, or at least a close approximation, is known a priori - but that's pretty rare). As such, there are many different types of data structures in use, many of them quite inefficient - I've even seen one that requires as much as 3 kilobytes to store the compression model. That doesn't help much in maintaining respectable compression ratios!

    The second problem is that, strange as it may sound, many implementations don't actually select the shortest codeword for a given symbol. The reasoning is that by doing so you can guarantee the longest codeword doesn't exceed some maximum number of bits, such as 8 (compared with the theoretical maximum of 255), which also lets you make all sorts of optimizations. Nothing wrong with that approach, I guess, if you can get it to perform well, but in my experience that isn't generally the case. I'd be interested to find out whether that is just because it isn't being implemented correctly, is an accepted trade-off, or points to some flaw in the theory itself. Not sure.

    So the whole point of this exercise is to explore the fundamental limits of the method, and see just how far it can be improved, really.
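
    FWIW, here is roughly the shape of what I have in mind - only a sketch (assuming byte symbols and codeword lengths that fit in 32 bits), not a reference implementation. The code lengths come out of the usual two-smallest-merge, and the codes are then handed out canonically, so the 'model' that ends up in the file can be as small as one length per symbol:

        #include <algorithm>
        #include <cstdint>
        #include <functional>
        #include <queue>
        #include <utility>
        #include <vector>

        // Sketch of a canonical Huffman coder: optimal code lengths via the
        // standard merge of the two smallest frequencies, then canonical codes.
        struct Node { std::uint64_t freq; int left, right, symbol; };

        std::vector<int> codeLengths(const std::vector<std::uint64_t>& freq)
        {
            std::vector<Node> nodes;
            using Item = std::pair<std::uint64_t, int>;   // (frequency, node index)
            std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;

            for (int s = 0; s < (int)freq.size(); ++s)
                if (freq[s] > 0) {
                    nodes.push_back({freq[s], -1, -1, s});
                    pq.push({freq[s], (int)nodes.size() - 1});
                }

            while (pq.size() > 1) {                       // merge the two smallest
                auto [fa, a] = pq.top(); pq.pop();
                auto [fb, b] = pq.top(); pq.pop();
                nodes.push_back({fa + fb, a, b, -1});
                pq.push({fa + fb, (int)nodes.size() - 1});
            }

            std::vector<int> lengths(freq.size(), 0);
            if (nodes.empty()) return lengths;

            // walk the tree, recording each leaf's depth (a lone symbol still gets 1 bit)
            std::vector<std::pair<int, int>> stack{{(int)nodes.size() - 1, 0}};
            while (!stack.empty()) {
                auto [i, depth] = stack.back(); stack.pop_back();
                if (nodes[i].symbol >= 0)
                    lengths[nodes[i].symbol] = depth > 0 ? depth : 1;
                else {
                    stack.push_back({nodes[i].left,  depth + 1});
                    stack.push_back({nodes[i].right, depth + 1});
                }
            }
            return lengths;
        }

        // Canonical assignment: symbols sorted by (length, value) receive
        // consecutive codewords, so a decoder can rebuild the whole table
        // from the lengths alone.
        std::vector<std::uint32_t> canonicalCodes(const std::vector<int>& len)
        {
            std::vector<int> order;
            for (int s = 0; s < (int)len.size(); ++s)
                if (len[s] > 0) order.push_back(s);
            std::sort(order.begin(), order.end(), [&](int a, int b) {
                return len[a] != len[b] ? len[a] < len[b] : a < b; });

            std::vector<std::uint32_t> code(len.size(), 0);
            std::uint64_t next = 0;                       // 64-bit keeps the shifts safe
            int prevLen = 0;
            for (int s : order) {
                next <<= (len[s] - prevLen);              // widen to the new length
                code[s] = static_cast<std::uint32_t>(next++);
                prevLen = len[s];
            }
            return code;
        }

    The decoder can rebuild exactly the same table from the lengths alone, which keeps the stored model tiny.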

  6. #6
    abachler
    Quote Originally Posted by Sebastiani
    Larger dictionaries often lead to BETTER compression, not the other way around.

    On real systems, the symbol length is generally limited to the largest register size the processor can natively handle - on x86 systems that's 32 bits. It's a trade-off, but one that sacrifices extremely little compression for very large performance gains: bitwise operations longer than 32 bits on a 32-bit machine take considerably longer than those that fit in 32 bits or fewer. Symbol lengths longer than 32 bits are rare and generally represent symbols that themselves occur very few times in the original data. In a worst-case scenario, a 33-bit symbol would represent data that occurred at most 3.0303% of the time, while one that took 32 bits would be at 3.125%, so you would lose at most 0.0009469% compression.

    The point being that the real gains in compression are to be had in the preprocessing stage - that is, massaging the data to be more compressible in the first place. This is how all modern compression works. To vastly over-simplify things: JPEG does it by reducing the color count, MPEG does it by reducing the sub-frame count, MP3 does it by reducing bandwidth, and ZIP and RAR both do it with RLE, symbol translation, and many other methods.

  7. #7
    Sebastiani
    That's really not how Huffman compression works, though. First of all, the encoded form of any particular byte can be any of a number of possible sizes, ranging from 1 to 255 bits, so things don't usually fall on some even boundary (e.g. 32 bits). For example, let's say the code for a certain symbol is 100000000000001. Encoding 3 such characters sends the 45-bit string 100000000000001100000000000001100000000000001 to the output stream. The next symbol might append 1 bit to that, and the one after it another 5 bits. The only time padding is inserted is at the very end of the compression process, to square things up on a byte boundary.
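
    To make that concrete, the output side can be as simple as the following sketch (BitWriter and its methods are just illustrative names): codes are appended bit by bit as they arrive, and the only padding is a single zero-fill at the very end.

        #include <cstdint>
        #include <vector>

        // Appends variable-length codewords to a byte stream, most significant
        // bit first.  No padding is inserted between codes; only flush() pads,
        // once, to finish the last byte.  (Sketch only; codes of up to 32 bits.)
        class BitWriter {
        public:
            void put(std::uint32_t code, int bits)   // write the low 'bits' bits of 'code'
            {
                for (int i = bits - 1; i >= 0; --i) {
                    cur_ = (cur_ << 1) | ((code >> i) & 1u);
                    if (++filled_ == 8) {
                        out_.push_back(static_cast<std::uint8_t>(cur_));
                        cur_ = 0;
                        filled_ = 0;
                    }
                }
            }
            void flush()                             // zero-pad the final partial byte
            {
                if (filled_ > 0) {
                    out_.push_back(static_cast<std::uint8_t>(cur_ << (8 - filled_)));
                    cur_ = 0;
                    filled_ = 0;
                }
            }
            const std::vector<std::uint8_t>& bytes() const { return out_; }
        private:
            std::vector<std::uint8_t> out_;
            std::uint32_t cur_ = 0;
            int filled_ = 0;
        };

    Writing the 15-bit code above three times with put(0x4001, 15) produces exactly the 45-bit string shown; flush() then adds 3 padding bits to finish the last byte.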

    Next, there is no "massaging" of the data involved at any point. To do so would amount to some other form of compression, obviously, which defeats the purpose altogether. Think of it this way: your encoder should assume that whatever preprocessing is needed to reduce the input size has already been applied to the data. Its only job is to produce Huffman codes - nothing else. So it's really just a matter of delegation. Even modern compressors don't expect the Huffman encoder to perform RLE on the input, or any other form of compression - those things are done prior to the Huffman phase, if necessary.
