![]() |
| | #1 | |
| Guest Join Date: Aug 2001
Posts: 4,923
| A Method for the Construction of Minimum-Redundancy Codes The challenge is to implement the algorithm in such a way that produces the smallest output file, and this will be the primary factor in deciding the winner. Bragging rights will also be rewarded to the fastest implementation, as well as the one requiring the least amount of overhead. There's no prize for being the winner, unfortunately, EXCEPT that if the implementation submitted produces smaller output files* than the solution that I will be posting as a benchmark (which won't be an official entry), the winner WILL receive a $50 Gift Certificate. Rules are as follows: Quote:
I've attached below the benchmark. It is, of course, compressed. Full source code will be provided at the end of the contest.Good luck. * Conditions/Restrictions: - The implementation must not favor any particular data format (such as text, for example) - The output file must consistently be more than four bytes smaller than the benchmark. That may sound like an odd requirement, but the reasoning is that the benchmark could have been designed to output four bytes less, but I opted to keep it for the fact that it improved error-detection significantly. Last edited by Sebastiani; 09-18-2009 at 10:38 PM. Reason: formatting | |
| Sebastiani is offline | |
| | #2 |
| and the Hat of Guessing Join Date: Nov 2007
Posts: 8,740
| So, if we don't know that the in_file is text, or null-terminated, how do we know how big it is? And how are we supposed to know that out_file points to enough space to write the output? Or are those parameters supposed to be FILE * instead? |
| tabstop is offline | |
| | #3 | |
| Guest Join Date: Aug 2001
Posts: 4,923
| Quote:
EDIT: and just to clarify by "make no assumptions about the size of the input" I mean don't structure your implementation in such a way that requires a certain maximum input size Last edited by Sebastiani; 09-18-2009 at 07:39 PM. | |
| Sebastiani is offline | |
| | #4 |
| and the Hat of Guessing Join Date: Nov 2007
Posts: 8,740
| So we're getting passed in the filenames then. Gotcha. Now where'd I put my discrete book.... |
| tabstop is offline | |
| | #5 |
| Guest Join Date: Aug 2001
Posts: 4,923
| |
| Sebastiani is offline | |
| | #6 |
| and the Hat of Guessing Join Date: Nov 2007
Posts: 8,740
| |
| tabstop is offline | |
| | #7 |
| Rampaging 35 Stone Welsh Join Date: Apr 2007
Posts: 2,924
| define non-huffman, if 99% of the compression takes place using non-huffman methods and then I use huffman to preprocess or post-process the data does that count as a huffman method, or are we limited to only those methods specifically discussed in his paper?
__________________ He is free, you say. Ah! That is his misfortune… These men… [have] the most terrible, the most imperious of masters, that is, need. … They must therefore find someone to hire them, or die of hunger. Is that to be free? - Simon Linguet |
| abachler is offline | |
| | #8 |
| Guest Join Date: Aug 2001
Posts: 4,923
| The output basically consists of two things: (1) the compression model and (2) the actual data that has been compressed. The former doesn't matter at all - you can use whatever format you want. The latter, however, must only use Huffman encoding on the symbols. That is to say, you can't use arithmetic coding to compress the data, obviously. And just to be clear, Huffman coding does not dictate what particular symbol gets mapped to which code - it's completely arbitrary. All that matters is that the code is unambiguous, and 'minimally redundant'. |
| Sebastiani is offline | |
| | #9 | |
| Rampaging 35 Stone Welsh Join Date: Apr 2007
Posts: 2,924
| Quote:
Now we will actually need to know the nature of the input data because the preprocessor used depends on knowing what data is being compressed.
__________________ He is free, you say. Ah! That is his misfortune… These men… [have] the most terrible, the most imperious of masters, that is, need. … They must therefore find someone to hire them, or die of hunger. Is that to be free? - Simon Linguet | |
| abachler is offline | |
| | #10 | |
| Guest Join Date: Aug 2001
Posts: 4,923
| Quote:
EDIT: And just to be clear, this means you can't even apply something as simple as RLE on the data, as that is a form of compression. The two areas you should be concentrating on, then, are: - Identifying the most efficient encoding of a given symbol. Many implementations don't address this correctly and yield larger than necessary codewords. - Choosing an efficient format for the compression model/statistics. This is also an area that is often neglected. Last edited by Sebastiani; 09-18-2009 at 10:49 PM. | |
| Sebastiani is offline | |
| | #11 |
| Rampaging 35 Stone Welsh Join Date: Apr 2007
Posts: 2,924
| So as long as its not compressed in the preprocessing stage, we can do anything we want, like convert it form one format to another? Because if we are limited to only performing Huffman, I don't really see what the point of the contest is, because Huffman is a set algorithm.
__________________ He is free, you say. Ah! That is his misfortune… These men… [have] the most terrible, the most imperious of masters, that is, need. … They must therefore find someone to hire them, or die of hunger. Is that to be free? - Simon Linguet |
| abachler is offline | |
| | #12 | |
| Guest Join Date: Aug 2001
Posts: 4,923
| Quote:
The second problem is that, strange as it may sound, many implementations don't actually select the shortest codeword for a given symbol. The reasoning is that by doing so you can guarantee that the longest symbol doesn't exceed some maximum number of bits, such as 8 (compared with of 255), which also allows you to make all sort of optimizations, as well. Nothing wrong with that approach, I guess, if you can get it to perform well, but from my experience this is not generally the case. I'd be interested to find out if this is just because it isn't being implemented correctly, just an accepted trade off, or some sort of flaw in the threory itself? Not sure. So the whole point of this exercise is to explore the fundamental limits of the method, and see just how far it can be improved, really. | |
| Sebastiani is offline | |
| | #13 | |
| Rampaging 35 Stone Welsh Join Date: Apr 2007
Posts: 2,924
| Quote:
On real systems, the symbol length is generally limited to the largest register size the processor can natively handle. On x86 systems thats 32 bits. It's a tradeoff, but one that sacrifices extremely little compression for very large performance gains. bitwise operations longer than 32 bits on a 32 bit machine take exponentially longer than those that are 32 bits or smaller. symbol lengths longer than 32 are rare and infrequent and generally represent symbols that themselves occur very few times in the original data. A worse case scenario a 33 bit symbol would represent data that occurred at a maximum 3.0303% of the time, while one that took 32 bits would be 3.125% so you would lose at most 0.0009469% compression. The point being that the real gains in compression are to be had in the preprocessing section, that is massaging the data to be more compressible in teh first place, this is how all modern compression works. To vastly over-simplify things - JPEG does t by reducing the color count, MPEG does it by reducing the sub-frame count, MP3 does it by reducing bandwidth, ZIP and RAR both do it by RLE, symbol translation, and many other methods.
__________________ He is free, you say. Ah! That is his misfortune… These men… [have] the most terrible, the most imperious of masters, that is, need. … They must therefore find someone to hire them, or die of hunger. Is that to be free? - Simon Linguet | |
| abachler is offline | |
| | #14 |
| Guest Join Date: Aug 2001
Posts: 4,923
| That's really not how Huffman compression works, though. First of all, the encoded symbol of any particular byte being a number of possible sizes, ranging from 1 to 255 bits, means that things don't usually fall on some even boundary (eg: 32 bits, say). For example, let's say the code for a certain symbol is 100000000000001. Encoding 3 characters sends the 35-bit string 100000000000001100000000000001100000000000001 to the output stream. The next symbol might append 1 bit to that, another 5 bits. The only time padding is inserted is at the very end of the compression process to square things up on a byte boundary. Next, there is no "messaging" of the data involved at any point. To do so would equate to some other form of compression, obviously, which defeats the purpose altogether. Think of it this way: your encoder should assume that whatever preprocessing is needed to reduce the input size has already been applied to the data. It's only job is to produce Huffman codes - nothing else. So it's really just a matter of delegation. Even modern compressors don't expect the Huffman encoder to preform RLE on the input, or any other form of compression - those things are done prior to sending it to to the Huffman phase, if necessary. |
| Sebastiani is offline | |
| | #15 | |
| Rampaging 35 Stone Welsh Join Date: Apr 2007
Posts: 2,924
| Quote:
__________________ He is free, you say. Ah! That is his misfortune… These men… [have] the most terrible, the most imperious of masters, that is, need. … They must therefore find someone to hire them, or die of hunger. Is that to be free? - Simon Linguet | |
| abachler is offline | |
![]() |
| Thread Tools | |
| Display Modes | |
|