A Method for the Construction of Minimum-Redundancy Codes

**Sebastiani** · 09-18-2009

This, of course, was the title of David A. Huffman's now-famous term paper describing one of the most important compression models ever. And even as some algorithms available today far exceed it's capabilities, it nonetheless remains a crucial achievement in the field of information theory that cannot be ignored. Many modern encoders even use Huffman encoding as a preprocessing step due to it's simplicity and general performance characteristics.

The challenge is to implement the algorithm in such a way that produces the smallest output file, and this will be the primary factor in deciding the winner. Bragging rights will also be rewarded to the fastest implementation, as well as the one requiring the least amount of overhead.

There's no prize for being the winner, unfortunately, EXCEPT that if the implementation submitted produces smaller output files* than the solution that I will be posting as a benchmark (which won't be an official entry), the winner WILL receive a $50 Gift Certificate.

Rules are as follows:

(1) The entry can be written in any language as long as:
- The source is provided, obviously
- An entry point to a C-style function with the following name and signature is provided:

void encode( const char* in_file, const char* out_file, int mode_is_compress );

The function should read data from in_file and write the output to out_file, compressing the data if mode_is_compress is non-zero, otherwise decompress.

(2) No non-constant global variable are allowed. All of the various data structures needed to process the data must be created and destroyed within the entry point. This restriction applies to external files as well.

(3) No assumptions should be made about the format of an uncompressed file. It may contain any byte value ranging from 0 to 255.

(4) No assumptions should be made about the size of an input file, except that it will some value less than 2^32.

(5) An implementation that crashes during the test will be disqualified.

(6) No third party libraries will be allowed. Standard libraries are fine unless it is determined that they result in some 'unfair' advantage. If in doubt, just ask.

(7) Submissions must be original. No plagerized works allowed.

(8) No entries employing non-Huffman compression schemes will be allowed.

(9) Entries shall be submitted via PM.

At this point, there isn't a formal deadline for submissions. I'm going to give it about a week or so to see who is going to participate, and at that point we can take an informal vote to determine what the deadline should be, based on everyone's schedules.

I've attached below the benchmark. It is, of course, compressed.

Full source code will be provided at the end of the contest.

Good luck.

* Conditions/Restrictions:
- The implementation must not favor any particular data format (such as text, for example)
- The output file must consistently be more than four bytes smaller than the benchmark. That may sound like an odd requirement, but the reasoning is that the benchmark could have been designed to output four bytes less, but I opted to keep it for the fact that it improved error-detection significantly.

**tabstop** · 09-18-2009

So, if we don't know that the in_file is text, or null-terminated, how do we know how big it is? And how are we supposed to know that out_file points to enough space to write the output? Or are those parameters supposed to be FILE * instead?

**Sebastiani** · 09-18-2009

Originally Posted by tabstop

So, if we don't know that the in_file is text, or null-terminated, how do we know how big it is? And how are we supposed to know that out_file points to enough space to write the output? Or are those parameters supposed to be FILE * instead?

The input file can be queried with something like 'stat', or opening it and seeking the end, etc. The output file won't exist until your function creates it. The reason I didn't opt for using FILE* was just to be flexible, eg: in case you want to use std streams instead, for example.

EDIT: and just to clarify by "make no assumptions about the size of the input" I mean don't structure your implementation in such a way that requires a certain maximum input size

**hcm2009** · 10-12-2009

Originally Posted by Sebastiani

The input file can be queried with something like 'stat', or opening it and seeking the end, etc. The output file won't exist until your function creates it. The reason I didn't opt for using FILE* was just to be flexible, eg: in case you want to use std streams instead, for example.

EDIT: and just to clarify by "make no assumptions about the size of the input" I mean don't structure your implementation in such a way that requires a certain maximum input size

Can you give some more samples? How great your info is! It really useful for me. Thanks.
________________
Anime Watching

**tabstop** · 09-18-2009

So we're getting passed in the filenames then. Gotcha. Now where'd I put my discrete book....

**Sebastiani** · 09-18-2009

Originally Posted by tabstop

So we're getting passed in the filenames then. Gotcha. Now where'd I put my discrete book....

Right. Sorry, I should have been more clear on that.

**tabstop** · 09-18-2009

Originally Posted by Sebastiani

Right. Sorry, I should have been more clear on that.

Eh, const char * out_file should have been a real real real real real big clue.

**abachler** · 09-18-2009

define non-huffman, if 99% of the compression takes place using non-huffman methods and then I use huffman to preprocess or post-process the data does that count as a huffman method, or are we limited to only those methods specifically discussed in his paper?

**Sebastiani** · 09-18-2009

Originally Posted by abachler

define non-huffman, if 99% of the compression takes place using non-huffman methods and then I use huffman to preprocess or post-process the data does that count as a huffman method, or are we limited to only those methods specifically discussed in his paper?

The output basically consists of two things: (1) the compression model and (2) the actual data that has been compressed. The former doesn't matter at all - you can use whatever format you want. The latter, however, must only use Huffman encoding on the symbols. That is to say, you can't use arithmetic coding to compress the data, obviously. And just to be clear, Huffman coding does not dictate what particular symbol gets mapped to which code - it's completely arbitrary. All that matters is that the code is unambiguous, and 'minimally redundant'.

**Sebastiani** · 09-22-2009

Okay, well, it looks like we don't have any contestants at this point, so...for now, at least, I'm officially cancelling the contest.

If anyone's interested in reopening it, just send me a PM.

Thread: A Method for the Construction of Minimum-Redundancy Codes

Thread Tools

Search Thread

Display

Hybrid View

A Method for the Construction of Minimum-Redundancy Codes