Thread: Text Compression Contest

  1. #1
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195

    Text Compression Contest

    OK, this is similar to sebastiani's contest, except there are no rules on what type of compression you use. The target is a txt file in ascii format.

    Best compression within a reasonable time limit wins.

    Implement your compression as a function that accepts a pointer to a zero terminated string consisting of the entire text of War and Peace as input and a dereferenced pointer (pointer to a pointer) where it should place the output. The function must also return the number of bytes used in the output. Below is an example declaration -

    Code:
    unsigned long abachler_compress(char* Input , unsigned char** Ouput);
    Implement your decompressor as a function that takes a pointer to an array containing the compressed data, the size of the array, and returns a pointer to the decompressed output, which must be zero terminated. An example is below -
    Code:
    char* abachler_decompress(unsigned char* Input , unsigned long InputSize);
    Rules -
    Must be written in C++, C code is fine as long as it compiles as C++.
    No use of global variables.
    No use of the file system.
    All code must be yours, no use of 3rd party compression libraries or API calls.
    If your function crashes the application for any reason you lose.
    Inline assembly is fine as long as it does not jump outside the assembly block.
    Runtime is limited to 1 hour maximum on a Pentium 4 3.2 GHz with HT and 2GB ram.
    You may use threads.
    All submissions must be made by 23:59CST (-6 GMT) on 9/30/2009

    The following is the main() function that will test the application I highly suggest you use this exact code for testing as no modifications will be made to accommodate anyones code. -

    Code:
    #include <malloc.h>
    #include <stdio.h>
    
    #define SIZE_OF_WAR_AND_PEACE 0 // will be actual size of the file
    
    unsigned long abachler_compress(char* Input , unsigned char** Ouput);
    char* abachler_decompress(unsigned char* Input , unsigned long InputSize);
    
    int main(){
    
    	FILE* pFile = fopen("warandpeace.txt" , "r+b");
    
    	char* pRandomVariableName = (char*)malloc(SIZE_OF_WAR_AND_PEACE);
    	fread(pRandomVariableName , 1 , SIZE_OF_WAR_AND_PEACE , pFile);
    	fclose(pFile);
    	
    	unsigned char* pCompressed;
    	unsigned long DataSize = abachler_compress(pRandomVariableName , &pCompressed);
    
    	// compressed data may be copied to a new array at this point
    
    	char* pDecompressed = abachler_decompress(pCompressed , DataSize);
    		
    	return 0;
    	}
    Last edited by abachler; 09-23-2009 at 12:36 AM.

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by abachler
    All code must be yours, no use of 3rd party compression libraries or API calls.
    You might want to clarify if use of the standard library is allowed, and if use of libraries unrelated to compression (e.g., a library for threads) is allowed as long as the relevant library files are provided (and presumably the application is rejected if a reasonable attempt to compile it fails).

    Quote Originally Posted by abachler
    The following is the main() function that will test the application I highly suggest you use this exact code for testing as no modifications will be made to accommodate anyones code.
    I suggest that you change <malloc.h> to <stdlib.h> since <malloc.h> is non-standard/pre-standard. The Comeau online compiler is an example of a rather standard conformant C++ compiler that does not accept your program due to a missing header file.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Sep 2009
    Posts
    63
    Very interesting. Even if I don't submit anything, I do think this will be a neat exercise.

  4. #4
    'Allo, 'Allo, Allo
    Join Date
    Apr 2008
    Posts
    639
    Quote Originally Posted by abachler View Post
    no modifications will be made to accommodate anyones code.
    Well, with the leak intact...

    Code:
    unsigned int compress(char* text, char** compressed)
    {
        *compressed = text;
        return sizeof(*compressed);
    }
    
    char* decompress(char* compressed, unsigned int /*len*/)
    {
        return compressed;
    }

  5. #5
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by adeyblue View Post
    Well, with the leak intact...
    There's no leak, windows implicitly frees the memory when the process terminates.

    While your entry is technically valid, I'm not even going to bother running it since it will lose to any non-smartass entry.
    It also fails to decompress the data properly. The last time i checked War and Peace was longer than 4 bytes.
    Consider yourself disqualified.

    Quote Originally Posted by laserlight View Post
    I suggest that you change <malloc.h> to <stdlib.h> since <malloc.h> is non-standard/pre-standard. The Comeau online compiler is an example of a rather standard conformant C++ compiler that does not accept your program due to a missing header file.
    malloc.h is present on my system, and since you are not allowed to modify main() in any case it is irrelevant. If you require specific headers you need to include them in your submission. You can use any library you want with the caveat that I will not be linking OBJ LIB or DLL files to the project, so you either have to include something that comes standard with Visual Studio 2008 Express or figure out a way to include the source code for the libraries in your submission.

    I would suggest you focus on the contest and not trying to nitpick the rules.
    Last edited by abachler; 09-24-2009 at 12:00 AM.

  6. #6
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    The rules have a gaping hole, in that they don't forbid the use of the actual text of War and Peace. I could embed a reference text in my program, and output the diff between that text and the input. If I got lucky and used exactly the same text you used (from Gutenberg, right?) I might have an output size of 0 bytes while remaining within the rules.

    It's actually amazing that you selected that text, because it's the same text I use as a mid-size corpus for language processing experiments. I have a LOT of experience with that text, and I know its statistics and idiosyncracies
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  7. #7
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by brewbuck View Post
    The rules have a gaping hole, in that they don't forbid the use of the actual text of War and Peace. I could embed a reference text in my program, and output the diff between that text and the input. If I got lucky and used exactly the same text you used (from Gutenberg, right?) I might have an output size of 0 bytes while remaining within the rules.

    It's actually amazing that you selected that text, because it's the same text I use as a mid-size corpus for language processing experiments. I have a LOT of experience with that text, and I know its statistics and idiosyncracies
    The gaping hole runs smack into cboards upload limit of course...
    so no gaping hole. I also didn't specify what language it is in.

  8. #8
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    Since it's an ASCII file I would hazard a guess it is English.

  9. #9
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Quote Originally Posted by abachler View Post
    The gaping hole runs smack into cboards upload limit of course...
    so no gaping hole. I also didn't specify what language it is in.
    I think it would be a good idea measure the size of BOTH the compressed text and the decompresser, like the calgary challenge. That would eliminate these kind of gaping holes.

  10. #10
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by whiteflags View Post
    Since it's an ASCII file I would hazard a guess it is English.
    Don't bet on it, it could be in Klingon for all you know.

  11. #11
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by abachler
    malloc.h is present on my system, and since you are not allowed to modify main() in any case it is irrelevant.
    It is relevant because you are going to test with that code and you "highly suggest (submitters) use this exact code for testing as no modifications will be made to accommodate anyones code" (emphasis mine). Of course, it is only a theoretical problem since one should be able to safely change <malloc.h> to <stdlib.h> without fear that the code will work on their system but not on yours.

    Quote Originally Posted by abachler
    I would suggest you focus on the contest and not trying to nitpick the rules.
    Hey, I am just trying to help you improve your contest rules
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  12. #12
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by laserlight View Post
    Hey, I am just trying to help you improve your contest rules
    There's nothing wrong with the rules as they are. No professional compiler would be so anal as to NOT include malloc.h if only for compatibility with pre-standard code. If your implementation doesn't support malloc.h then too bad, it's not my problem. As I said, I highly SUGGEST, you are free to do whatever you want, if it fails to run it fails to run. But, as laserlight pointed out, it is unlikely that any implementation would run with stdlib and fail to run with malloc. Since malloc is used in the main function, not in your function, you are free to include whatever headers you desire, within the limits of the contest.

    As a sign of good faith, I will provide a very simple compressor. Obviously this doesn't meet the requirements of the contest, as I rolle dthe whole thing into main. it also suffers from poor compression because of a known weakness, that is each letter gets compressed by 2 bits, but every non letter, specifically spaces, eat an extra 6 bits. Overall it saved only 60KB out fo a 3MB file. Like I said its very crude compression.

    Code:
    int main(){
    	char* pBuffer = (char*)malloc(1048576);
    
    	FILE* pFile = fopen("e:\\warandpeace.txt" , "r+b");
    	long BytesRead = -1;
    	unsigned long ByteCount[256];
    	for(int x = 0;x<256;x++) ByteCount[x] = 0;
    	unsigned long TotalBytes = 0;
    	
    	while(BytesRead){
    		BytesRead = fread(pBuffer , 1 , 1048576 , pFile);
    		if(BytesRead) for(int x=0;x<BytesRead;x++) ByteCount[pBuffer[x]]++;
    		TotalBytes += BytesRead;
    		}
    
    	fclose(pFile);
    
    	
    	pFile = 0;
    	pFile = fopen("e:\\ByteCount.txt" , "w+b");
    	long Index = 0;
    	for(int x = 0;x<256;x++){
    		fprintf(pFile , "%2.2x - %d%c%c" , x , ByteCount[x] , 13 , 10);
    		if(ByteCount[x]) Index++;
    		}
    	fprintf(pFile , "%c%cTotal Symbols Used %d%c%c" , 13 , 10 , Index , 13 , 10);
    	fclose(pFile);
    
    	pBuffer = (char*)realloc(pBuffer , TotalBytes);
    
    	pFile = fopen("e:\\warandpeace.txt" , "r+b");
    	fread(pBuffer , 1 , TotalBytes , pFile);
    	fclose(pFile);
    
    	char* pOutput = (char*)malloc(TotalBytes);
    	unsigned long BitCache = 0;
    	long BitIndex = 0;
    	Index = 0;
    
    	pFile = fopen("e:\\warandpeace.paq" , "w+b");
    	for(int x = 0;x<TotalBytes;x++){
    		if((pBuffer[x] > 64) && (pBuffer[x] < 128)){
    			BitCache *= 64;
    			BitCache = BitCache & 0xFFFFFFC4;
    			BitCache = BitCache | (pBuffer[x] & 63);
    			BitIndex += 6;
    			} else {
    			BitCache *= 64;
    			BitCache = BitCache & 0xFFFFFFC4;
    			BitIndex += 6;
    
    			BitCache *= 256;
    			BitCache = BitCache & 0xFFFFFF00;
    			BitCache = BitCache | pBuffer[x];
    			BitIndex += 8;
    			}
    
    		while(BitIndex > 7){
    			pOutput[Index] = BitCache & 0xFF;
    			BitCache /= 256;
    			BitCache = BitCache & 0x00FFFFFF;
    			BitIndex -= 8;
    			Index++;
    			}
    		}
    
    	while(BitIndex > 0){
    		pOutput[Index] = BitCache & 0xFF;
    		BitCache /= 256;
    		BitCache = BitCache & 0x00FFFFFF;
    		BitIndex -= 8;
    		Index++;
    		}
    	
    	fwrite(&TotalBytes , sizeof(unsigned long) , 1 , pFile);
    	fwrite(&Index , sizeof(unsigned long) , 1 , pFile);
    	fwrite(pOutput , 1 , Index , pFile);
    
    	fclose(pFile);
    
    	return 0;
    	}
    Last edited by abachler; 09-25-2009 at 11:41 AM.

  13. #13
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    And of course, this is a "This is my contest. Take it or leave it. No comments or suggestions accepted nor welcomed" contest.

  14. #14
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    I've always wanted to make one that used a RNG to compress. The file would basically be just a set of 265 number pairs. The seed, and the skip. The seed would be the seed you need to use for the RNG to produce the required set, and the skip would be how far down that seed you need to skip before you can start using it. And also, the final size of the output file in bytes.

    It would basically look like this:
    Code:
    type seedset[2][265];
    for each of 256
        read pairs into seedset
    
    for each of 256
        seed( seedset[0][x] )
        skip( seedset[1][x] ) // basically a loop to eat rand calls
        output x (where x is the current number 0 to 255 ) at rand()
        // say we're on loop 'a' ... here we skip, then we start outputting 'a' at rand()
        // position from where we last put an 'a' (or from byte 0 in the case of the first call)
    In theory (monkeys and typewriters) it should be killer. You'd just need a quantum computer to calculate the 256 pairs you'd need to generate the correct output file.

    It's still a fun idea.


    Quzah.
    Last edited by quzah; 09-28-2009 at 03:38 PM.
    Hope is the first step on the road to disappointment.

  15. #15
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by quzah View Post
    I've always wanted to make one that used a RNG to compress. The file would basically be just a set of 265 number pairs. The seed, and the skip. The seed would be the seed you need to use for the RNG to produce the required set, and the skip would be how far down that seed you need to skip before you can start using it. And also, the final size of the output file in bytes.

    It's still a fun idea.


    Quzah.
    So basically a bible code compressor using a an RNG with a run length of at least the factorial of 256^ length of the book. Yeah I'm thinking the seed would probably exceed the size of the original file. But it's a good idea anyway.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Detecting Edit Control Text
    By yoonkwun in forum Windows Programming
    Replies: 2
    Last Post: 07-12-2009, 08:48 AM
  2. DirectX | Drawing text
    By gavra in forum Game Programming
    Replies: 4
    Last Post: 06-08-2009, 12:23 AM
  3. Small HTML question
    By Thantos in forum Tech Board
    Replies: 4
    Last Post: 12-29-2003, 12:37 AM
  4. WANTED: Contest Master
    By kermi3 in forum A Brief History of Cprogramming.com
    Replies: 15
    Last Post: 01-23-2003, 10:15 PM
  5. Ok, Structs, I need help I am not familiar with them
    By incognito in forum C++ Programming
    Replies: 7
    Last Post: 06-29-2002, 09:45 PM