Thread: Text Compression Contest

  1. #61
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Sebastiani View Post
    Following that logic, why not just store an index into all of the books ever published up to 2009? That could be done in just a handful of bytes. Or better yet, just store the whole book. Then you could achieve perfect compression. The point is, you're not making the all-important distinction between algorithm and data.
    A short dictionary of common words, like I mentioned in a previous post, is 50-100k. Notice that War and Peace is 30-60 times that size. So, "algorithmically speaking", what I suggested makes sense. You are just being silly.

    I do make the "all-important" distinction. I have identified the data as data that is "encoded" in the English language, and there are many millions, even billions, of documents to which this could apply. There are surely more documents written in English than all of the computer programs ever written, and that ratio will be drastic. I am not saying I will just make my algorithm == the data. That is like saying the dictionary == Shakespeare, which is ridiculous.

    Programmers assume 8-bit bytes simply because that's the conventional size used on most machines. Even so, internally they may analyze and process them in various bit-widths. At any rate, a true compression algorithm (including gzip) can be easily modified to, say, work with 10-bit bytes (in fact the Huffman compressor I posted the other day allows you to do just that without even altering the source code; it's a template parameter).
    Keyword: modified. As I said, you assume the status quo ("conventional size") and you are right to do so! By the same reasoning, I am assuming English is, in fact, a standardized, conventional language. SAME reasoning.

    Anyway, it would be compressed. The average word is 5 characters, 6 bytes counting the trailing space. A short int index is two bytes, plus I think one byte for state information -- that's 3 bytes instead of 6, i.e. 50%, and with the state byte optimized away, longer words push it toward 75%. COMPRESSION.
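
    A minimal sketch of that per-word encoding, assuming a prebuilt table of common words and a hypothetical word_index() lookup (the names and the tiny table here are just for illustration, not actual contest code):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* Hypothetical prebuilt dictionary of common English words. */
        static const char *dict[] = { "the", "and", "war", "peace", "always" };
        enum { DICT_SIZE = sizeof dict / sizeof dict[0] };

        /* Return a word's 16-bit index, or -1 if it is not in the dictionary. */
        static int word_index(const char *w)
        {
            for (int i = 0; i < DICT_SIZE; i++)
                if (strcmp(dict[i], w) == 0)
                    return i;
            return -1;
        }

        int main(void)
        {
            const char *word = "peace";       /* 5 chars + space = 6 bytes of text */
            int ix = word_index(word);
            if (ix >= 0) {
                uint16_t code = (uint16_t)ix; /* the whole word becomes 2 bytes */
                fwrite(&code, sizeof code, 1, stdout);
            }
            return 0;
        }
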
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  2. #62
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    There's no need to hard-code the dictionary; just build it at run time from the actual words and phrases present in the document.
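
    A rough sketch of that run-time approach, assuming whitespace/punctuation-separated words folded to lower case; the fixed-size open-addressing table and all the names here are mine, not anything from the contest:

        #include <ctype.h>
        #include <string.h>

        #define TABLE_SIZE 65536   /* power of two; assumes far fewer distinct words */
        #define MAX_WORD   64

        struct entry { char word[MAX_WORD]; unsigned count; };
        static struct entry table[TABLE_SIZE];

        static unsigned hash(const char *s)
        {
            unsigned h = 5381;
            while (*s) h = h * 33 + (unsigned char)*s++;
            return h & (TABLE_SIZE - 1);
        }

        /* Insert a word, or bump its count if it is already present. */
        static void add_word(const char *w)
        {
            unsigned i = hash(w);
            while (table[i].count && strcmp(table[i].word, w) != 0)
                i = (i + 1) & (TABLE_SIZE - 1);   /* linear probing */
            if (!table[i].count)
                strncpy(table[i].word, w, MAX_WORD - 1);
            table[i].count++;
        }

        /* First pass: collect the words the document actually uses. */
        static void build_dictionary(const char *text)
        {
            char w[MAX_WORD];
            int n = 0;
            for (const char *p = text; ; p++) {
                if (isalpha((unsigned char)*p) && n < MAX_WORD - 1) {
                    w[n++] = (char)tolower((unsigned char)*p);
                } else if (n > 0) {
                    w[n] = '\0';
                    add_word(w);
                    n = 0;
                }
                if (!*p) break;
            }
        }

    A second pass could then sort the entries by count and hand the shortest codes to the most frequent words.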

  3. #63
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by abachler View Post
    There's no need to hard-code the dictionary; just build it at run time from the actual words and phrases present in the document.
    Well, but this way I shave off some KB and some analysis. Is there much point in identifying patterns like this:

    they
    it
    always
    also
    tumor
    satisfactory

    I mean, if you can assume there are 8 bits in a byte, you might as well assume an English-language document is in English.

    Then you just identify words/patterns that repeat but are not pre-defined ("Dracula", "Cervantes", "muttonhead", "royale", "nOOb", etc.). This may be a dead end (as Mario pointed out, it is all done by now! none of you will come up with anything new that is better than what already exists either), but it is not invalid.
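
    For what it's worth, here is one sketch of how those not-pre-defined words could still travel with the output: reserve an escape byte (0xFF is my arbitrary pick) that says "a raw literal follows". Indexes are written high byte first and capped so they can never start with the escape value:

        #include <stdio.h>
        #include <string.h>

        #define ESCAPE 0xFF   /* reserved marker: "a literal word follows" */

        /* Emit a 2-byte dictionary index, or an escaped literal for
           out-of-dictionary words ("Dracula", "nOOb", ...). */
        static void emit_word(FILE *out, const char *w, int dict_ix)
        {
            if (dict_ix >= 0 && dict_ix < 0xFF00) {  /* high byte never 0xFF */
                fputc(dict_ix >> 8, out);
                fputc(dict_ix & 0xFF, out);
            } else {
                size_t len = strlen(w);              /* assumes words < 256 chars */
                fputc(ESCAPE, out);
                fputc((int)len, out);
                fwrite(w, 1, len, out);
            }
        }
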

    Besides, I was going to pull the dictionary from War & Peace anyway but now I will just sulk.
    Last edited by MK27; 10-02-2009 at 04:21 PM.

  4. #64
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> A short dictionary of common words, like I mentioned in a previous post, is 50-100k. Notice that War and Peace is 30-60 times that size. So, "algorithmically speaking", what I suggested makes sense. You are just being silly.

    I don't see how it's silly that such a distinction between code and data should be drawn. It makes perfect sense to me.

    >> I do make the "all-important" distinction. I have identified the data as data that is "encoded" in the English language, and there are many millions, even billions, of documents to which this could apply.

    Fine, but that's still not compression - it's just "mapping words to indexes" and vice versa.

    >> There are surely more documents written in English than all of the computer programs ever written, and that ratio will be drastic. I am not saying I will just make my algorithm == the data. That is like saying the dictionary == Shakespeare, which is ridiculous.

    All I'm saying is that your program should work on anything. Executables, plain text, XML, images - whatever. Raw data. No assumptions. No predefined data. No artificial constraints. No hints. That's my definition of a true compression algorithm. Anything else is merely a curiosity, IMO.

    >> Keyword: modified. As I said, you assume the status quo ("conventional size") and you are right to do so! By the same reasoning, I am assuming English is, in fact, a standardized, conventional language. SAME reasoning.

    Yes, modified - once (and anyway, this is just an artifact of short-sighted design - it could have just as easily been done that way from the very beginning). So I really don't see how the two compare at all.

  5. #65
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Except the document isn't guaranteed to be in English; as I previously stated, it could be in Klingon or Esperanto for all you know.

  6. #66
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    I think it is a "true" compression algorithm. It's just that, for non-English content, the compressed size will be slightly larger than the uncompressed size. But that applies to all algorithms. There is always input whose compressed output will be larger, simply because you cannot represent all possible combinations of N bits with less than N bits. All compression algorithms need to make assumptions about the content. If the content matches the assumption, the compressed size will be smaller; otherwise, bigger.
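
    To spell that counting argument out: there are 2^N distinct inputs of N bits, but only

        sum(k = 0 .. N-1) 2^k  =  2^N - 1

    distinct outputs shorter than N bits, so any lossless compressor that shrinks even one input must expand at least one other.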

  7. #67
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> I mean, if you can assume there are 8 bits in a byte, you might as well assume an English-language document is in English.

    Again, the assumption there is completely unnecessary.

    >> as Mario pointed out, it is all done by now! none of you will come up with anything new that is better than what already exists either

    That's like saying there is nothing left to be discovered in physics - it's simply not true. In fact, when I was working on the Huffman algorithm I discovered a new technique for transmitting the compression model that reduced the lower bound from 256 bytes to just 3! Granted, that's just an improvement on a very old algorithm, but the point is, there will always be as-yet undiscovered techniques.
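
    One well-known direction along those lines -- not necessarily what Sebastiani did -- is that a canonical Huffman decoder can rebuild the whole code from the code lengths alone, so the header can shrink from a full frequency table to a packed length array. A sketch, assuming every length fits in 4 bits:

        #include <stdio.h>

        /* Write 256 Huffman code lengths, two 4-bit lengths per byte:
           128 bytes of model instead of 256+, before any run-length
           coding of the zeros shrinks it further. */
        static void write_model(FILE *out, const unsigned char len[256])
        {
            for (int i = 0; i < 256; i += 2)
                fputc((len[i] << 4) | (len[i + 1] & 0x0F), out);
        }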

  8. #68
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> But that applies to all algorithms. There is always input whose compressed output will be larger, simply because you cannot represent all possible combinations of N bits with less than N bits.

    Sure, for any single algorithm, that is of course true.

    >> All compression algorithms need to make assumptions about the content. If the content matches the assumption, the compressed size will be smaller; otherwise, bigger.

    Assumptions, no. Optimized for a particular set of input, absolutely.

  9. #69
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by Sebastiani View Post
    That's like saying there is nothing left to be discovered in physics - it's simply not true.
    What I meant, in fact, is that it's highly unlikely anyone in here will come up with anything new and meaningful, not that it can't be done. It may be possible; just not here and not in this contest. So we agree, but I felt I needed that to be clear.

    Meanwhile, the -- already annoying -- dictionary approach has a few problems like the ones you describe. What makes it particularly annoying in this contest is that the contest rules specifically say we can't access the file system. So:

    - The dictionary would have to either travel with the compressed data (and be lost when decompressed) or remain in memory for both operations. Because we don't even have access to the caller (in this case, main), we have to build the dictionary twice, once in each function, or else carry it with the compressed data. Crude. (See the sketch after this list.)

    - I agree with MK27 on one thing, though: there's no requirement for a general-purpose compression algorithm, only to achieve the highest compression possible (in the fastest way) with our own code. We can make no assumptions about the language of the text... but we do know it's an ASCII file.
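
    A hedged sketch of the "carry it with the compressed data" option from the first bullet above: write the dictionary as a length-prefixed header at the front of the output, so decompression can rebuild it without touching the file system (the layout is my own illustration, not from the contest rules):

        #include <stdint.h>
        #include <string.h>

        /* Serialize the dictionary at the front of the output buffer:
           [u16 word count][len][word bytes]...  Returns bytes written. */
        static size_t write_dict_header(unsigned char *out,
                                        const char **words, uint16_t n)
        {
            size_t pos = 0;
            out[pos++] = (unsigned char)(n >> 8);   /* count, big-endian */
            out[pos++] = (unsigned char)(n & 0xFF);
            for (uint16_t i = 0; i < n; i++) {
                size_t len = strlen(words[i]);
                out[pos++] = (unsigned char)len;    /* assumes words < 256 chars */
                memcpy(out + pos, words[i], len);
                pos += len;
            }
            return pos;   /* the compressed word indexes follow from here */
        }
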
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  10. #70
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> What I meant, in fact, is that it's highly unlikely anyone in here will come up with anything new and meaningful, not that it can't be done. It may be possible; just not here and not in this contest. So we agree, but I felt I needed that to be clear.

    Maybe. But then again...

    Seriously, though, even without inventing a new algorithm, an improvement upon an existing one is a worthwhile endeavor, in my book.

  11. #71
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Mario F. View Post
    We can make no assumptions about the language of the text... but we do know it's an ASCII file.
    Good point; although I don't know how many languages (besides English) it could be and still be an ASCII file... so someone could gamble and HARDCODE a dictionary.
    Last edited by MK27; 10-04-2009 at 04:24 PM.

  12. #72
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by MK27 View Post
    Good point; although I don't know how many languages (besides English) it could be and still be an ASCII file... so someone could gamble and HARDCODE a dictionary.
    Most languages have a Latinized transliteration.
    It's not English; that's all I'll say.
    Last edited by abachler; 10-04-2009 at 05:42 PM.
