Thread: Text Compression Contest

  1. #46
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by bithub View Post
    You guys are both right, yet are completely misunderstanding each other.

    Quzah's technique will work. It will not work with completely arbitrary, random data. But it will work with data that follows some sort of pattern. The trick is figuring out what the pattern is.

    Abachler is saying that Quzah's technique will not work with completely arbitrary, random data. This is true, because by definition, random data doesn't have a pattern. Can anything be completely random though?

    See, you're both right.
    That's exactly the point I was trying to make. Specifically, when I simply posted:

    "Numbers are only random because we cannot see a connection between them."

If you CAN express a connection between them, and saying that "HEY! Every N is X!" takes up less room in the file than actually leaving all of them there, then that's compression.

    Thank you. That's exactly what I've been trying to say this entire time!


    Quzah.
    Hope is the first step on the road to disappointment.

  2. #47
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
The challenge was to do with War and Peace, which probably contains <50000 different words. AFAICT from just googling around, of course you get better compression if you are compressing text in a particular language. If this were not true, you could impose endless compression:

    myfile.gz.gz.gz.gz

etc., with the file getting smaller each time. Obviously this is not true -- at some point, all the patterns in the text have been reduced to the point where they are not patterns anymore; they are symbols of various instances (or some version of same)*.

    My "cub car" idea was to create a dictionary, meaning 95% of the words could be represented with a short int (then I needed an extra byte for each word to represent the state of the local punctuation). But on average that is only going to compress like 50-60%.

    * okay, maybe that's a sketchy-inaccurate description but pretty sure my aim is true here
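
    Roughly, the encoding would look something like this (an untested sketch; the five-word list is obviously just a stand-in for the real ~40000-word hard-coded dictionary, and the punctuation byte is left out):

    Code:
    // Rough sketch only: map each known word to a 16-bit index.
    // The five-word dictionary here is just a placeholder for the real
    // hard-coded word list; punctuation handling is omitted.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main()
    {
        std::vector<std::string> dictionary = {"the", "and", "war", "peace", "prince"};
        std::unordered_map<std::string, uint16_t> index;
        for (std::size_t i = 0; i < dictionary.size(); ++i)
            index[dictionary[i]] = static_cast<uint16_t>(i);

        std::vector<std::string> words = {"war", "and", "peace"};
        std::vector<uint16_t> encoded;
        for (const std::string& w : words)
            encoded.push_back(index.at(w));        // 2 bytes per known word

        // Decoding is just the reverse lookup against the same built-in list.
        for (uint16_t code : encoded)
            std::cout << dictionary[code] << ' ';
        std::cout << '\n';
    }

    With the extra punctuation byte that's about 3 bytes per word against roughly 6 in plain text (average word plus a space), which is about where the 50-60% guess comes from.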
    Last edited by MK27; 10-01-2009 at 05:35 PM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  3. #48
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
>> The challenge was to do with War and Peace, which probably contains <50000 different words. AFAICT from just googling around, of course you get better compression if you are compressing text in a particular language.

    I don't think the language matters much. As long as you can find some pattern, there is an opportunity for compression.

    >> If this were not true, you could impose endless compression:

That just depends on the particular algorithm, and how it creates mappings between symbols. Reprocessing the output doesn't generally lead to further compression, but that isn't necessarily always the case. Obviously, once the data reaches (or starts out in) a maximum state of entropy, further compression isn't possible. Just keep in mind that one algorithm's entropy is another's orderly distribution - it's a relative thing, really.

    >> My "cub car" idea was to create a dictionary, meaning 95% of the words could be represented with a short int (then I needed an extra byte for each word to represent the state of the local punctuation). But on average that is only going to compress like 50-60%

That's basically how LZW (the basis of WinZIP, et al.) works. The biggest drawback to that algorithm is that it usually requires quite a bit of memory, and the amount needed can grow rather rapidly (most implementations just flush the dictionary at some predetermined point to deal with that). It's a pretty useful algorithm, though. It generally compresses things like text quite well, is extremely fast, and is adaptive, meaning you don't have to transmit the compression model (something that could easily take up to 2048 bytes).
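
    For reference, the core of the LZW compression step looks roughly like this (a bare-bones, untested sketch: codes are left as plain ints rather than packed into variable-width bit fields, and the dictionary flushing mentioned above is omitted):

    Code:
    // Sketch of LZW compression: start with the 256 single-byte strings and
    // grow the dictionary as longer matches are seen.
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    std::vector<int> lzw_compress(const std::string& input)
    {
        std::unordered_map<std::string, int> dict;
        for (int i = 0; i < 256; ++i)
            dict[std::string(1, static_cast<char>(i))] = i;

        std::vector<int> codes;
        std::string current;          // longest dictionary match so far
        int next_code = 256;

        for (char c : input) {
            std::string candidate = current + c;
            if (dict.count(candidate)) {
                current = candidate;              // keep extending the match
            } else {
                codes.push_back(dict[current]);   // emit code for the match
                dict[candidate] = next_code++;    // learn the new string
                current = std::string(1, c);
            }
        }
        if (!current.empty())
            codes.push_back(dict[current]);
        return codes;
    }

    int main()
    {
        std::string text = "TOBEORNOTTOBEORTOBEORNOT";
        std::cout << lzw_compress(text).size() << " codes for "
                  << text.size() << " input bytes\n";
    }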

  4. #49
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
It will be hard to come up with anything meaningful that hasn't been done already. That's what makes Quzah's idea a whole lot more interesting.

    In any case, any submissions already?
    If there's a week extension to the deadline, I'd join in. Otherwise I hope you rot in hell.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  5. #50
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
There haven't been any serious entries yet, so I guess it's extended.

  6. #51
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Very well then. I'll be working on it this weekend. Cheers.
    While I will not come up with anything revolutionary (as if!), it will at least be something new... and serious.

  7. #52
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Sebastiani View Post
    >> The challenge was to do with War and Peace, which probably contains <50000 different words. AFAICT from just googling around, of course you get better compression if you are compressing text in a particular language.

    I don't think the language matters much. As long as you can find some pattern, there is an opportunity for compression.
Except that a huge portion of the dictionary could be hard-coded, saving much time in analyzing the text and also reducing the file size.

  8. #53
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    That isn't really compression, though, per se, as much as it is an indexing scheme. Think of it this way: if you had to transmit the data somewhere that didn't have access to the database, the receiver couldn't reconstruct the original message.

  9. #54
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
Meanwhile, the requirements for the contest don't allow any kind of on-disk file access. The database would have to stay in memory, and the application would have to stay open after compression in order for it to be able to decompress.

  10. #55
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Sebastiani View Post
    That isn't really compression, though, per se, as much as it is an indexing scheme. Think of it this way: if you had to transmit the data somewhere that didn't have access to the database, the receiver couldn't reconstruct the original message.
    Well, if you gzipped something and sent it somewhere that didn't have gzip, you'd be out of luck there too. The idea, obviously, is to include a small dictionary (say 40000 words, 200k) in the application.

However, I don't think this is necessarily a good idea; it's just that it was my idea.

  11. #56
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> Well, if you gzipped something and sent it somewhere that didn't have gzip, you'd be out of luck there too. The idea, obviously, is to include a small dictionary (say 40000 words, 200k) in the application.

But gzip is just an algorithm. What you're talking about is just a form of 'data hiding'. Consider this: if I asked you to design a program that rendered 3D graphics, and you simply created images using Photoshop and embedded them into your application, would that really count? The point is, you're storing the 'result' somewhere, pretending that it's part of the algorithm.

  12. #57
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Sebastiani View Post
    But gzip is just an algorithm. What you're talking about is just a form of 'data hiding'. Consider this: if I asked you to design a program that rendered 3D graphics, and you simply created images using Photoshop and embedded them into your application, would that really count? The point is, you're storing the 'result' somewhere, pretending that it's part of the algorithm.
Pretty sure I would not be violating the rules, although it is unlikely to win anyway.

But I disagree with your analogy. Your compression technique must presume something about a file, e.g., that it contains text or even that it is byte-structured. Since the rule was for ASCII, you could further presume certain things. The 3D app with embedded photographs is unlikely to be useful because of the specificity of the content and the infinity of possibilities, but the English language is standardized and finite. If you are dealing with published material, you can also assume the spelling is mostly correct and that a hard-coded dictionary of common words could be very useful.

So it is not totally crazy. Certainly, it might significantly reduce the task of pattern matching, however that applies to whatever algorithm is used.

And the gzip algorithm is "hard-coded" in gzip, so again, the distinction holds on a certain level but is not absolute, IMO.

  13. #58
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
>> Your compression technique must presume something about a file, e.g., that it contains text or even that it is byte-structured.

    Not at all! One (perhaps unspoken) heuristic is that a true compression algorithm should be able to encode any sequence of input (optimally or otherwise), regardless of the format. And it may operate on a 'byte level' but that has nothing to do with the actual nature of the input. You could store 300 8-byte double-precision numbers in a file, and as far as the compressor is concerned, it's just a plain string of 2400 bytes.
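
    To make that concrete (a throwaway illustration; the byte histogram just stands in for whatever analysis a real compressor does first):

    Code:
    // Illustration: 300 doubles end up as a flat run of 2400 bytes, and a
    // byte-oriented compressor neither knows nor cares what they once were.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main()
    {
        std::vector<double> values(300, 2.718281828);   // 300 x 8-byte doubles

        // The compressor's-eye view: just a run of raw bytes.
        const uint8_t* raw = reinterpret_cast<const uint8_t*>(values.data());
        std::size_t raw_size = values.size() * sizeof(double);   // 2400

        std::vector<std::size_t> counts(256, 0);  // typical first analysis step
        for (std::size_t i = 0; i < raw_size; ++i)
            ++counts[raw[i]];

        std::cout << raw_size << " bytes seen by the compressor\n";
    }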

    >> The 3D app with embedded photographs is unlikely to be useful because of the specificity of the content and the infinity of possibilities, but the English language is standardized and finite.

And what you're proposing is designed specifically for an English-language translation of 'War and Peace' in pure ASCII format. Furthermore, English is hardly 'standardized and finite'. What if the compressed data were discovered 100 years from now? Wouldn't they need a 2009 dictionary to decode it?

    >> And the gzip algorithm is "hard-coded" in gzip, so again, the distinction is on a certain level but not absolute IMO.

You're comparing apples to oranges. Gzip is a general-purpose, universal, and time-independent algorithm. What you're proposing is a special-purpose, specific, and time-sensitive combination of data and algorithm. I'm afraid that the only way to turn your Pinocchio into a real boy would be to transmit the dictionary along with the data. Of course, then you really *would* need a miracle to achieve any compression with that.

  14. #59
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Sebastiani View Post
    And what you're proposing is designed specifically for an English-language translation of 'War and Peace' in pure ASCII format. Furthermore, English is hardly 'standardized and finite'.
    No, it would be designed specifically for the English language. You may notice we use standardized tools such as spell checkers that use pre-defined dictionaries and there is nothing wrong with those.

    Quote Originally Posted by Sebastiani View Post
    You could store 300 8-byte double-precision numbers in a file, and as far as the compressor is concerned, it's just a plain string of 2400 bytes.
What I meant was you are then ASSUMING an 8-bits-per-byte structure. Unless the compression works on a bit level and makes no further distinctions, it must apply a "pre-defined" strategy, so your idea that this means the compression can be used on "any data" is false. You are just resting on the status quo, as it were. So, I'm placing the status quo on a higher level: the English language. This is more limiting, but it may also allow further optimization.

    Quote Originally Posted by Sebastiani View Post
    I'm afraid that the only way to turn your Pinocchio into a real boy would be to transmit the dictionary along with the data. Of course, then you really *would* need a miracle to achieve any compression with that.
Well, I did not claim it was in the same category as the real boy. However, I will defend the claim that the dictionary can be included in the application AND NOT THE FILE. When you spell check a document, you use the exact same dictionary as you use on every other document. You do not have to save the dictionary along with the file -- that would be dunderheaded.

    Anyway, once again, I agree it is not very promising. However, I am sure you could achieve 50-75% compression that way, which is compression. If you want to redefine the word "compression" to mean "only compression which is performed according to my criteria" then I would point you to a dictionary.


    com⋅pres⋅sion
      –noun
    1. the act of compressing.
    2. the state of being compressed.
    3. the effect, result, or consequence of being compressed.
    4. (in internal-combustion engines) the reduction in volume and increase of pressure of the air or combustible mixture in the cylinder prior to ignition, produced by the motion of the piston toward the cylinder head after intake.
    5. Also called data compression. reduction of the storage space required for data by changing its format.

  15. #60
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> However, I will defend the claim that the dictionary can be included in the application AND NOT THE FILE.

Following that logic, why not just store an index into all of the books ever published up to 2009? That could be done in just a handful of bytes. Or better yet, just store the whole book. Then you could achieve perfect compression. The point is, you're not making the all-important distinction between algorithm and data.

    >> What I meant was you are then ASSUMING an 8 bit = byte structure. Unless the compression works on a bit level and makes no further distinctions, it must apply a "pre-defined" strategy, so your idea that this means the compression can be used on "any data" is false.

Programmers assume 8-bit bytes simply because that's the conventional size used on most machines. Even so, internally they may analyze and process them in various bit widths. At any rate, a true compression algorithm (including gzip) can be easily modified to, say, work with 10-bit bytes (in fact the Huffman compressor I posted the other day allows you to do just that without even altering the source code; it's a template parameter).
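
    As a rough illustration of that last point (not the code I posted, just a sketch of the symbol-width-as-template-parameter idea), here is a frequency counter that carves a byte buffer into Bits-wide symbols, which is the first step of building a Huffman tree:

    Code:
    // Sketch only: split a byte buffer into Bits-wide symbols and count their
    // frequencies. Works for symbol widths up to 24 bits with this accumulator.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <vector>

    template <unsigned Bits>   // symbol width: 8, 10, whatever you like
    std::map<uint32_t, std::size_t> count_symbols(const std::vector<uint8_t>& data)
    {
        std::map<uint32_t, std::size_t> freq;
        uint32_t acc = 0;      // bit accumulator
        unsigned held = 0;     // bits currently in the accumulator

        for (std::size_t i = 0; i < data.size(); ++i) {
            acc = (acc << 8) | data[i];
            held += 8;
            while (held >= Bits) {
                uint32_t sym = (acc >> (held - Bits)) & ((1u << Bits) - 1);
                ++freq[sym];
                held -= Bits;
            }
        }
        // Trailing bits that don't fill a whole symbol are ignored in this sketch.
        return freq;
    }

    int main()
    {
        std::vector<uint8_t> data(1000, 'a');
        std::cout << count_symbols<8>(data).size()  << " distinct 8-bit symbols, "
                  << count_symbols<10>(data).size() << " distinct 10-bit symbols\n";
    }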
