Thread: Text Compression Contest

  1. #16
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by cyberfish View Post
    And of course, this is a "This is my contest. Take it or leave it. No comments or suggestions accepted nor welcomed" contest.
    Yeah well we all know how 'some' people prefer to spend their time trying to break a contest, rather than taking part in it.

    If you can't handle a difficult challenge, like writing an actual entry instead of trying to weasel the contest with return 0; entries, then you should probably just not read my contest threads. My contests generally require you to actually write code.

  2. #17
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by abachler View Post
    So basically a bible code compressor using a an RNG with a run length of at least the factorial of 256^ length of the book. Yeah I'm thinking the seed would probably exceed the size of the original file. But it's a good idea anyway.
    You should be able to use digits of PI in the base of your choice. Pull a chain of how frequent each value is, and find a series of digits of _ that match that. You could set your skip bigger and wrap around from the end of the file to the start again, to get better results and fit your start/jump set in a smaller number of bites. For example, you could use 128 bits for each, and be skipping a billion digits of PI (or rand calls) per jump. In theory you should be able to pin down any file in what ... 8K? 128 * 256 * 2? Maybe my math is off, I'm tired. But considering you could do jumps in the millions if you needed to, it seams feasible. But like I said, computing it would be a lot of friggin work.


    Quzah.
    Last edited by quzah; 09-29-2009 at 10:17 PM.
    Hope is the first step on the road to disappointment.

  3. #18
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by quzah View Post
    You should be able to use digits of PI in the base of your choice. Pull a chain of how frequent each value is, and find a series of digits of _ that match that.


    Quzah.
    Yes, but again, you have to store the index into pi where your sequence begins, which ostensibly will be larger than the length of the original data.

    So it would work if you wanted to compress something that was found in the first umpteen bytes like the sequence 14159265, but it would not work for arbitrary data.

  4. #19
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Say we use 128 bit keys. 2^128th is a pretty big number. Therefore, you could pick a starting digit of PI anywhere in there, and a skip number, anywhere in 2^128. All you need is a sequence of numbers that fall anywhere in 2^128 and repeats at a GCD that falls anywhere in 2^128 more jumps. But I don't know anything about cryptology, or compression (it's not interesting to me), it just seems like you could get the math to work, since those are absurdly big numbers.


    Quzah.
    Hope is the first step on the road to disappointment.

  5. #20
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by quzah View Post
    Say we use 128 bit keys. 2^128th is a pretty big number. Therefore, you could pick a starting digit of PI anywhere in there, and a skip number, anywhere in 2^128. All you need is a sequence of numbers that fall anywhere in 2^128 and repeats at a GCD that falls anywhere in 2^128 more jumps. But I don't know anything about cryptology, or compression (it's not interesting to me), it just seems like you could get the math to work, since those are absurdly big numbers.


    Quzah.
    Your idea is interesting, and something I've pondered myself. But let's consider the conditions imposed on pi, if this were true. It would mean that it takes fewer than N digits to describe the offset within pi of any arbitrary N-digit sequence (this is the only way you can achieve compression). It's late, and I've had a few beers. But something tells me that such a number is a logical impossibility (whether that number is pi or otherwise).
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  6. #21
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    the problem is that you arent compressing a single byte, you are compressing a sequence of bytes or bits either way, and to have a sequence with every possible subsequence of size N you need a minimum run length. e.g.

    say I want to compress a series of 2 numbers with values ranging from 0 to 3. Ostensibly you can directly represent that with 4 bits. But to have a sequence with every possible run of 2 digits 0 to 3 takes

    [00102031121312232330]010203112131223233

    or 19 digits, since the start point can be anywhere from 0-18 it takes ln 19/ln 2 binary digits or 4.2479 bits just to represent the index, and you still need to represent the run length, which even if we assume is known to be 2 and takes no additional bits, still leaves you with a larger index than the original data.

    Now that said, if we assume that we always use 2 bit sequences, we can achieve parity with any arbitrary bit sequence using

    [0110]11

    requiring 2 bits for the index (ln 4/ln 2)

    with 1 bit we get parity as well. Since any arbitrary sequence of bits can be reduced to 1 or 2 bit sequences, we can always achieve parity by reduction, but with bit lengths longer than 2 it is not possible to fully represent every sequence of N bits in fewer than 2^N bits, which is the maximum index size that would achieve parity of compression.

    If such a thing were possible it would be possible to achieve arbitrarily high compression ratios by simply compressing the compressed data repeatedly. So the challenge then to prove P or NP is to find a sequence of 2^N-1 bits that contain every possible combination of N bits.

    It is sufficient to search for a sequence of 2^N-1 because any shorter sequence that properly encoded every combination could just be padded with 0's or 1's and would be a valid solution of 2^N-1 length.

    Now I posit that there exists for any positive integer N a sequence of 2^N bits that contains every possible combination of N bits, if you allow wrapping from the end to the beginning, but that there exists no shorter sequence that fulfills the same requirement.

    1 - 01
    2 - 0011
    3 - 00111010
    Last edited by abachler; 09-30-2009 at 02:14 AM.

  7. #22
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Think of it as absurd RLE. But instead of a length of continuous characters, it's the distance you skip until you encounter the same character. I think. Anyway, I started to do something to see if it was possible, and the number gets insane very fast, and I'm probably off, because I'm being lazy and don't care to find an optimum way to do it.

    Make an array of 1000 characters. That's our file, yeah, it's small. "fill it" by calling rand() %256, and only plotting the 'a' characters that result.
    Code:
    for( x = 0; x < 1000; x++ )
        array[ x ] = rand() % 256 == 'a' ? 'a' : 0;
    Now, count up all the times you ran into an 'a' (or just simply print out the index of x when that happens. I got four, which is what we should expect. Multiply all of those indexes together.

    Now, I think that if I start at the first one, and count ahead <sum> places, allowing for wrap, I should be able to stamp down an X at the first one, and again when I count <sum> more places, again, and again, and again, and have it just have filled in all four 'a's. Right? I donno, it's like 25,000,000,000,000 and some change, so it'll take a long ass time to find out. (GCD is what I'm thinking, I can't think right now, so that's probably wrong.)

    Anyway, if that's true, all we should need is:

    <byte value we're placing>, start position, skip value, number of 'a's in the file.

    You wouldn't want to do this for small files, because the "rebuild" file would take up more space than a small file.
    You actually wouldn't want to use it on huge files, because it would take forever to calculate the math to find out where everything ends up.

    But, for the sake of argument, let's pretend we can do it (the math, and the jumps) instantly. Time isn't anything we care about. Could you, given where to start, how far to jump, and how many jumps it takes, fill in all of the 'a's in a file this way? It seems right in theory.

    Edit: I think my math isn't working out with just multiplying them together. I'll fiddle with it later maybe. But I like the idea. It might be completely absurd, but I still like the idea.

    Quzah.
    Last edited by quzah; 09-30-2009 at 02:23 AM.
    Hope is the first step on the road to disappointment.

  8. #23
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by quzah View Post
    But, for the sake of argument, let's pretend we can do it (the math, and the jumps) instantly. Time isn't anything we care about. Could you, given where to start, how far to jump, and how many jumps it takes, fill in all of the 'a's in a file this way? It seems right in theory.


    Quzah.
    That's what I was trying to explain in my post, its not possible, because no matter what sequence of bits you use, even pi, or any other constant, there exists no sequence of bits shorter than 2^N that encodes every combination of N bits., and since it takes at LEAST ln 2^N/ln 2 (i.e. N) bits to index into a bitfield of length 2^N, you can never achieve compression of arbitrary data.

  9. #24
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    I don't know. You can make it easier if you allow higher values to overwrite lower ones; this way you only have to have 'perfect encoding' of roughly half. Ok let's say this is our file:

    01234869482348523198340

    I start zero filled, so I don't need to actually store zeros. All I need are 1s that fit:

    XY???????????????Y????X

    So say my 1s produce:

    X111???11???1??1?1??1?X

    My 2s need to produce:

    XXY???????Y????Y?X????X

    And so on. Until finally we have to get an exact:

    XXXXXXXYXXXXXXXXXXYXXXX

    The stored values don't have to know anything about the rules that got them there. I really don't know the math involved here, for the required number of bits and such, to be quite honest, it never interested me enough. But it does seem that you don't need to put down what hits you have to get in the file. You just need to put the value that provided those hits, and leave figuring out what it means to your decoder.

    See, even though I'm plotting 'a's in my file, it doesn't need to actually care anything about what it's plotting. All it needs to know is countX, countY times. Your de/encoder handles all the rules of checking if it can or cannot plot there. Hell you could even do that backwards. We know it's the hardest to get our last set, 9s in this example. Write them first. Now playing by the rule of "don't overwrite anything higher than you", plot in all the 8s. The run of 8s maybe includes stops on 9s. It counts them as having been placed, but it doesn't actually place anything. But to get to the final matching set of 8s, it can count some stops on 9s as valid stops along the way. We're still just saying "countX Y times".

    Ah well, all the fun stuff is impossible to do anyway.


    Quzah.
    Hope is the first step on the road to disappointment.

  10. #25
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Look I don't need an explanation as to what you mean, I know what you mean, I'm saying it won't work because it is mathematically impossible to produce ANY sequence small enough that it would take fewer bits to encode the INDEX into that sequence than the total number of subsequences it could possibly encode. Using random sequences, or perfectly formed sequences does not matter, the best you can EVER achieve mathematically using that method is parity because the INDEX must be AT LEAST as large as the total number of possible initial combinations, or you cannot uniquely represent every possible combination. That is, if you have 8 possible symbols to compress you must have at least 8 possible index values, which must all be unique. You cant encode 8 symbols with only 7 index values, its mathematically impossible, at least for lossless compression which text compression must be. The compression methods that work generally operate on the principle that with the raw data, you don't need to encode very possible sequence, for example, in war and peace, nowhere does there occur the sequence of letters 'Obama is a Muslim', therefore that combination can be discarded as a possible output, reducing the library size, and thus the index needed.

  11. #26
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    I don't need to encode the sequence. All I need to do is tell it to count X Y number of times. That's the sequence. If I tell you to count to 1000, 10 times, I don't need to have 10 bytes to tell you to count 10 times. I need 1 byte to tell you to count 10 times, and another N bytes to tell you what number to count to. What happens when you count that many times is handled by the program decoding that. That's my whole point. You count the same number a set number of times. That's all. The decryption program can take 100 petabytes for all I care.


    Quzah.
    Hope is the first step on the road to disappointment.

  12. #27
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by quzah View Post
    I don't need to encode the sequence. All I need to do is tell it to count X Y number of times. That's the sequence. If I tell you to count to 1000, 10 times, I don't need to have 10 bytes to tell you to count 10 times. I need 1 byte to tell you to count 10 times, and another N bytes to tell you what number to count to. What happens when you count that many times is handled by the program decoding that. That's my whole point. You count the same number a set number of times. That's all. The decryption program can take 100 petabytes for all I care.


    Quzah.
    No arbitrary sequence can be guaranteed to occur within N bits, with N being the size of your index, hence no compression.

    You are missing the point entirely. I'm not sayign you cannot FIND the sequence, and then produce an index that points to it, liek a pointer, I am saying that the Index, your 1000 tiem 10, will take more space than the original data.

    Sure, you can find a single case where it wont, but you will not find every possible sequence, hence arbitrary, within that index size, hence you cannot acheive arbitrary compression. You woudl be limited to only those sequences found within N bits, adn coudl never use any sequence that was not within N bits.

    To prove it to you -

    using pi

    3.14159265358979323846264338327950

    just encoding the single digits 0-9

    say you want to encode the number 7, you cannot find an index value lower than 14 that encodes it, the first instance of '7' is at place 14 hence you need ln 14/ln 2 bits to represent JUST the numbers 1-9, also adding 0 takes you out even further,

    since you can arbitrarily represent 10 symbols with ln 10/ln 2 bits, and index of size ln 14/ln 2 will ALWAYS AND FOREVER UNTIL THE END OF TIME be larger and thus take more space and achieve no compression. In fact, even if you form a PERFECT hand made sequence that reduces all redundancy the best you can achieve is parity.

    You cannot achieve lossless compression of arbitrary data with that method, it is impossible, no matte how many tricks you try, it is mathematically impossible. You might as well try to prove that 7 is more than 8.
    Last edited by abachler; 09-30-2009 at 06:18 PM.

  13. #28
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Well, 7 is larger than 8 for sufficiently large values of 7!


    Quzah.
    Hope is the first step on the road to disappointment.

  14. #29
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by quzah View Post
    Well, 7 is larger than 8 for sufficiently large values of 7!


    Quzah.
    I just lost all respect for you. I go through the effort of trying to explain one of the fundamental principles of information compression and all you can do is make jokes....

  15. #30
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by abachler View Post
    I just lost all respect for you. I go through the effort of trying to explain one of the fundamental principles of information compression and all you can do is make jokes....
    I'm sorry you have no sense of humor; that's a pretty standard mathematics joke. Frankly, I don't care if anyone here "respects" me or not. Honestly the opinions of "Random Internet Guy" mean nothing to me.

    But I was not following the math. That's not why I made the joke, it just fit right in with the "prove 7 is larger than 8". It was a given. It would have been remiss of me not to make the joke.

    Anyway, I can't see why if I can chain X numbers together by saying do Y Z times, that that must take more bytes. It doesn't make sense to me. If I can look at a sequence of numbers (indexes), and say, "Oh, hey, jump N places M times, and you'll hit all those squares!", why expressing doN*M must take up more bits than N values. Saying "Move 1000 spaces 10 times", does not up 10 bytes. It barely takes twice that for me to actually type "move 1000 spaces 10 times". So why is it that: sizeof( short ) + sizeof( char ) must be bigger than sizeof( char ) * 10? It isn't!


    Quzah.
    Hope is the first step on the road to disappointment.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Detecting Edit Control Text
    By yoonkwun in forum Windows Programming
    Replies: 2
    Last Post: 07-12-2009, 08:48 AM
  2. DirectX | Drawing text
    By gavra in forum Game Programming
    Replies: 4
    Last Post: 06-08-2009, 12:23 AM
  3. Small HTML question
    By Thantos in forum Tech Board
    Replies: 4
    Last Post: 12-29-2003, 12:37 AM
  4. WANTED: Contest Master
    By kermi3 in forum A Brief History of Cprogramming.com
    Replies: 15
    Last Post: 01-23-2003, 10:15 PM
  5. Ok, Structs, I need help I am not familiar with them
    By incognito in forum C++ Programming
    Replies: 7
    Last Post: 06-29-2002, 09:45 PM