Text Compression Contest

**abachler** · 09-30-2009

then by all means show me a sequence of fewer than 10 digits that contains every number from 0-9. Because that is what you are claiming is possible. Your index will always take more bits, because you cant say well, I found 14159 at index 2 so my index only takes 2 bits. Because you cannot find arbitrary sequences, that means any sequence I happen to need compressed, such as 27894 using only a 2 bit index, therefore your index must mathematically be large enough to contain an index sufficient to find ALL possible sequences of say 5 numbers. Plus you have to somehow encode the run length, either by having all symbols be of fixed length, or by encoding it as part of the index. Even in those cases you can never achieve better than parity. If you had actually read my previous posts you would understand why.

**quzah** · 09-30-2009

It's not that I'm counting every digit from 0 to 9. It's that I'm counting X of Y digit by counting Z moves. Here, let's craft a file, it's not going to be compression at this point, because using a small file costs you (go .zip an empty text file, and the header makes the file bigger than it would be if you didn't 'compress' it). But it should prove the idea I'm trying to get across.

<some random size>
10 7
20 8
33 145
5 128
7 345
8 1
11 34
17 256
89 12045

There's our file. We don't store zeros, because we zero fill the output file first, <some random size> bytes.

Next, we call function f, 10 times, with 7 as its argument.

Code:

for( x = 0; x < y; x++ )
    f(7);

Then we call it 20 times, with 8.
Then 33 times with 145.
And so on.

Now, what we've done is popped through all the spots in our output file, and in the end, have possibly written the sum of column B number of bytes, where the byte written is dependent on where we were in the file at the time we read the instructions. For example, f(7), was our 1s. f(8) was our 2s. And so on.

I don't care about storing 0-9 in less than 10 digits. I don't need to. All I'm writing an output file based on a predetermined series of described jumps. That's all. That's the "compression".

It's the same idea as RLE. RLE says, "write out X of Y". That's all. That's what RLE does. This is the same thing, except how we obtain what we write is far more complex to derive. Once you figure the math of how far you have to jump, and how many jumps you need, it's output file is smaller than its input file, assuming you don't do something stupid, and try to "zip up a zero byte file".

Yes, there is "header overhead". Yeah, I'm going to have a line for each value I want to write. Yeah, each spot is going to have how many times I try to write, and how far I jump before the next write. But, beyond that, I'm not seeing why it's guaranteed to be bigger than what I'm writing.

The resource consuming part is finding out what number you have to have to describe your jumps, how many and how far. It's like searching for prime numbers. You know there are prime numbers. You just don't know where the next one is. It's there some place.

Think of it as how you can grab N numbers and find their GCD, LCD, GCF, WTF. That's there, you can compute it. Then the idea is that given N numbers, you should be able to describe a relationship between those numbers where X moves of Y in a boundary of Z.

Maybe I'm way off, but I'm not seeing it.

[edit]
I suppose I could describe the file's size by replacing the top line with: <size> 1, which would say "write 0, size times, in jumps of 1." (Just to make all of your reads standard...

Code:

for( x = 0; fgets( .. ) != EOF; x++ )
{
    sscanf count and jumps
    for( ... )
        f(jumps,x)

Something to that effect.
[edit]

Quzah.

**abachler** · 09-30-2009

Go back and actually read my post, it explains it, you obviously didn't read them or you wouldn't keep insisting you can represent 2^N unique symbols in fewer than N bits.

For the purposes of your method each 'symbol' is a unique position in teh sequence i.e. pi or whatever you use. Each unique value to encode must have a UNIQUE position in pi, and therefore a UNIQUE index. For lossless compression there must be a ONE TO ONE mapping from compressed data to uncompressed data. therefore once again -

YOU CANNOT REPRESENT 2^N UNIQUE SYMBOLS WITH AN INDEX SIZE OF FEWER THAN N BITS.

It doesn't matter how you represent your index, as a straight number, or as some multiplication problem, it will still occupy at least as many bits as the original data.
If your index is 3 bits you can only represent AT MOST 8 positions in the sequence, meaning you can encode AT MOST 8 symbols, meaning you achieve ZERO COMPRESSION, ZIP NADA NOTHING.

If you cannot understand that, then I'm about to start spoon feeding you by making you answer question like, do you even know what a bit is?

**quzah** · 09-30-2009

I'm not compressing the index of 0-9 lower than 10. I'm compressing the NUMBER OF EACH. Watch, I can compress 500 integers into 4 characters. Ready? Really ready for this? Because I don't think you are, but I'm going to do it anyway!

Code:

2,500

There. See that? That's four bytes if we ignore the separator. Want to know what I just compressed? All the multiples of 2 up to 1000.

That's pretty neat huh? I just compressed every multiple of 2, up to 1000 into five characters. Want to know how I generate the output file for that? I run that through a ........ing function!

Code:

fun(2,500)

BAM!

Do you understand yet?

Quzah.

**abachler** · 09-30-2009

Originally Posted by quzah

I'm not compressing the index of 0-9 lower than 10. I'm compressing the NUMBER OF EACH. Watch, I can compress 500 integers into 4 characters. Ready? Really ready for this? Because I don't think you are, but I'm going to do it anyway!

Code:

2,500

There. See that? That's four bytes if we ignore the separator. Want to know what I just compressed? All the multiples of 2 up to 1000.

That's pretty neat huh? I just compressed every multiple of 2, up to 1000 into five characters. Want to know how I generate the output file for that? I run that through a ........ing function!

Code:

fun(2,500)

BAM!

Do you understand yet?

Quzah.

now compress a random sequence of NOT 500 same symbols in a row, yeah, your theory doesn't work so well with ARBITRARY data does it.

**quzah** · 09-30-2009

That's my whole point! Just like you can find the least common denominator, you should be able to describe the relationship between N numbers, where the distance between them is described as X moves of Y.

I said from the start that it might be impossible, but that in theory I think you should be able to describe it as such. Consider hashing. Ideally, you want no collisions. You want a number that when you drop it in the bucket, fills each bucket equally. But the numbers you are dropping in are random pretty much. So what happens? You pick a prime number, or something fun, and hop around in your buckets and fill them up. It's kinda the same idea.

I didn't say *I* could do the math for it. That's why I said you'd need a quantum computer to calculate the math involved.

I want to hit spots 1 5 and 8 out of 11 buckets. All of those numbers were chosen randomly.

ABCDEFGHIJK
1...5..8...

I describe these moves as "starting at zero, move x--, 2 times". That fits, right?

04--2

Hey, I'm only 1 byte off from being even! What if I assume that -- is a given? Well now I'm:

042

What if I assume that everything starts at 0?

Now I'm at:

42

I've just just compressed something, because I could see a relationship between those random numbers. The problem is that when you're talking about a huge number of them, it scales really badly. I couldn't tell you the relationship between 50 random numbers. But in theory, there should be one.

I can't do the math for it. But the theory is that there's a sequence of numbers where by moving X moves of Y you will hit each of those spots in an array of Z elements, wrapping if needed, ingoreing stuff you stamp over, if needed, writing over what you stamp over, if needed. You could describe the rules in the function, and the output file doesn't need to know those rules necessarily. That's what I was trying to illustrate above. It would be fun to be able to run through ... digits of pie, rand(seed(x)), whatever, and have your program spit out relationship jumps. That's what my first post was all about.

Quzah.

**quzah** · 09-30-2009

Originally Posted by abachler

now compress a random sequence of NOT 500 same symbols in a row, yeah, your theory doesn't work so well with ARBITRARY data does it.

Who cares about arbitrary? You said you couldn't compress N in less than N/FOO bytes. I just did. Maybe you didn't express it clearly enough, specifying that it had to be arbitrary, but the fact is, you CAN describe how to compress N values in less than N bytes of description, as I've just proven with 2,500.

And yes, we're primarily talking about arbitrary numbers. But you repeatedly said you CAN NEVER ... and well, yes, you can.

Quzah.

**quzah** · 10-01-2009

The only reason numbers are random is because you can't see a connection between them.

Quzah.

**abachler** · 10-01-2009

Originally Posted by quzah

Who cares about arbitrary? You said you couldn't compress N in less than N/FOO bytes. I just did. Maybe you didn't express it clearly enough, specifying that it had to be arbitrary, but the fact is, you CAN describe how to compress N values in less than N bytes of description, as I've just proven with 2,500.

And yes, we're primarily talking about arbitrary numbers. But you repeatedly said you CAN NEVER ... and well, yes, you can.

Quzah.

I believe I did express that fact, in big bold all capital underlined letters. I'm sorry you missed it. Here it is again -

Originally Posted by He who must have a masochistic urge to spoon feed the noob

YOU CANNOT REPRESENT 2^N UNIQUE SYMBOLS WITH AN INDEX SIZE OF FEWER THAN N BITS.

I even added italics this time so maybe you wont miss it.

**Mario F.** · 10-01-2009

It's irrelevant. You don't need to compress the whole file.

**quzah** · 10-01-2009

Originally Posted by abachler

I believe I did express that fact, in big bold all capital underlined letters. I'm sorry you missed it. Here it is again -

I even added italics this time so maybe you wont miss it.

Um yeah, except you missed the whole part where I JUST DID with my three characters. It's not "N unique characters in N bits". I'm not trying to compress 3 unique characters to 3 BITS. I'm compressing three characters to 2 characters. I don't see how what you keep shouting is relevant. Explain it clearer than "3 unique characters in less than 3 BITS", because I'm not trying to compress 3 unique characters in 3 BITS. I'm compressing 3 characters in 2 bytes. You're saying I'm trying to do:
1,5,8 = 11 (bits, both set). I'm not. I'm not trying to fit three characters into three BITS.

I'm fitting 3 characters in TWO BYTES. That's compression. Compressing three characters into two is compressing it. I'm not sure how that has anything to do with whatever it is you're babbling on about.

I don't have to compress unique characters into smaller bits. I'm compressing the number of X I find into less than X * sizeof( X ) bytes. Other than that, I'm not sure what you're trying to get at. It's not that I'm not reading it, it's that just because you make it bold, doesn't mean your explanation is clear.

If I take a file, and compress all of the 'a' characters in that file into some nifty RLE-esque sequence, that says "start here, skip N places, and wherever you land, place an 'a'", and that's all I do, and my description of "start here, skip..." takes up less than 'a' * total times 'a' appears in the file, then I have just COMPRESSED IT!

I don't care whatever you're babbling about fitting 'a' into 1 bit. I'm not trying to do that. No one is trying to fit the numbers 0 - 9 into less than 10 bits (Oh, FYI, 0-9 (10 digits) is actually 4 bits, so what ... it's automatically compressing itself, because 4 is less than 10!? Seriously, instead of shouting over and over that the sky is green, tell me what makes it green, and how that relates to me saying there's a moon.)

Edit: As Mario said, unless he's talking to me, in which case, I don't understand the point either of you are making: I don't have to compress every aspect of a file to achieve compression. I can compress one part of it, and if I describe that expression in less bytes than what I'm compressing, then YES I HAVE JUST ACHIEVED COMPRESSION.

So, if you're planning on trying to explain this again, do it like you're talking to someone about someone who has never heard of math. Because the way you are saying it doesn't make any sense to me.

Quzah.

**abachler** · 10-01-2009

Originally Posted by quzah

So, if you're planning on trying to explain this again, do it like you're talking to someone about someone who has never heard of math. Because the way you are saying it doesn't make any sense to me.

Quzah.

Explain to us what you think a bit is.

**quzah** · 10-01-2009

Think of it this way: If I look at a file, and say: "Hey, every N characters, there is an 'a'!", and so I remove every N characters and put at the top of the file (or the end, whatever):

'a',N

And when I "uncompress the file", I simply do:

Code:

for( x = 0; (c = fgetc( in )) != EOF; x++ )
{
    if( x % N == 0 )
    {
        fputc( out, 'a' )
    }
    fputc( out, c );
}

Then no matter what you babble on about N in N bits, it doesn't matter. I've just uncompressed this file. If you can say that by doing f(N), it produces an output, and describing f(N) in the file takes up less space than all of N, then you've just achieved compression. You don't have to compress every single different unique character. All you have to do is compress the total number of occurrences of X into a description of f(N), where that description takes up less space in the file than the total number of times X appears, and you've just achieved compression.

None of your bits-vs-bytes-vs-unique characters matters. If it does, you're TERRIBLE at explaining how. Because I can PROVE that saying f(N) takes up less space than all of the number of X in a file.

Quzah.

**bithub** · 10-01-2009

You guys are both right, yet are completely misunderstanding each other.

Quzah's technique will work. It will not work with completely arbitrary, random data. But it will work with data that follows some sort of pattern. The trick is figuring out what the pattern is.

Abachler is saying that Quzah's technique will not work with completely arbitrary, random data. This is true, because by definition, random data doesn't have a pattern. Can anything be completely random though?

See, you're both right.

**quzah** · 10-01-2009

Originally Posted by abachler

Explain to us what you think a bit is.

It's the internal representation of a value that means either 1 or 0. Combine two of them, and you can now count to four. Combine three of them, and you can count to eight. Combine four of them, and you can now count to sixteen. Etc.

Let's say that every N bits in a file is 1. That pattern, 1 appearing every N bits, in a file can be expressed as f(N). I say ... anywhere ... f(N). Me saying f(N) takes up 32 bits (assuming I'm expressing it in clear actual readable text, f parenthesis N parentheis). If N is bigger than 32, then I have just compressed my file.

It doesn't matter if to uncompress my file if I have to use all the computers in the world, and all the power produced by all the power plants in the world to computer that decompression. None of that matters. If my output file is smaller than my input file, that's compression. Period.

Quzah.