Simple compression algorithm? [Archive] - C Board

PDA

View Full Version : Simple compression algorithm?


cyberfish
05-01-2008, 07:45 PM
Can anyone suggest a simple compression algorithm that I can code? I am going to use it to compress wikipedia dump (text only, in XML). Uncompressed it is ~2.4GB, and Bzip2 can compress it down to ~200MB. I cannot just link to an existing library, however, as I am planning to write a java program for cell phones to uncompress and read the dump. Implementing Bzip2 in java for the cell phone doesn't sound too inviting...

My requirements -
Compressing speed - not important
Decompressing speed - important
Compression ratio - something like 2.4GB -> 500MB would be fine

as said above, it will be used to compress human-readable text data (wikipedia articles)

Thanks!

brewbuck
05-01-2008, 08:36 PM
Uhhh...

Googling for "java bzip2" gets a good result on the first hit.

Zosden
05-01-2008, 08:47 PM
The only way I have learned to do compression is with binary tress. What you end up doing is assigning each value to a bit array then you just store each value in a table

example

a lazy dog

l = 1
z = 1
y = 1
d = 1
o = 1
g = 1
a = 2

then

you take the first two and make the binary tree out of it with a value of 2.

then move on the next two till the end.

Then you put the trees in order.
then you start putting the trees together two at a time just as you did with the letters
you do this till you have one tree.

then you work down the tree assigning values to each letter
then you just put each value of letter into the text file in order with the table of what each letter is at the beginning.

Sorry this isn't very clear.
Hope this helps you do all of this recursively. unfortunately we never actually got into details on how to code this.

cyberfish
05-01-2008, 09:39 PM
Googling for "java bzip2" gets a good result on the first hit.

Good idea =). I don't know if a cell phone would be able to handle bzip2 though (it has to be reasonably fast). That is why I was looking for a simpler algorithm, as I don't need the compression ratio of bzip2.

As for the binary tree, I will certainly look it up. Thanks

matsp
05-02-2008, 02:12 AM
bzip2 should be fine on a mobile phone, but another algorithm that is a bit less CPU intensive, and works really well with text (e.g. XML) files is "bytepair compression". Google that, and I'm sure you can find a Java implementation.

Note that this is not quite as good as bzip2.

--
Mats

CornedBee
05-02-2008, 03:35 AM
Deflate (GZip) is also quite good and a good deal faster than bzip2. May be a bit tricky finding an implementation, though, since the standard edition supports it out of the box. Anyway, moving to tech, since you're apparently writing a Java program.

Salem
05-02-2008, 06:33 AM
Produce a concordance of all the words in the file, then determine a 'top 100' (or whatever) of length * frequency, and replace those words with some kind of byte code.

matsp
05-02-2008, 06:37 AM
Produce a concordance of all the words in the file, then determine a 'top 100' (or whatever) of length * frequency, and replace those words with some kind of byte code.

Good idea. It works similar to bytepair encoding that I suggested before, but on full words instead of pairs of bytes. Both methods are pretty simple to implement, and should be fairly fast.

--
Mats

brewbuck
05-02-2008, 09:10 AM
you take the first two and make the binary tree out of it with a value of 2.

then move on the next two till the end.

You could at least give the proper name of this technique, both for credit's sake and to make it easier for the OP to search for references. It's called a "Huffman tree," the codes so generated are "Huffman codes."

Huffman compression is extremely easy to implement (you don't have to write the code which constructs the tree -- just precompute it based on some statistical data, there are even trees out on the web pre-made for you), and gets huge gains with little effort.