Thread: Simple compression algorithm?

  1. #1
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229

    Simple compression algorithm?

    Can anyone suggest a simple compression algorithm that I can code? I am going to use it to compress wikipedia dump (text only, in XML). Uncompressed it is ~2.4GB, and Bzip2 can compress it down to ~200MB. I cannot just link to an existing library, however, as I am planning to write a java program for cell phones to uncompress and read the dump. Implementing Bzip2 in java for the cell phone doesn't sound too inviting...

    My requirements -
    Compressing speed - not important
    Decompressing speed - important
    Compression ratio - something like 2.4GB -> 500MB would be fine

    as said above, it will be used to compress human-readable text data (wikipedia articles)

    Thanks!

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Uhhh...

    Googling for "java bzip2" gets a good result on the first hit.

  3. #3
    Registered User
    Join Date
    May 2008
    Posts
    30
    The only way I have learned to do compression is with binary tress. What you end up doing is assigning each value to a bit array then you just store each value in a table

    example

    a lazy dog

    l = 1
    z = 1
    y = 1
    d = 1
    o = 1
    g = 1
    a = 2

    then

    you take the first two and make the binary tree out of it with a value of 2.

    then move on the next two till the end.

    Then you put the trees in order.
    then you start putting the trees together two at a time just as you did with the letters
    you do this till you have one tree.

    then you work down the tree assigning values to each letter
    then you just put each value of letter into the text file in order with the table of what each letter is at the beginning.

    Sorry this isn't very clear.
    Hope this helps you do all of this recursively. unfortunately we never actually got into details on how to code this.

  4. #4
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Googling for "java bzip2" gets a good result on the first hit.
    Good idea =). I don't know if a cell phone would be able to handle bzip2 though (it has to be reasonably fast). That is why I was looking for a simpler algorithm, as I don't need the compression ratio of bzip2.

    As for the binary tree, I will certainly look it up. Thanks

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    bzip2 should be fine on a mobile phone, but another algorithm that is a bit less CPU intensive, and works really well with text (e.g. XML) files is "bytepair compression". Google that, and I'm sure you can find a Java implementation.

    Note that this is not quite as good as bzip2.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Deflate (GZip) is also quite good and a good deal faster than bzip2. May be a bit tricky finding an implementation, though, since the standard edition supports it out of the box. Anyway, moving to tech, since you're apparently writing a Java program.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Produce a concordance of all the words in the file, then determine a 'top 100' (or whatever) of length * frequency, and replace those words with some kind of byte code.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  8. #8
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Salem View Post
    Produce a concordance of all the words in the file, then determine a 'top 100' (or whatever) of length * frequency, and replace those words with some kind of byte code.
    Good idea. It works similar to bytepair encoding that I suggested before, but on full words instead of pairs of bytes. Both methods are pretty simple to implement, and should be fairly fast.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  9. #9
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by Zosden View Post
    you take the first two and make the binary tree out of it with a value of 2.

    then move on the next two till the end.
    You could at least give the proper name of this technique, both for credit's sake and to make it easier for the OP to search for references. It's called a "Huffman tree," the codes so generated are "Huffman codes."

    Huffman compression is extremely easy to implement (you don't have to write the code which constructs the tree -- just precompute it based on some statistical data, there are even trees out on the web pre-made for you), and gets huge gains with little effort.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Need help finding a simple 'shortest path' algorithm
    By ashley in forum C Programming
    Replies: 2
    Last Post: 04-19-2007, 12:38 PM
  2. Simple simple graphics
    By triplem in forum C Programming
    Replies: 2
    Last Post: 05-19-2003, 02:52 AM
  3. simple hashing algorithm??
    By iori in forum C Programming
    Replies: 7
    Last Post: 04-14-2003, 05:18 PM
  4. a simple algorithm and questions
    By ustuzou in forum C++ Programming
    Replies: 0
    Last Post: 02-18-2002, 11:12 AM
  5. Simple File Creation Algorithm
    By muffin in forum C Programming
    Replies: 13
    Last Post: 08-24-2001, 03:28 PM