Thread: Huffman Compression Program

  1. #1
    Chad Johnson
    Join Date
    May 2004
    Posts
    154

    Question Huffman Compression Program

    I wrote a [C++] program that uses the Huffman compression to compress a text file, and I found that it only compresses a 40K file to about 24K, whereas Winzip can compress it to 9K. I haven't even stored the tree in the file.

    Right now, my program inputs the characters from the file and creates a dictionary then a frequency table. I then create a tree out of every character in the dictionary, place them in a priority queue, and then I continually remove and combine the two smallest trees until I only have one tree left. Then I traverse the tree in preorder to create the codes (left=0, right=1). After that, I go through all the characters in the file and get their codes, combine the codes into a string and convert each set of eight characters from binary to decimal. Finally I output each decimal value to a binary file.

    Also, my program can only handle ANSI files, and it cannot deal with files that contain Unicode characters (the program can only handle characters in the ASCII table). How could I handle Unicode characters?

    I did not include the source because it is about six or seven pages long. I can e-mail it to whoever helps me.

  2. #2
    pronounced 'fib' FillYourBrain's Avatar
    Join Date
    Aug 2002
    Posts
    2,297
    huffman is a good final step to a compression scheme. It mangles it up to the point where you can't compress it any further. If you want better compression you should chain together a couple compression methods.

    Run-length encoding is really basic handling of duplication
    LZW (or a variety of it) is a much more advanced method of using duplicate patterns to your advantage

    Basically, run some sort of pattern duplication based compression algorithm on the file first. Then the huffman. Your final size will be much smaller.
    "You are stupid! You are stupid! Oh, and don't forget, you are STUPID!" - Dexter

  3. #3
    pronounced 'fib' FillYourBrain's Avatar
    Join Date
    Aug 2002
    Posts
    2,297
    didn't answer your unicode question.

    unicode is nothing more than 16 bits to store a letter. This translates to a file full of bytes. Bytes are a potential 256 possible outcomes. You can do huffman frequency tables with 256 entries just as easily as you can for the ascii set. This will provide you with binary compression.
    "You are stupid! You are stupid! Oh, and don't forget, you are STUPID!" - Dexter

  4. #4
    Chad Johnson
    Join Date
    May 2004
    Posts
    154
    But how come my program didn't compress the file as well as Winzip did? Doesn't the zip format only use Huffman compression?

    Back to Unicode, basically, I know how to create a frequency table for characters that appear in the ASCII table that I come across as I go through each character in my file. I use the ASCII code as the index for the characters in the dictionary array.

    You know how you can use integers for type char in C++? for example

    char c = 76;

    would give you the letter 'L' for the variable c. If I want to store a Unicode character in my dictionary (which would not be in the ASCII chart), how would I do that? I can't say

    char c = 706;

    and get a Unicode character because type char only knows about the ASCII table, or am I wrong?

  5. #5
    pronounced 'fib' FillYourBrain's Avatar
    Join Date
    Aug 2002
    Posts
    2,297
    Quote Originally Posted by ChadJohnson
    But how come my program didn't compress the file as well as Winzip did? Doesn't the zip format only use Huffman compression?
    No, I'm pretty sure it does a series of algorithms. Including LZW.

    Quote Originally Posted by ChadJohnson
    Back to Unicode, basically, I know how to create a frequency table for characters that appear in the ASCII table that I come across as I go through each character in my file. I use the ASCII code as the index for the characters in the dictionary array.

    You know how you can use integers for type char in C++? for example

    char c = 76;

    would give you the letter 'L' for the variable c. If I want to store a Unicode character in my dictionary (which would not be in the ASCII chart), how would I do that? I can't say

    char c = 706;

    and get a Unicode character because type char only knows about the ASCII table, or am I wrong?
    basically, all things are bytes. You should read in the source file as a buffer of bytes. It may contain unicode, but you don't care. The repetition that makes huffman work will occur on the byte level. So read in a buffer:

    unsigned char buf[MAX_BUFFER_SIZE];

    from the file. go one byte at a time.
    "You are stupid! You are stupid! Oh, and don't forget, you are STUPID!" - Dexter

  6. #6
    pronounced 'fib' FillYourBrain's Avatar
    Join Date
    Aug 2002
    Posts
    2,297
    what I'm saying about "at the byte level" is this. You're talking about cramming a 16 bit unicode character into a char. What I'm saying is that it gets split into 8 bits each. so your example

    char c = 706;

    is not right. It would be more like this:

    unsigned char c1 = (706 >> 8);
    unsigned char c2 = (706 & 0x00FF);
    "You are stupid! You are stupid! Oh, and don't forget, you are STUPID!" - Dexter

  7. #7
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    >Doesn't the zip format only use Huffman compression?
    It uses a hybrid of LZ77 and Huffman, called Deflate. The algorithm is public domain IIRC, so you should be able to download reams of info on it.
    My best code is written with the delete key.

  8. #8
    Registered User CompiledMonkey's Avatar
    Join Date
    Feb 2002
    Location
    Richmond, VA
    Posts
    438
    Does anybody know which algorithm WinRAR uses?

  9. #9
    Yes, my avatar is stolen anonytmouse's Avatar
    Join Date
    Dec 2002
    Posts
    2,544
    C provides the type wchar_t type to store a wide character.
    Code:
    wchar_t wc = 706;
    It may be different lengths on different platforms, typically 2 or 4 bytes.

  10. #10
    Chad Johnson
    Join Date
    May 2004
    Posts
    154

    Unicode

    Thanks for the info.

    I tried running a program with the following code:

    unsigned char c;
    int i;

    for (i=0; i< 1000; i++)
    {
    c = (i & 0x00FF);
    printf ("%c ", c);
    }


    and when I ran it, it simply printed the ascii table over and over and over.

    How will appending 256 in hex (& 0x00FF) to the variable do anything?

    If I give the char variable c a value of 76, it would be exactly the same as 332. The ascii table just repeats (% 256). How can I store Unicode characters and actually print them to the command line?

  11. #11
    Chad Johnson
    Join Date
    May 2004
    Posts
    154

    thanks

    ok, i guess i was writing that last question at the same time that you were writing a response. I'll try that type. thanks!

  12. #12
    Registered User
    Join Date
    Oct 2002
    Posts
    291

    Post

    Quote Originally Posted by ChadJohnson
    If I give the char variable c a value of 76, it would be exactly the same as 332. The ascii table just repeats (% 256). How can I store Unicode characters and actually print them to the command line?

    332 = 101001100 (in binary).
    you then & the value with 0xff which leaves you with the binary number 01001100 (decimal value of 76). The & operation bascilly truncates the most significant bit.

  13. #13
    Registered User
    Join Date
    Jan 2005
    Posts
    2

    C Program in Arm Processor

    I want to write a c program for an arm processor. The program will read english letter from registers in specific memory address, "translate" them according Huffman coding and write them in some output registers in the arm processor.

    Can anyone help me!
    Thanks!

  14. #14
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Well you can STOP bumping a load of threads with nothing more than "please help"
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Need help with a program, theres something in it for you
    By engstudent363 in forum C Programming
    Replies: 1
    Last Post: 02-29-2008, 01:41 PM
  2. Replies: 4
    Last Post: 02-21-2008, 10:39 AM
  3. Replies: 3
    Last Post: 03-04-2005, 02:46 PM
  4. My program, anyhelp
    By @licomb in forum C Programming
    Replies: 14
    Last Post: 08-14-2001, 10:04 PM