Couple questions/Huffman code

This is a discussion on Couple questions/Huffman code within the C++ Programming forums, part of the General Programming Boards category; I have a couple questions. 1) So, I understand that the Huffman compression method creates a binary tree depending on ...

  1. #1
    Registered User Cpro's Avatar
    Join Date
    Oct 2006
    Posts
    149

    Couple questions/Huffman code

    I have a couple questions.

    1) So, I understand that the Huffman compression method creates a binary tree depending on the number of occurrences/probability of a character (in my case). The higher the probability, the closer it is to the top of the tree. Each character in the tree is associated with a code, depending on its location in the tree, and this tree can be transformed into a table. Say, for example, you have this line being read from a file "AABC". So, I would count the frequency/probability of each letter and put it in a binary tree. So the table might look like:
    [code]
    A 2 010
    B 1 000
    C 1 1010
    [code]
    Here is where I'm a little confused.
    When I output the compressed file it would look like:
    Code:
    010 010 000 1010
    But how is this compressed? There are more characters now than there was in the beginning. So, if I read in the "compressed" file, i would be reading in 13 characters instead of the original four.

    2) How come i can do:
    Code:
    char name[15] = "David";
    but not:
    Code:
    char name[15];
    name = "David";
    Same here:
    Code:
    class Person
    {
    public:
    	string LastName;
    	char FirstName [11];
    	Person();
    };
    
    I can do this with a string: 
    person1.LastName = "David";
    but not with the char:
    person1.Address = "29th";
    I want to be able to declare the char array by itself, and then assign it a value later on. It must be a char array of a fixed length.
    IDE - Visual Studio 2005
    Windows XP Pro

  2. #2
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,189
    no, the 0 and 1 are bits in the sequence, not individual characters (ascii). so you took 3 ascii characters (24 bits) and compressed it to 13 bits, which is just under 2 bytes for a 45.8% compression. You must be using a suboptimal huffman algorithm, because huffman would produce for AABC

    A 2 0
    B 1 10
    C 1 11

    with tthe result being 001011 or 7 bits for 78.1% compression
    Last edited by abachler; 07-01-2008 at 01:42 PM.
    Until you can build a working general purpose reprogrammable computer out of basic components from radio shack, you are not fit to call yourself a programmer in my presence. This is cwhizard, signing off.

  3. #3
    Registered User Cpro's Avatar
    Join Date
    Oct 2006
    Posts
    149
    Then, do i read the compressed file differently than the non-compressed?

    For example, when reading from the un-compressed (AABC):
    Code:
    char letter;
    while(1)
    {
         inFile >> letter;
         if(inFile.fail()) break;
    }
    So, this would read four (five if you include the break) times.

    But when reading from the compressed file (001011)
    Code:
    int n;
    while(1)
    {
         inFile >> n;
         if(inFile.fail()) break;
    }
    This would read 6 (7) times, which is more than the original.

    "You must be using a suboptimal huffman algorithm, because huffman would produce for AABC"
    I just made the numbers up.

    Thanks.
    IDE - Visual Studio 2005
    Windows XP Pro

  4. #4
    Registered User
    Join Date
    Jul 2003
    Posts
    110
    I think you are confusing binary and text modes here.

    You aren't going to write out the character '1' and '0', you are going to set bits in a bitmap (basically) and write the bitmap out.

    Let's assume that you are using 8-bit chars for this.

    The 8-bit code for 'A' uses 8-bits, or one char of space in the file. So AABC would take your four chars. The huffman encoded file, like abachler was trying to point out to you, contains 001011 which uses 7-bits, or slightly *less* than one char of space in the file. You went from 4 to 1.

    So Yes! You read the compressed file differently than the non-compressed file. Does that help clear it up a little more?


    Also, you can't use simple assignment for arrays (as in your second part of your original post). You would have to use a function like std::strcpy.

  5. #5
    Registered User Cpro's Avatar
    Join Date
    Oct 2006
    Posts
    149
    "You aren't going to write out the character '1' and '0', you are going to set bits in a bitmap (basically) and write the bitmap out."

    Is there a link to an example on how to do this and/or what the compressed output file looks like? I tried searching, but I'm not exactly sure what i'm looking for. Creating the tree and code for each character should be no problem, but i'm not sure how to write the code (bitmap) in the output file.
    Everything i've read about Huffman encoding explains how to create the tree and table, but not how to output/create the compressed file from that information.

    Thanks.
    IDE - Visual Studio 2005
    Windows XP Pro

  6. #6
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,189
    There are 2 standard methods.

    Bitfields. C/C++ let you directly manipulate the bitfields in a structure

    Code:
    struct BitField {
       a : 1;
       b : 1;
       c : 1;
       d : 1;
       e : 1;
       f : 1;
       g : 1;
       h : 1;
       } MyField;
    creates a structure with members a,b,c,d which are each 1 bit in length. So setting MyField.a = 1; sets the bit associated with a to 1. MyField would take up a single byte only.

    The other method, and the one I use most, is to do bit shifting. This method is easier to impliment in assembly. It is also easier to impliment in assembly than in C/C++, at least if you are comfortable with C/C++/ASM.

    Another thing you need to deal with is the library for your compressed output. When you read the compressed file in, your decompressor needs to know the translation to use. For example, an input of AABC and an input of DDFG woudl both produce the same output code (001011) but would have different symbols associated with 0 , 10, and 11. So your output file has to define what each compressed symbol means so the decompressor can produce the correct output.
    Last edited by abachler; 07-02-2008 at 09:04 AM.
    Until you can build a working general purpose reprogrammable computer out of basic components from radio shack, you are not fit to call yourself a programmer in my presence. This is cwhizard, signing off.

  7. #7
    Registered User Cpro's Avatar
    Join Date
    Oct 2006
    Posts
    149
    Ok, let me see if I understand this.
    Code:
    struct BitField
    {
         unsigned int a : 1;
         unsigned int b : 1;
         unsigned int c : 1;
         unsigned int d : 1;
    }MyField;
    So, using the example "AABC" with the code:
    Code:
    A 2 0
    B 1 10
    C 1 11
    Then would i do:
    Code:
    MyField.a = 0;
    MyField.b = 10;
    MyField.c = 11;
    But, i would also have to change the number of bits allocated to c & d to two (in the struct)?
    Then would i just output it to the file like:
    Code:
    cout << MyField.a << MyField.a << MyField.b << MyField.c;
    IDE - Visual Studio 2005
    Windows XP Pro

  8. #8
    Registered User
    Join Date
    Jul 2003
    Posts
    110
    A bit of advice, don't use bitfields for this. They aren't really going to get very good mileage for this problem. But no, your usage is wrong. The bitfield member will get promoted to int when it gets written IIRC*, so you are going to have to go the bitmap route anyway.

    I hesitate to post what I do, because it's pretty bad form in C++. I'm an old C programmer though, so I'm used to the abuse. See if this gives you an idea for a class you could make:

    C FAQ on arrays of bits

    Err, wait! I think C++ has std::bitset IIRC. Check that out first, but make sure you can get a bitmap back out of it.

    So you will need to make an "array of bits" large enough to hold your entire compressed file, and simply walk down it setting and clearing bits as you encode the file. Then, you write the table to the file, followed by the bitmap. Then, when you go to read the file again, you read the table first, in order to be able to decode the file that you will read next.

    HTH!

    *I don't have any resources on hand at the moment to reference this stuff, so I'm relying totally on memory. I'm almost never 100% correct from memory anymore, but the general idea is right.

  9. #9
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,046
    You might run into issues with structure padding by using bitfields, as well.

    You could use something like this . . . .
    Code:
    int set_bit(int num, int bit, int to) {
        if(to) num |= (1 << bit);
        else num ^= (1 << bit);
        return num;
    }
    
    int get_bit(int num, int bit) {
        return num & (1 << bit);
    }
    I think I have that right . . . .
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  10. #10
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,189
    if promotion is an issue, you could use #pragma pack(1). That will guarantee byte alignment and sizeing.
    Until you can build a working general purpose reprogrammable computer out of basic components from radio shack, you are not fit to call yourself a programmer in my presence. This is cwhizard, signing off.

  11. #11
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,046
    That's for MSVC only, of course.

    Recent thread about structure alignment: Structural Padding??
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  12. #12
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,189
    Well, as I said, I usually use bit shifting in assembly, so i have direct control over the sizing issue.
    Until you can build a working general purpose reprogrammable computer out of basic components from radio shack, you are not fit to call yourself a programmer in my presence. This is cwhizard, signing off.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Binary Search Trees Part III
    By Prelude in forum A Brief History of Cprogramming.com
    Replies: 16
    Last Post: 10-02-2004, 03:00 PM
  2. Math function code - couple probs, tips?
    By Captain Penguin in forum C++ Programming
    Replies: 4
    Last Post: 06-18-2002, 09:54 AM
  3. Interface Question
    By smog890 in forum C Programming
    Replies: 11
    Last Post: 06-03-2002, 05:06 PM
  4. Couple Code Snippets
    By Unregistered in forum Game Programming
    Replies: 1
    Last Post: 01-22-2002, 01:23 AM
  5. Replies: 4
    Last Post: 01-15-2002, 11:04 PM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21