Thread: Huffman Encoding

  1. #1
    Unregistered
    Guest

    Huffman Encoding

    I'm a bit confused on how to encode an actual tree based on the Huffman Compression scheme (compressing a text file based on the frequency of characters in a text file. Would anyone be able to point me to some sample code? I understand the principles behind it, but writing the code what I am having trouble with.

    Thank you.

  2. #2
    Former Member
    Join Date
    Oct 2001
    Posts
    955
    I can try and help you, but what kind of tree you wanna encode?

    Oskilian

  3. #3
    Unregistered
    Guest
    I'd like to encode 3 trees, with 58 elements in all. 2 26 alphabet (upper-case and lower-case) trees and 9 punctuation symbols (comma, period, backslash, forward-slash, hyphen, semicolon, colon, quotation mark, and apostrophe).

    For what I have right now is a couple functions scanning a text and counting the frequency of characters in an array with the same amount of elements (58). After the proper steps, an array gives the ranking of each element with a numebr. Therefore, freqarray[0] = 1 would mean that 'A' is second in the rankings.

    Hope that helps.

  4. #4
    Registered User
    Join Date
    Sep 2001
    Posts
    752
    If I first might make a suggestion,
    The Compression Book by Mark Nelson and Jean-Loup Gailly is a book I've been looking at lately, and it does have source for pretty much everything.

    I'm going to umm, paraphrase their code now, basically an ineffecient version. First, a note on how character frequency is stored, Huffman Encoding does not need the ranking of the characters, just the frequencies.

    unsigned long frequency[257]
    Also, the tree I'm showing is just gonna do one tree for all the characters (and EOF, which will be element 256. freq[256] = 1 obviously). It shouldn't be too hard to adjust the tree making function to create three trees instead of one, just remember to decode it accordingly. The tree will be represented as an array of 513 nodes. 257 elements can only generate 257 + 256 = 513 nodes. For the first 257 nodes, those nodes will just be leaves of the tree, representing characters. The other 256 nodes will be actuall nodes. I'm gonna use an extra node just as a dummy maximum value.

    The function to create the first 257 nodes is trivial (something like, tree[i].freq = frequency[i] 257 times... This is for the other nodes:
    Code:
    typedef struct
    {
     int child0;
     int child1;
     unsigned long freq;
    } NODE;
    Code:
    unsigned int build_tree(NODE * tree)
    {
     int min1, min2;
     int next_node;
     int i;
     tree[513].freq = ULONG_MAX; //dummy value
     
     for(next_node = 257; ; next_node++) // break is in loop...
     {
      min1 = 513;
      min2 = 513;
      // 1. Figure out the node with the least occourance.
      //  Ignore nodes with 0 occourance.
      for (i = 0; i < next_node; i++)
      {
       if (tree[i] == 0) continue;
       if (tree[i].freq < tree[min1].freq) min1 = i;
       else if (tree[i].freq < tree[min2].freq) min2 = i;
      }
      // min1 and min2 index the nodes with the least freq.
      // if min2 is still pointing at the dummy val, then there is only 1
      //   active node left, the tree root.
      if (min2 == 513) break;
    
      // The weight of a node is the weight of the nodes below it.
      tree[next_node].freq = tree[min1].count + tree[min2].count;
      tree[next_node].child0 = min1;
      tree[next_node].child1 = min2;
      
      // deactivate nodes min1 and min2...
      tree[min1].freq = 0;
      tree[min2].freq = 0;
     }
    // tree is now set.
    return min1; // min1 should index root node at this point.
    }
    Okay, so now we have a tree... I'm not going to pretend that was simple. Encoding the tree is simpler, but recursive, and of course, uses bitwise operators.
    Code:
    typedef
    {
     unsigned long code;
     int bits;
    }
    A code is a compressed letter. Bits tells how many bits long the code is. Each character will have its own code, so we'll be using an array of 257 codes.
    Code:
    void tree2code (NODE * tree, CODE * codes, int node, unsigned long code, int bits)
    {
     if (node <= 257) // Halting condition.
     {
      code[node].code = code;
      code[node].bits = bits;
     }
     
     code <<= 1;
     bits++;
     tree2code (tree, codes, tree[node].child0, code, bits);
     tree2code (tree, codes, tree[node].child1, code|1, bits);
     return;
    }
    The first call to this function, BTW, needs to pass it code as 0 and bits as 0. The function to encode is trivial, just reading the file, and every time you encounter some char c in the input, you write code[char] to the output file. Of course, you'll have to use some binary operators to write the codes, since they will not be byte-sized values (the whole point of compression really).

    Again, I like to think I am helpful, but this is pretty hefty stuff, which took me a while to understand. I'm not SURE that this code is compilable, although I've looked it over pretty closely. I really do suggest hounding the best library in your area for books on "data compression", as they'll have more complete, and more efficient, code.

  5. #5
    Unregistered
    Guest
    Thanks the code you've written has been very helpful. One question I must ask is that I only want to restrict my encoding to 58 characters. How would I go about linking 2 26-character trees and 1 9 character (for punctuation) tree together? I figured that setting aside bit tags would work, but unsure on its implementation. Thank you.

  6. #6
    Registered User
    Join Date
    Sep 2001
    Posts
    752
    I'm sorry, but I really can't think of any way to do that that would really make any sense.

    Best I can figure is that you'd end up using 4 trees...
    the first tree for lowercase letters (00)
    second tree for uppercase letters (01)
    third tree for punctuation (10)
    fourth tree for other characters (11)

    and the first two bits of every code would specify which tree... the only problem is that is horrifyingly inefficient. Not to mention that for the punctuation characters, their ASCII values don't have any rhyme or reason, so you'll have to use a table for that.

    I'm not sure if this is an academic question, but it's really not a very good implementation to pursue.

  7. #7
    Sayeh
    Guest
    Actually, you only really need one tree, atleast to master the concepts of Huffman. Huffman is very simple AND elegant, which is it's power. There is no distinction between lower, upper, and number characters-- they are all ASCII.


    Here is essentially how it works--

    1) preflight data, counting all occurrences of each ASCII character (the "frequency" of each character).


    /* assume you've got a data source of arbitrary length */

    unsigned char data[100];

    /* frequency table */

    unsigned char freq[256];

    /* init frequency table */

    for(i=0;i<256;i++)
    freq[i]=0;

    /* get frequency of each occurrence of each ASCII character */

    for(i=0;i<100;i++)
    freq[data[i]]++;

    /* sort the table in descending order and eliminate null
    frequencies. This just means that the ASCII characters that
    were seen the most need to be at the top of the list, so
    they are considered last, and the ones that are never
    seen are stripped from the list so they are not considered
    at all */

    /* you may only add two adjacent frequencies together, and
    they must combine to form the smallest possible _combined frequency */

    For example:

    Data: "Mary had a little lamb."

    Frequency Table:

    M = 1
    a = 4
    r = 1
    y = 1
    = 4 (this is a space character)
    h = 1
    d = 1
    l = 3
    i = 1
    t = 2
    e = 1
    m = 1
    b = 1
    . = 1

    Sorted, looks like this:

    a = 4
    = 4 (space)
    l = 3
    t = 2
    M = 1
    r = 1
    y = 1
    h = 1
    d = 1
    i = 1
    e = 1
    m = 1
    b = 1
    . = 1

    Once the frequency table is built, you start combining the two lowest, _adjacent_ frequencies. It doesn't matter if a letter with a frequency of 1 (like 'e') is above or below any other letter with a frequency of 1 (like 'b'). All that matters is the frequency. The letter itself (its value, that is) becomes the index.

    Simple pair them to build the binary equivalents.

    enjoy.

  8. #8
    Sayeh
    Guest

    Huffman encoding

    I apologize for not completing the previous note, as I got rushed and was unable to give the encoded table for the above. It is this:


    a = 111
    space = 110
    l = 1011
    t = 1010
    M = 1001
    r = 1000
    y = 0111
    h = 0110
    d = 0101
    i = 0100
    e = 0011
    m = 0010
    b = 0001
    . = 0000

    As you can see, every character was compressed by atleast 1/2 (4 bits versus original 8) so the original data was compressed atleast 50%. Very effecient.

    questions?

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Audio encoding (where to start)
    By cs_student in forum C++ Programming
    Replies: 4
    Last Post: 08-04-2008, 01:23 PM
  2. encoding scheme
    By blazer26 in forum C Programming
    Replies: 11
    Last Post: 05-02-2006, 11:45 PM
  3. huffman tree
    By blazer26 in forum C Programming
    Replies: 0
    Last Post: 04-25-2006, 03:39 PM
  4. huffman encoding help
    By Unregistered in forum C Programming
    Replies: 6
    Last Post: 04-17-2002, 12:51 PM
  5. Huffman compression
    By SMurf in forum C Programming
    Replies: 2
    Last Post: 09-01-2001, 07:30 AM