I'm a bit confused on how to encode an actual tree based on the Huffman Compression scheme (compressing a text file based on the frequency of characters in a text file. Would anyone be able to point me to some sample code? I understand the principles behind it, but writing the code what I am having trouble with.
I can try and help you, but what kind of tree you wanna encode?
I'd like to encode 3 trees, with 58 elements in all. 2 26 alphabet (upper-case and lower-case) trees and 9 punctuation symbols (comma, period, backslash, forward-slash, hyphen, semicolon, colon, quotation mark, and apostrophe).
For what I have right now is a couple functions scanning a text and counting the frequency of characters in an array with the same amount of elements (58). After the proper steps, an array gives the ranking of each element with a numebr. Therefore, freqarray = 1 would mean that 'A' is second in the rankings.
Hope that helps.
If I first might make a suggestion,
The Compression Book by Mark Nelson and Jean-Loup Gailly is a book I've been looking at lately, and it does have source for pretty much everything.
I'm going to umm, paraphrase their code now, basically an ineffecient version. First, a note on how character frequency is stored, Huffman Encoding does not need the ranking of the characters, just the frequencies.
unsigned long frequency
Also, the tree I'm showing is just gonna do one tree for all the characters (and EOF, which will be element 256. freq = 1 obviously). It shouldn't be too hard to adjust the tree making function to create three trees instead of one, just remember to decode it accordingly. The tree will be represented as an array of 513 nodes. 257 elements can only generate 257 + 256 = 513 nodes. For the first 257 nodes, those nodes will just be leaves of the tree, representing characters. The other 256 nodes will be actuall nodes. I'm gonna use an extra node just as a dummy maximum value.
The function to create the first 257 nodes is trivial (something like, tree[i].freq = frequency[i] 257 times... This is for the other nodes:
unsigned long freq;
Okay, so now we have a tree... I'm not going to pretend that was simple. Encoding the tree is simpler, but recursive, and of course, uses bitwise operators.
unsigned int build_tree(NODE * tree)
int min1, min2;
tree.freq = ULONG_MAX; //dummy value
for(next_node = 257; ; next_node++) // break is in loop...
min1 = 513;
min2 = 513;
// 1. Figure out the node with the least occourance.
// Ignore nodes with 0 occourance.
for (i = 0; i < next_node; i++)
if (tree[i] == 0) continue;
if (tree[i].freq < tree[min1].freq) min1 = i;
else if (tree[i].freq < tree[min2].freq) min2 = i;
// min1 and min2 index the nodes with the least freq.
// if min2 is still pointing at the dummy val, then there is only 1
// active node left, the tree root.
if (min2 == 513) break;
// The weight of a node is the weight of the nodes below it.
tree[next_node].freq = tree[min1].count + tree[min2].count;
tree[next_node].child0 = min1;
tree[next_node].child1 = min2;
// deactivate nodes min1 and min2...
tree[min1].freq = 0;
tree[min2].freq = 0;
// tree is now set.
return min1; // min1 should index root node at this point.
A code is a compressed letter. Bits tells how many bits long the code is. Each character will have its own code, so we'll be using an array of 257 codes.
unsigned long code;
The first call to this function, BTW, needs to pass it code as 0 and bits as 0. The function to encode is trivial, just reading the file, and every time you encounter some char c in the input, you write code[char] to the output file. Of course, you'll have to use some binary operators to write the codes, since they will not be byte-sized values (the whole point of compression really).
void tree2code (NODE * tree, CODE * codes, int node, unsigned long code, int bits)
if (node <= 257) // Halting condition.
code[node].code = code;
code[node].bits = bits;
code <<= 1;
tree2code (tree, codes, tree[node].child0, code, bits);
tree2code (tree, codes, tree[node].child1, code|1, bits);
Again, I like to think I am helpful, but this is pretty hefty stuff, which took me a while to understand. I'm not SURE that this code is compilable, although I've looked it over pretty closely. I really do suggest hounding the best library in your area for books on "data compression", as they'll have more complete, and more efficient, code.
Thanks the code you've written has been very helpful. One question I must ask is that I only want to restrict my encoding to 58 characters. How would I go about linking 2 26-character trees and 1 9 character (for punctuation) tree together? I figured that setting aside bit tags would work, but unsure on its implementation. Thank you.
I'm sorry, but I really can't think of any way to do that that would really make any sense.
Best I can figure is that you'd end up using 4 trees...
the first tree for lowercase letters (00)
second tree for uppercase letters (01)
third tree for punctuation (10)
fourth tree for other characters (11)
and the first two bits of every code would specify which tree... the only problem is that is horrifyingly inefficient. Not to mention that for the punctuation characters, their ASCII values don't have any rhyme or reason, so you'll have to use a table for that.
I'm not sure if this is an academic question, but it's really not a very good implementation to pursue.
Actually, you only really need one tree, atleast to master the concepts of Huffman. Huffman is very simple AND elegant, which is it's power. There is no distinction between lower, upper, and number characters-- they are all ASCII.
Here is essentially how it works--
1) preflight data, counting all occurrences of each ASCII character (the "frequency" of each character).
/* assume you've got a data source of arbitrary length */
unsigned char data;
/* frequency table */
unsigned char freq;
/* init frequency table */
/* get frequency of each occurrence of each ASCII character */
/* sort the table in descending order and eliminate null
frequencies. This just means that the ASCII characters that
were seen the most need to be at the top of the list, so
they are considered last, and the ones that are never
seen are stripped from the list so they are not considered
at all */
/* you may only add two adjacent frequencies together, and
they must combine to form the smallest possible _combined frequency */
Data: "Mary had a little lamb."
M = 1
a = 4
r = 1
y = 1
= 4 (this is a space character)
h = 1
d = 1
l = 3
i = 1
t = 2
e = 1
m = 1
b = 1
. = 1
Sorted, looks like this:
a = 4
= 4 (space)
l = 3
t = 2
M = 1
r = 1
y = 1
h = 1
d = 1
i = 1
e = 1
m = 1
b = 1
. = 1
Once the frequency table is built, you start combining the two lowest, _adjacent_ frequencies. It doesn't matter if a letter with a frequency of 1 (like 'e') is above or below any other letter with a frequency of 1 (like 'b'). All that matters is the frequency. The letter itself (its value, that is) becomes the index.
Simple pair them to build the binary equivalents.
I apologize for not completing the previous note, as I got rushed and was unable to give the encoded table for the above. It is this:
a = 111
space = 110
l = 1011
t = 1010
M = 1001
r = 1000
y = 0111
h = 0110
d = 0101
i = 0100
e = 0011
m = 0010
b = 0001
. = 0000
As you can see, every character was compressed by atleast 1/2 (4 bits versus original 8) so the original data was compressed atleast 50%. Very effecient.