Couple questions/Huffman code

**Cpro** · 07-01-2008

I have a couple questions.

1) So, I understand that the Huffman compression method creates a binary tree depending on the number of occurrences/probability of a character (in my case). The higher the probability, the closer it is to the top of the tree. Each character in the tree is associated with a code, depending on its location in the tree, and this tree can be transformed into a table. Say, for example, you have this line being read from a file "AABC". So, I would count the frequency/probability of each letter and put it in a binary tree. So the table might look like:
[code]
A 2 010
B 1 000
C 1 1010
[code]
Here is where I'm a little confused.
When I output the compressed file it would look like:

Code:

010 010 000 1010

But how is this compressed? There are more characters now than there was in the beginning. So, if I read in the "compressed" file, i would be reading in 13 characters instead of the original four.

2) How come i can do:

Code:

char name[15] = "David";

but not:

Code:

char name[15];
name = "David";

Same here:

Code:

class Person
{
public:
	string LastName;
	char FirstName [11];
	Person();
};

I can do this with a string: 
person1.LastName = "David";
but not with the char:
person1.Address = "29th";

I want to be able to declare the char array by itself, and then assign it a value later on. It must be a char array of a fixed length.

**abachler** · 07-01-2008

no, the 0 and 1 are bits in the sequence, not individual characters (ascii). so you took 3 ascii characters (24 bits) and compressed it to 13 bits, which is just under 2 bytes for a 45.8% compression. You must be using a suboptimal huffman algorithm, because huffman would produce for AABC

A 2 0
B 1 10
C 1 11

with tthe result being 001011 or 7 bits for 78.1% compression

**Cpro** · 07-01-2008

Then, do i read the compressed file differently than the non-compressed?

For example, when reading from the un-compressed (AABC):

Code:

char letter;
while(1)
{
     inFile >> letter;
     if(inFile.fail()) break;
}

So, this would read four (five if you include the break) times.

But when reading from the compressed file (001011)

Code:

int n;
while(1)
{
     inFile >> n;
     if(inFile.fail()) break;
}

This would read 6 (7) times, which is more than the original.

"You must be using a suboptimal huffman algorithm, because huffman would produce for AABC"
I just made the numbers up.

Thanks.

**whoie** · 07-01-2008

I think you are confusing binary and text modes here.

You aren't going to write out the character '1' and '0', you are going to set bits in a bitmap (basically) and write the bitmap out.

Let's assume that you are using 8-bit chars for this.

The 8-bit code for 'A' uses 8-bits, or one char of space in the file. So AABC would take your four chars. The huffman encoded file, like abachler was trying to point out to you, contains 001011 which uses 7-bits, or slightly *less* than one char of space in the file. You went from 4 to 1.

So Yes! You read the compressed file differently than the non-compressed file. Does that help clear it up a little more?

Also, you can't use simple assignment for arrays (as in your second part of your original post). You would have to use a function like std::strcpy.

**Cpro** · 07-02-2008

"You aren't going to write out the character '1' and '0', you are going to set bits in a bitmap (basically) and write the bitmap out."

Is there a link to an example on how to do this and/or what the compressed output file looks like? I tried searching, but I'm not exactly sure what i'm looking for. Creating the tree and code for each character should be no problem, but i'm not sure how to write the code (bitmap) in the output file.
Everything i've read about Huffman encoding explains how to create the tree and table, but not how to output/create the compressed file from that information.

Thanks.

**abachler** · 07-02-2008

There are 2 standard methods.

Bitfields. C/C++ let you directly manipulate the bitfields in a structure

Code:

struct BitField {
   a : 1;
   b : 1;
   c : 1;
   d : 1;
   e : 1;
   f : 1;
   g : 1;
   h : 1;
   } MyField;

creates a structure with members a,b,c,d which are each 1 bit in length. So setting MyField.a = 1; sets the bit associated with a to 1. MyField would take up a single byte only.

The other method, and the one I use most, is to do bit shifting. This method is easier to impliment in assembly. It is also easier to impliment in assembly than in C/C++, at least if you are comfortable with C/C++/ASM.

Another thing you need to deal with is the library for your compressed output. When you read the compressed file in, your decompressor needs to know the translation to use. For example, an input of AABC and an input of DDFG woudl both produce the same output code (001011) but would have different symbols associated with 0 , 10, and 11. So your output file has to define what each compressed symbol means so the decompressor can produce the correct output.

**Cpro** · 07-02-2008

Ok, let me see if I understand this.

Code:

struct BitField
{
     unsigned int a : 1;
     unsigned int b : 1;
     unsigned int c : 1;
     unsigned int d : 1;
}MyField;

So, using the example "AABC" with the code:

Code:

A 2 0
B 1 10
C 1 11

Then would i do:

Code:

MyField.a = 0;
MyField.b = 10;
MyField.c = 11;

But, i would also have to change the number of bits allocated to c & d to two (in the struct)?
Then would i just output it to the file like:

Code:

cout << MyField.a << MyField.a << MyField.b << MyField.c;

**whoie** · 07-02-2008

A bit of advice, don't use bitfields for this. They aren't really going to get very good mileage for this problem. But no, your usage is wrong. The bitfield member will get promoted to int when it gets written IIRC*, so you are going to have to go the bitmap route anyway.

I hesitate to post what I do, because it's pretty bad form in C++. I'm an old C programmer though, so I'm used to the abuse. See if this gives you an idea for a class you could make:

C FAQ on arrays of bits

Err, wait! I think C++ has std::bitset IIRC. Check that out first, but make sure you can get a bitmap back out of it.

So you will need to make an "array of bits" large enough to hold your entire compressed file, and simply walk down it setting and clearing bits as you encode the file. Then, you write the table to the file, followed by the bitmap. Then, when you go to read the file again, you read the table first, in order to be able to decode the file that you will read next.

HTH!

*I don't have any resources on hand at the moment to reference this stuff, so I'm relying totally on memory. I'm almost never 100% correct from memory anymore, but the general idea is right.

**dwks** · 07-02-2008

You might run into issues with structure padding by using bitfields, as well.

You could use something like this . . . .

Code:

int set_bit(int num, int bit, int to) {
    if(to) num |= (1 << bit);
    else num ^= (1 << bit);
    return num;
}

int get_bit(int num, int bit) {
    return num & (1 << bit);
}

I think I have that right . . . .

**abachler** · 07-02-2008

if promotion is an issue, you could use #pragma pack(1). That will guarantee byte alignment and sizeing.

**dwks** · 07-02-2008

That's for MSVC only, of course.

Recent thread about structure alignment: http://cboard.cprogramming.com/showthread.php?t=104705

**abachler** · 07-02-2008

Well, as I said, I usually use bit shifting in assembly, so i have direct control over the sizing issue.

Thread: Couple questions/Huffman code

Thread Tools

Search Thread

Display

Couple questions/Huffman code

Similar Threads

Binary Search Trees Part III

Math function code - couple probs, tips?

Interface Question

Couple Code Snippets

Can you have nested code block? What does the compiler do? For example ...