Huffman code for single character file?

**Hunter2** · 09-04-2004

Hey guys, I'm currently trying to implement huffman compress/decompress functions. They generally work properly, but I've run into a problem: If a file only contains one type of character, say for example 'a', then when I build the tree it will have only one node - the root node - and consequently it will have no code associated with it. Thus when I write the 'compressed' file, only a frequency table and 'bits in last byte' count are written to file, and when I 'decompress' the file, I end up with an empty file even if the original was say half a gigabyte. Is there any neat solution to this problem without having to resort to special cases (for examples, if root node has no childs, assign it a code of 0)? And also how would the tree be built if the original file was empty to begin with?

**P.s. I just thought of something... have a root-root node with its 'left' pointer referencing the 'real' root node. Would this work?

**EDIT**
After doing some board searches, I'm gonna look up the Deflate algorithm and see if that helps any... But I'd still very much appreciate any responses to my question.

**XSquared** · 09-04-2004

I'd just create a dummy node like you suggested.

**Hunter2** · 09-04-2004

Ok, well that fixed the problem for the empty and single-character files, and continues to function for very small test files (ASCII text). However, I tried compressing a ~800k bitmap, and it ended up the same size plus 1kb, and when I decompressed it it ended up noticeably smaller than the original (not to mention its 'compressed' size), and understandably it was unviewable using paint. I'm currently reviewing my code and trying to find problems. Could someone take a look at the following and make sure there's nothing sneaky in it? I don't see anything but I'd like to be sure...
**EDIT** root refers to the dummy node referencing the real root.

Code:

typedef std::map<unsigned char, std::string> SymbolMap_t;
 
void createSymbolMap(Node* root, SymbolMap_t& symbolMap, int lr = 0, std::string baseCode = "")
{
   if(lr == -1)
	 baseCode += '0';
   else if(lr == 1)
	 baseCode += '1';
 
   if(root->left == NULL && root->right == NULL)
	 symbolMap[root->val] = baseCode;
   else
   {
	 if(root->left != NULL)
		createSymbolMap(root->left, symbolMap, -1, baseCode);
	 if(root->right != NULL)
		createSymbolMap(root->right, symbolMap, 1, baseCode);
   }
}

P.s. XSquared, a question about your A70-TS1: Do you have ay idea what the alt-f12 hotkey's supposed to do? The user manual stops at Alt-F10

**Hunter2** · 09-04-2004

Err oops, never mind that last bit about looking over the code (except, XSquared, the part about the hotkey

). I just realized, the decoded file matches the original exactly except that it is truncated by 100 or so kb. Then, inspecting the compressed file in a hex editor, I found that the last encoded symbol doesn't match the last byte in the original file... which means *gasp* the compressed file should probably be 100k larger than the original

. So now I'm going to go over my compressing code and try to figure out why part of the file is missing.

[INSERTED]Simple bug it seems, the codes for a lot of symbols exceeded 8 bits, and for each symbol inserted into my BitStream class object only a maximum of 8 bits were extracted (1 byte written to file); thus the missing bits. I fixed it by replacing an if with while.[/INSERTED]

I'd still like to know why the compressed file ends up so much larger than the original though; I'm pretty sure I followed the Huffman concept properly, generating a tree with the lowest-probability symbols having the longest codes through bottom-up parentizing, or whatever it's called. The only thing I can think of is that extra node I added (I know it increases the size by a huge factor), but then if I get rid of it then I'm back where I started with the single-character and empty files. This is driving me insane

**XSquared** · 09-05-2004

Actually, I wrote up a Huffman class for a contest a while ago on this board. If you want to compeare your code to mine you can.

**Hunter2** · 09-05-2004

Ok, I'll take a look. But I'd still like to know what the toshiba hotkey does!

**Hunter2** · 09-05-2004

XSquared, apparently your Huffman algorithm also falls down on the single-character files (test file: "aaaaaaaaaaaaaaaa"). It crashes while 'merging' nodes, because your code creates a new parent node, assigns nodes.at(0) to the left pointer, erases nodes.begin(), then assigns nodes.at(0) to the right pointer. The problem with both our programs is that with a single-character file, there is only one node to begin with... and thus, mine gave a blank file and yours tried accessing a non-existent vector element.

Can anyone here help out two poor blokes victimized by Huffman?

**ggs** · 09-05-2004

well, if you're eating up your entire input first, throw in a special case and encode the entire run of 'x' with rle

**Hunter2** · 09-05-2004

>>if you're eating up your entire input first
By that are you referring to the first run-through of the file in building the frequency table?

How would RLE be worked into Huffman anyway?.. Or are you suggesting that I just break away from 'pure' huffman in my compression function? (I'd been hoping to avoid that, since 'huffmanEncode()/Decode()' sounds cooler than 'compress()/decompress()'

)

**Prelude** · 09-05-2004

>Can anyone here help out two poor blokes victimized by Huffman?
What fun would that be?

I have no doubt that with sufficient research you'll find out what is going wrong.

>How would RLE be worked into Huffman anyway?
As a special case, if the frequency table only has one entry then you use RLE. It wouldn't be terribly difficult to compress, but the decoding would make for some interesting parsing.

**Hunter2** · 09-05-2004

Err ok, I solved the 'huge file' problem by adding a special case in my tree-generating algo so that the dummy node only gets created when there's less than 2 distinct symbols; and that way empty and single-character files still work with the existing huffman implementation.

Ah well, I'm probably going to try doing the 'series of compression algos' thing anyway, since it's supposed to get you better compression.

**EDIT**
>>What fun would that be?

I have no doubt that with sufficient research you'll find out what is going wrong.
Well, yes, I'd already figured that part out. It's just the part about fixing it that I was having trouble with

But, I'd hoped that someone would have a way to fix it without resorting to special cases like I eventually did (the various sites that I went to didn't even mention the single-character case, so I assumed there would be an algo somewhere out there that would cover it without special cases).

**Prelude** · 09-05-2004

>It's just the part about fixing it that I was having trouble with
By "what is going wrong" I meant "why it doesn't work, thus showing you how to fix it".

>I'd hoped that someone would have a way to fix it without resorting to special cases like I eventually did
When the algorithm is properly implemented, your problems don't exist. That's why I was talking about research rather than giving you hints about what special case to handle.

**Hunter2** · 09-05-2004

>>When the algorithm is properly implemented, your problems don't exist.
After looking at XSquared's code (in which the problem does exist), and also at Michael Dipperstein's code (version 0.1 at this site), which also uses a special case of sorts to handle the one-symbol bug, I was getting the message that there was no way to do it aside from special cases. But now that I *know* that a solution exists, I guess I'll do some more (re)searching and see if I can find out what you're referring to

**Sang-drax** · 09-06-2004

Are your symbols 8-bit long, or do you use a different alphabet?

How much does your implementation compress an average text file?

Is the Huffman algorithm really that good? I mean, only the 8 most common characters of 256 would be encoded with fewer than 8 bits. All other would require more than 8 bits to encode. Obviously, there would be some gains, unless the content is completely random, but how much?

Also, why traverse a tree when decoding files? Why not just count the number of binary ones until the next binary zero and then use that number to look up the character in an array?

**Sang-drax** · 09-06-2004

I don't know if this is what you are looking for, but I just wrote this code that encodes a text string using Huffman compression.

It handles the single-character case correctly, at least.

There are probably a few bugs in this code, I wrote it in sort of a hurry.

Code:

include <iostream>
#include <map>
#include <algorithm>
#include <fstream>
using namespace std;
void writeByte(char)
{
}
int totalNumberOfBits = 0;
void writeBits(int bit, int numberOfBits )
{
	totalNumberOfBits += numberOfBits;
	for (int i=0;i<numberOfBits;++i)
		cout << bit;
}
struct CharData
{
	CharData()
	{
		freq = ch = 0;
	}
	unsigned char ch;
	int freq;
};
bool freqLess(const CharData& left, const CharData& right)
{
	return left.freq > right.freq;
}
void huffmanCompress(unsigned char data[], int length)
{
	//Count the frequencies of the different 
	//characters
	CharData chData[256];
	for (int c=0; c<length; ++c)
	{
		chData[data[c]].ch = data[c];
		chData[data[c]].freq ++;
	}
	//Sort the frequencies
	std::sort(chData, chData+256, freqLess);
	//Create a vector that converts a character
	//into the appropiate number of ones
	int charToNumber[256];
	//Go through the frequency table 
	int maximumBits = 1;
 
	for (int i = 0; i<256 && chData[i].freq != 0; ++i)
	{
		//Save in the conversion table
		charToNumber[ chData[i].ch ] = i;
		if (maximumBits <= i)
			maximumBits = i;
		//Save to the beginning of the file which
		//is the frequency table
		writeByte( chData[i].ch );
		//DEBUG
		cout << "\'" << chData[i].ch << "\' will be represented with " << i << " ones." << endl;
	}
	//Mark the end of the header
	writeByte(0);
	writeByte(0);
	cout << endl;
	//Save the characters 
	for (int c=0; c<length; ++c)
	{
		writeBits( 1, charToNumber[data[c]] );
		if (charToNumber[data[c]] < maximumBits)
			//Write a zero
			writeBits(0,1);
		cout << " ";
	}
}
 
int main() 
{
	unsigned char dataFile[10240] = "";
	ifstream fin("C:\\Petter\\Text.txt");
	int p=0;
	while (fin)
		dataFile[p++] = fin.get();
	p--;
	huffmanCompress(dataFile, p);
	cout << endl << "Total number of original bytes: " << p;
	cout << endl << "Total number of compresssed bytes: " << (totalNumberOfBits>>3);
	cin.get();
}

EDIT: Perhaps I'm not implementing the algorithm right (I've never looked at it before). I'm getting

Code:

Total number of original bytes: 9895
Total number of compresssed bytes: 13335

For some larger text I tried to compress. Compresses fine for small amounts, though.

Thread: Huffman code for single character file?

Thread Tools

Search Thread

Display

Huffman code for single character file?

Similar Threads

File transfer- the file sometimes not full transferred

Formatting a text file...

Need Help Fixing My C Program. Deals with File I/O

Can we have vector of vector?