Thread: Huffman code for single character file?

  1. #1
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879

    Huffman code for single character file?

    Hey guys, I'm currently trying to implement huffman compress/decompress functions. They generally work properly, but I've run into a problem: If a file only contains one type of character, say for example 'a', then when I build the tree it will have only one node - the root node - and consequently it will have no code associated with it. Thus when I write the 'compressed' file, only a frequency table and 'bits in last byte' count are written to file, and when I 'decompress' the file, I end up with an empty file even if the original was say half a gigabyte. Is there any neat solution to this problem without having to resort to special cases (for examples, if root node has no childs, assign it a code of 0)? And also how would the tree be built if the original file was empty to begin with?

    **P.s. I just thought of something... have a root-root node with its 'left' pointer referencing the 'real' root node. Would this work?

    **EDIT**
    After doing some board searches, I'm gonna look up the Deflate algorithm and see if that helps any... But I'd still very much appreciate any responses to my question.
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  2. #2
    C++ Developer XSquared's Avatar
    Join Date
    Jun 2002
    Location
    Ontario, Canada
    Posts
    2,718
    I'd just create a dummy node like you suggested.
    Naturally I didn't feel inspired enough to read all the links for you, since I already slaved away for long hours under a blistering sun pressing the search button after typing four whole words! - Quzah

    You. Fetch me my copy of the Wall Street Journal. You two, fight to the death - Stewie

  3. #3
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    Ok, well that fixed the problem for the empty and single-character files, and continues to function for very small test files (ASCII text). However, I tried compressing a ~800k bitmap, and it ended up the same size plus 1kb, and when I decompressed it it ended up noticeably smaller than the original (not to mention its 'compressed' size), and understandably it was unviewable using paint. I'm currently reviewing my code and trying to find problems. Could someone take a look at the following and make sure there's nothing sneaky in it? I don't see anything but I'd like to be sure...
    **EDIT** root refers to the dummy node referencing the real root.
    Code:
    typedef std::map<unsigned char, std::string> SymbolMap_t;
     
    void createSymbolMap(Node* root, SymbolMap_t& symbolMap, int lr = 0, std::string baseCode = "")
    {
       if(lr == -1)
    	 baseCode += '0';
       else if(lr == 1)
    	 baseCode += '1';
     
       if(root->left == NULL && root->right == NULL)
    	 symbolMap[root->val] = baseCode;
       else
       {
    	 if(root->left != NULL)
    		createSymbolMap(root->left, symbolMap, -1, baseCode);
    	 if(root->right != NULL)
    		createSymbolMap(root->right, symbolMap, 1, baseCode);
       }
    }
    P.s. XSquared, a question about your A70-TS1: Do you have ay idea what the alt-f12 hotkey's supposed to do? The user manual stops at Alt-F10
    Last edited by Hunter2; 09-04-2004 at 08:51 PM.
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  4. #4
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    Err oops, never mind that last bit about looking over the code (except, XSquared, the part about the hotkey ). I just realized, the decoded file matches the original exactly except that it is truncated by 100 or so kb. Then, inspecting the compressed file in a hex editor, I found that the last encoded symbol doesn't match the last byte in the original file... which means *gasp* the compressed file should probably be 100k larger than the original . So now I'm going to go over my compressing code and try to figure out why part of the file is missing.

    [INSERTED]Simple bug it seems, the codes for a lot of symbols exceeded 8 bits, and for each symbol inserted into my BitStream class object only a maximum of 8 bits were extracted (1 byte written to file); thus the missing bits. I fixed it by replacing an if with while.[/INSERTED]

    I'd still like to know why the compressed file ends up so much larger than the original though; I'm pretty sure I followed the Huffman concept properly, generating a tree with the lowest-probability symbols having the longest codes through bottom-up parentizing, or whatever it's called. The only thing I can think of is that extra node I added (I know it increases the size by a huge factor), but then if I get rid of it then I'm back where I started with the single-character and empty files. This is driving me insane
    Last edited by Hunter2; 09-04-2004 at 09:55 PM.
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  5. #5
    C++ Developer XSquared's Avatar
    Join Date
    Jun 2002
    Location
    Ontario, Canada
    Posts
    2,718
    Actually, I wrote up a Huffman class for a contest a while ago on this board. If you want to compeare your code to mine you can.
    Naturally I didn't feel inspired enough to read all the links for you, since I already slaved away for long hours under a blistering sun pressing the search button after typing four whole words! - Quzah

    You. Fetch me my copy of the Wall Street Journal. You two, fight to the death - Stewie

  6. #6
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    Ok, I'll take a look. But I'd still like to know what the toshiba hotkey does!
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  7. #7
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    XSquared, apparently your Huffman algorithm also falls down on the single-character files (test file: "aaaaaaaaaaaaaaaa"). It crashes while 'merging' nodes, because your code creates a new parent node, assigns nodes.at(0) to the left pointer, erases nodes.begin(), then assigns nodes.at(0) to the right pointer. The problem with both our programs is that with a single-character file, there is only one node to begin with... and thus, mine gave a blank file and yours tried accessing a non-existent vector element.

    Can anyone here help out two poor blokes victimized by Huffman?
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  8. #8
    C > C++ duders ggs's Avatar
    Join Date
    Aug 2001
    Posts
    435
    well, if you're eating up your entire input first, throw in a special case and encode the entire run of 'x' with rle
    .sect signature

  9. #9
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    >>if you're eating up your entire input first
    By that are you referring to the first run-through of the file in building the frequency table?

    How would RLE be worked into Huffman anyway?.. Or are you suggesting that I just break away from 'pure' huffman in my compression function? (I'd been hoping to avoid that, since 'huffmanEncode()/Decode()' sounds cooler than 'compress()/decompress()' )
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  10. #10
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    >Can anyone here help out two poor blokes victimized by Huffman?
    What fun would that be? I have no doubt that with sufficient research you'll find out what is going wrong.

    >How would RLE be worked into Huffman anyway?
    As a special case, if the frequency table only has one entry then you use RLE. It wouldn't be terribly difficult to compress, but the decoding would make for some interesting parsing.
    My best code is written with the delete key.

  11. #11
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    Err ok, I solved the 'huge file' problem by adding a special case in my tree-generating algo so that the dummy node only gets created when there's less than 2 distinct symbols; and that way empty and single-character files still work with the existing huffman implementation.

    Ah well, I'm probably going to try doing the 'series of compression algos' thing anyway, since it's supposed to get you better compression.


    **EDIT**
    >>What fun would that be? I have no doubt that with sufficient research you'll find out what is going wrong.
    Well, yes, I'd already figured that part out. It's just the part about fixing it that I was having trouble with

    But, I'd hoped that someone would have a way to fix it without resorting to special cases like I eventually did (the various sites that I went to didn't even mention the single-character case, so I assumed there would be an algo somewhere out there that would cover it without special cases).
    Last edited by Hunter2; 09-05-2004 at 04:41 PM.
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  12. #12
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    >It's just the part about fixing it that I was having trouble with
    By "what is going wrong" I meant "why it doesn't work, thus showing you how to fix it".

    >I'd hoped that someone would have a way to fix it without resorting to special cases like I eventually did
    When the algorithm is properly implemented, your problems don't exist. That's why I was talking about research rather than giving you hints about what special case to handle.
    My best code is written with the delete key.

  13. #13
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    >>When the algorithm is properly implemented, your problems don't exist.
    After looking at XSquared's code (in which the problem does exist), and also at Michael Dipperstein's code (version 0.1 at this site), which also uses a special case of sorts to handle the one-symbol bug, I was getting the message that there was no way to do it aside from special cases. But now that I *know* that a solution exists, I guess I'll do some more (re)searching and see if I can find out what you're referring to
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  14. #14
    S Sang-drax's Avatar
    Join Date
    May 2002
    Location
    Göteborg, Sweden
    Posts
    2,072
    Are your symbols 8-bit long, or do you use a different alphabet?

    How much does your implementation compress an average text file?

    Is the Huffman algorithm really that good? I mean, only the 8 most common characters of 256 would be encoded with fewer than 8 bits. All other would require more than 8 bits to encode. Obviously, there would be some gains, unless the content is completely random, but how much?

    Also, why traverse a tree when decoding files? Why not just count the number of binary ones until the next binary zero and then use that number to look up the character in an array?
    Last edited by Sang-drax; 09-06-2004 at 10:07 AM.
    Last edited by Sang-drax : Tomorrow at 02:21 AM. Reason: Time travelling

  15. #15
    S Sang-drax's Avatar
    Join Date
    May 2002
    Location
    Göteborg, Sweden
    Posts
    2,072
    I don't know if this is what you are looking for, but I just wrote this code that encodes a text string using Huffman compression.

    It handles the single-character case correctly, at least.

    There are probably a few bugs in this code, I wrote it in sort of a hurry.
    Code:
    include <iostream>
    #include <map>
    #include <algorithm>
    #include <fstream>
    using namespace std;
    void writeByte(char)
    {
    }
    int totalNumberOfBits = 0;
    void writeBits(int bit, int numberOfBits )
    {
    	totalNumberOfBits += numberOfBits;
    	for (int i=0;i<numberOfBits;++i)
    		cout << bit;
    }
    struct CharData
    {
    	CharData()
    	{
    		freq = ch = 0;
    	}
    	unsigned char ch;
    	int freq;
    };
    bool freqLess(const CharData& left, const CharData& right)
    {
    	return left.freq > right.freq;
    }
    void huffmanCompress(unsigned char data[], int length)
    {
    	//Count the frequencies of the different 
    	//characters
    	CharData chData[256];
    	for (int c=0; c<length; ++c)
    	{
    		chData[data[c]].ch = data[c];
    		chData[data[c]].freq ++;
    	}
    	//Sort the frequencies
    	std::sort(chData, chData+256, freqLess);
    	//Create a vector that converts a character
    	//into the appropiate number of ones
    	int charToNumber[256];
    	//Go through the frequency table 
    	int maximumBits = 1;
     
    	for (int i = 0; i<256 && chData[i].freq != 0; ++i)
    	{
    		//Save in the conversion table
    		charToNumber[ chData[i].ch ] = i;
    		if (maximumBits <= i)
    			maximumBits = i;
    		//Save to the beginning of the file which
    		//is the frequency table
    		writeByte( chData[i].ch );
    		//DEBUG
    		cout << "\'" << chData[i].ch << "\' will be represented with " << i << " ones." << endl;
    	}
    	//Mark the end of the header
    	writeByte(0);
    	writeByte(0);
    	cout << endl;
    	//Save the characters 
    	for (int c=0; c<length; ++c)
    	{
    		writeBits( 1, charToNumber[data[c]] );
    		if (charToNumber[data[c]] < maximumBits)
    			//Write a zero
    			writeBits(0,1);
    		cout << " ";
    	}
    }
     
    int main() 
    {
    	unsigned char dataFile[10240] = "";
    	ifstream fin("C:\\Petter\\Text.txt");
    	int p=0;
    	while (fin)
    		dataFile[p++] = fin.get();
    	p--;
    	huffmanCompress(dataFile, p);
    	cout << endl << "Total number of original bytes: " << p;
    	cout << endl << "Total number of compresssed bytes: " << (totalNumberOfBits>>3);
    	cin.get();
    }
    EDIT: Perhaps I'm not implementing the algorithm right (I've never looked at it before). I'm getting
    Code:
    Total number of original bytes: 9895
    Total number of compresssed bytes: 13335
    For some larger text I tried to compress. Compresses fine for small amounts, though.
    Last edited by Sang-drax; 09-06-2004 at 11:01 AM.
    Last edited by Sang-drax : Tomorrow at 02:21 AM. Reason: Time travelling

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. File transfer- the file sometimes not full transferred
    By shu_fei86 in forum C# Programming
    Replies: 13
    Last Post: 03-13-2009, 12:44 PM
  2. Formatting a text file...
    By dagorsul in forum C Programming
    Replies: 12
    Last Post: 05-02-2008, 03:53 AM
  3. Need Help Fixing My C Program. Deals with File I/O
    By Matus in forum C Programming
    Replies: 7
    Last Post: 04-29-2008, 07:51 PM
  4. Can we have vector of vector?
    By ketu1 in forum C++ Programming
    Replies: 24
    Last Post: 01-03-2008, 05:02 AM