Thread: hash table

  1. #1
    Registered User
    Join Date
    May 2009
    Posts
    242

    hash table

    in the attempt to learn how hash tables really work, i decided to try implementing just a simple one (again working from the guidelines in Cormen, et al., Intro to Algorithms), where i'm storing key-value pairs where both key and value are strings and where duplicate keys are allowed.

    Here's what I've got as header file to start with but wanted to hear whether this is right in terms of the basic structure of what a hash table actually is (no hash function implemented yet since this is just describing the structure):

    Code:
    #ifndef HASH_TABLE_H
    #define HASH_TABLE_H
    
    #include <list>
    #include <utility>	// for pair to get key-value pairs
    #include <string>
    
    typedef std::pair<std::string, std::string> KeyValue;	// key-value pair where key and value are both strings
    typedef std::list<KeyValue> KeyValueList;
    const double SIZE_FACTOR = 1.33;
    
    class HashTable {
    private:	
    	KeyValueList * h_tbl;
    	unsigned tbl_size;	// size of the hash_table
    	unsigned prime_generator(unsigned min);	// will generate the next prime larger than min
    public:
    	HashTable(const size_t size);
    	~HashTable() {delete [] h_tbl; }	// constructor will clearly use new []
    
    	// basic hash_table operations, should all have average runtime of O(1):
    	void insert(const KeyValue &item);
    	/**
    	@param key The key which is to be located
    	@return If the key is found at some index, returns an index where it is found
    	Note that if there are multiple instances of key, it will be unpredictable
    	which one is returned
    	If the key is not found anywhere, then -1 is returned
    	*/
    	int search(const std::string &key) const;
    	/**
    	@param key The key to find and delete. All instances of key should be deleted.
    	@return Returns true iff the key was found and something was deleted.
    	Returns false if key not found (in which case no action is taken).
    	*/
    	bool remove(const std::string &key);
    };
    
    #endif
    i guess my own initial question would be on the search() function, which seems very basic at this point. Extrernally, the integer returned is useful only as flag for whether the key exists or not. even if you allow duplicates, i would think you'd eventually in practice at least want something that would give you a list (as return value in C++, std::vector<std::string> might be better than std::list<std::string> ?) of all values associated with the given key.

  2. #2
    Registered User
    Join Date
    May 2009
    Posts
    242
    I think this cleans up some things and also adds functionality that we would want for testing how well our hash table is actually doing in terms of collision avoidance:

    Code:
    #ifndef HASH_TABLE_H
    #define HASH_TABLE_H
    
    #include <list>
    #include <utility>	// for pair to get key-value pairs
    #include <string>
    #include <vector>
    
    typedef std::pair<std::string, std::string> KeyValue;	// key-value pair where key and value are both strings
    typedef std::list<KeyValue> KeyValueList;
    const double SIZE_FACTOR = 1.33;
    
    class HashTable {
    private:	
    	KeyValueList * h_tbl;
    	unsigned tbl_size;	// size of the hash_table
    	/**
    	The hash function
    	Note that this is not a static method because it will make use of the prime number
    	tbl_size associated with a specific HashTable
    	@param key an arbitrary input string (used as key in a key-value pair)
    	@return The index in tbl_size where we will append this key-value pair to our list
    	*/
    	unsigned hash(const std::string &key) const;
    	static unsigned prime_generator(const unsigned &min);	// will generate the next prime at least as big as min
    public:
    	HashTable(const size_t size);
    	~HashTable() {delete [] h_tbl; }	// constructor will clearly use new []
    
    	// basic hash_table operations, should all have average runtime of O(1):
    	/**
    	@param item Item to insert into HashTable
    	*/
    	void insert(const KeyValue &item);
    	/**
    	@param key The key which is to be located
    	@return If the key is found at some index, returns an index where it is found
    	Note that if there are multiple instances of key, it will be unpredictable
    	which one is returned
    	If the key is not found anywhere, then -1 is returned
    	*/
    	int search(const std::string &key) const;
    	/**
    	@param key The key to find and delete. All instances of key should be deleted.
    	@return Returns true iff the key was found and something was deleted.
    	Returns false if key not found (in which case no action is taken).
    	*/
    	bool remove(const std::string &key);
    
    	/**
    	a beefier search method
    	@param key The key we are searching for
    	@return returns a vector of all values associated with the given key
    	The vector will be empty if no match is found
    	*/
    	std::vector<std::string> get_values(const std::string &key) const;
    
    	// methods to show performance
    	/**
    	This function is one measure of how well our hash function
    	is working. It returns the ratio of filled slots to available slots
    	in h_tbl--i.e., the percent of slots filled, expressed as a double
    	in [0, 1]
    	@return Number of slots with a non-empty list divided by total number of slots
    	*/
    	double showDensity() const;
    
    	/**
    	This method returns the length of the longest list to show
    	how many collisions we have had to deal with
    	@return length of longest KeyValue list.
    	*/
    	size_t longestListLength() const;
    };
    
    #endif

  3. #3
    Novice
    Join Date
    Jul 2009
    Posts
    568
    There appears to be a whole lot of comments and not a whole lot of code.

    What is your question, exactly?

  4. #4
    Registered User
    Join Date
    May 2009
    Posts
    242
    The question is just whether this is what the header file (hence essentially no implementations here) for a simple HashTable class should look like.

    I'm trying to understand how a hash table is supposed to work.

  5. #5
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    You don't really have a way to do a lot of common operations.

    I can't search for a value.

    I can't search for a range of values.

    I can't change the hashing mechanism.

    I can't cache the values for a set of keys for processing or manipulation.

    You are missing a lot.

    How do you envision using a hash table? Before you can write a proper implementation of a hash table, you must have experience with their usage.

    Soma

  6. #6
    Registered User
    Join Date
    May 2009
    Posts
    242
    As for the missing advanced functionality, I'm really only looking for the basics here so that I can simply understand what hashing with chaining is doing before trying to get fancy at all.

    btw, if hashing is just on the key of a key-value pair, doesn't searching for a value become O(n) with searching for a key O(1)?

    Anyhow, here's my (provisionally) final version of a very simplistic hash table with chaining. Header file:
    Code:
    /**
    @file
    hash table class
    @author Marshall Farrier
    @date 9/20/10
    */
    
    #ifndef HASH_TABLE_H
    #define HASH_TABLE_H
    
    #include <list>
    #include <utility>	// for pair to get key-value pairs
    #include <string>
    #include <vector>
    
    typedef std::pair<std::string, std::string> KeyValue;	// key-value pair where key and value are both strings
    typedef std::list<KeyValue> KeyValueList;
    const double SIZE_FACTOR = 1.33;
    
    /**
    This HashTable class uses hashing with chaining
    */
    class HashTable {
    private:	
    	KeyValueList * h_tbl;
    	unsigned tbl_size;	// size of the hash_table
    	/**
    	The hash function
    	Note that this is not a static method because it will make use of the prime number
    	tbl_size associated with a specific HashTable
    	Simple application of division method, CLRS, p. 263
    	@param key an arbitrary input string (used as key in a key-value pair)
    	@return The index in tbl_size where we will append this key-value pair to our list
    	*/
    	unsigned hash(const std::string &key) const;
    	static unsigned prime_generator(const unsigned &min);	// will generate the next prime at least as big as min
    
    	/**
    	@param max ceiling on value returned
    	@return The largest power of 2 that is <= max
    	If max == 0, 0 is returned (even though not a power of 2)
    	*/
    	static unsigned power_of_two(const unsigned &max);	// returns the biggest power of 2 that is < max
    public:
    	HashTable(const size_t size);
    	~HashTable() {delete [] h_tbl; }	// constructor will clearly use new []
    
    	// basic hash_table operations, should all have average runtime of O(1):
    	/**
    	@param item Item to insert into HashTable
    	*/
    	void insert(const KeyValue &item);
    	/**
    	@param key The key which is to be located
    	@return If the key is found at some index, returns an index where it is found
    	Note that if there are multiple instances of key, it will be unpredictable
    	which one is returned
    	If the key is not found anywhere, then -1 is returned
    	*/
    	int search(const std::string &key) const;
    	/**
    	@param key The key to find and delete. All instances of key should be deleted.
    	@return Returns true iff the key was found and something was deleted.
    	Returns false if key not found (in which case no action is taken).
    	*/
    	bool remove(const std::string &key);
    
    	/**
    	a beefier search method
    	@param key The key we are searching for
    	@return returns a vector of all values associated with the given key
    	The vector will be empty if no match is found
    	*/
    	std::vector<std::string> get_values(const std::string &key) const;
    
    	// methods to show performance
    	/**
    	This function is one measure of how well our hash function
    	is working. It returns the ratio of filled slots to available slots
    	in h_tbl--i.e., the percent of slots filled, expressed as a double
    	in [0, 1]
    	@return Number of slots with a non-empty list divided by total number of slots
    	*/
    	double density() const;
    
    	/**
    	This method returns the length of the longest list to show
    	how many collisions we have had to deal with
    	@return length of longest KeyValue list.
    	*/
    	size_t longest_chain() const;
    
    	/**
    	@return Returns the number of elements in the container
    	*/
    	size_t size() const;
    
    	/**
    	return Average number of elements stored in a chain (cf. CLRS, p. 258)
    	*/
    	double load_factor() const;
    };
    
    #endif
    Implementation file:
    Code:
    /**
    @file
    @author Marshall Farrier
    @date 9/20/10
    */
    
    #include "hash_table.h"
    #include <cmath>
    
    /**
    Just a basic deterministic hash function to begin with
    */
    unsigned HashTable::hash(const std::string &key) const {
    	unsigned result = 0;
    	// unsigned long power_of_base = 1;
    
    	const unsigned BASE = 128;	// small ASCII set
    	size_t len = key.length();
    	
    	for(int i = 0; i < len; ++i) {
    		// cf. Sedgewick, Algorithms in C, p. 578 (Horner's algorithm)
    		result = (result * BASE + (unsigned)key.at(i)) % tbl_size;
    	}
    
    	return result;
    }
    
    /**
    This implementation is horribly inefficient but should
    be ok for small values
    */
    unsigned HashTable::prime_generator(const unsigned &min) {
    	unsigned result = min;
    	unsigned max_factor, i;
    
    	while (true) {
    		max_factor = std::sqrt((double)result);
    		for (i = 2; i <= max_factor; ++i) {
    			if (result % i == 0) break;
    		}
    		if (i > max_factor) return result;	// result is prime
    		++result;
    	}
    }
    
    unsigned HashTable::power_of_two(const unsigned &max) {
    	unsigned max_cpy = max;
    	unsigned result = 1;
    	
    	while (max_cpy > 0) {
    		max_cpy >>= 1;
    		result <<= 1;
    	}
    	result >>= 1;
    
    	return result;
    }
    
    HashTable::HashTable(const size_t size) {
    	unsigned _size = size * SIZE_FACTOR;
    	unsigned pow_two = power_of_two(_size);
    	
    	/*
    	These operations are possibly overly cautious with regard to proximity
    	to powers of two
    	*/
    	if (_size < pow_two * 1.33) _size = pow_two * 1.4;
    	else if (pow_two * 1.67 < _size) _size = pow_two * 2.8;
    
    	tbl_size = prime_generator(_size);
    	h_tbl = new KeyValueList[tbl_size];
    }
    
    void HashTable::insert(const KeyValue &item) {
    	unsigned i = hash(item.first);
    	h_tbl[i].push_front(item);
    }
    
    int HashTable::search(const std::string &key) const {
    	int i = hash(key);
    	if (h_tbl[i].empty()) return -1;
    
    	KeyValueList::iterator it;
    	for (it = h_tbl[i].begin(); it != h_tbl[i].end(); ++it) {
    		if (key == it->first) return i;
    	}
    
    	return -1;
    }
    
    bool HashTable::remove(const std::string &key) {
    	int i = hash(key);
    	bool found = false;
    
    	if (h_tbl[i].empty()) return found;	
    	
    	KeyValueList::iterator it;
    	for (it = h_tbl[i].begin(); it != h_tbl[i].end(); ++it) {
    		if (key == it->first) {
    			found = true;
    			it = h_tbl[i].erase(it);
    			--it;
    		}
    	}
    
    	return found;
    }
    
    std::vector<std::string> HashTable::get_values(const std::string &key) const {
    	std::vector<std::string> result;
    	int i = hash(key);
    	
    	if (h_tbl[i].empty()) return result;
    
    	KeyValueList::iterator it;
    	for (it = h_tbl[i].begin(); it != h_tbl[i].end(); ++it) {
    		if (key == it->first) result.push_back(it->second);		
    	}
    
    	return result;
    }
    
    double HashTable::density() const {
    	int full = 0;
    	for (int i = 0; i < tbl_size; ++i) {
    		if (!h_tbl[i].empty()) ++full;
    	}
    	return (double)full / tbl_size;
    }
    
    size_t HashTable::longest_chain() const {
    	size_t result = 0;
    	for (int i = 0; i < tbl_size; ++i) {
    		if (result < h_tbl[i].size()) result = h_tbl[i].size();
    	}
    	return result;
    }
    
    size_t HashTable::size() const {
    	size_t result = 0;
    	for (int i = 0; i < tbl_size; ++i) {
    		result += h_tbl[i].size();
    	}
    	return result;
    }
    
    double HashTable::load_factor() const {
    	return (double)size() / tbl_size;
    }
    quick and simple demo file, which shows my table to have about half the slots occupied at all and a longest chain of 3. is that roughly normal?

    Code:
    /**
    @file
    demonstrates functionality of a hash table
    @author Marshall Farrier
    @date 9/20/10
    */
    
    #include <iostream>
    #include <string>
    #include <vector>
    #include "hash_table.h"
    
    using namespace std;
    
    void create_dictionary(string filename);
    
    int main() {
    	const size_t SIZE = 25;
    	HashTable h1(SIZE);
    
    	// list from Sedgewick, p. 576
    	string words[] = {"now", "for", "tip", "ilk", "dim", "tag", "jot",
    		"sob", "nob", "sky", "hut", "ace", "bet", "men", "egg", "few",
    		"jay", "owl", "joy", "rap", "gig", "wee", "was", "cab", "wad"};
    	KeyValue tmp;
    
    	for (int i = 0; i < SIZE; ++i) {
    		tmp.first = words[i];
    		h1.insert(tmp);
    	}
    
    	cout << "Hash table size is " << h1.size() << endl;
    	cout << "Hash table density is " << h1.density() << endl;
    	cout << "Longest chain is " << h1.longest_chain() << endl;
    	cout << "Load factor is " << h1.load_factor() << endl;
    
    	cout << "\"joy\" is in slot " << h1.search("joy") << endl;
    	cout << "\"jay\" is in slot " << h1.search("jay") << endl;
    
    	return 0;
    }

  7. #7
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    btw, if hashing is just on the key of a key-value pair, doesn't searching for a value become O(n) with searching for a key O(1)?
    It depends on the nature of the implementation (the goals and requirements of the implementation).

    Some implementations are designed for faster insertion sacrificing amortized constant time for key lookup.

    Some implementations are optimized for space sacrificing amortized constant time for any operation.



    You are implementing a "hash table" as a "hash map" expressed with an array. It is natural to express any "map" in such a way that searching for a value is expensive due to a focus on key lookup or insertion. That is not necessarily the case. Some variations of "B Tree" structures (even some "hash map" implementations using a "B+Tree" structure as the primary storage for each bucket) focus on ordering by value to get in order iteration without complex node traversal at the cost of key operations.



    I don't have time to look over the source. I may get a look at it later.

    Soma

  8. #8
    Registered User
    Join Date
    May 2009
    Posts
    242
    Thanks very much for your help! My professor also seems to prefer probing to chaining, but I still need to figure out what probing really is.

    What matters to me at this stage is just that my hash() method correctly implements the division method for creating a hash function where collisions are to be resolved with chaining. And i'd be interested to know whether the actual number of collisions (longest chain is 3 with my sample dataset) is fairly typical.

    As to your remarks about searching by value, I'm pretty sure that Python allows that in the dictionary data type, and that that's implemented as a hash table, but i don't know how value searches are done, although i'd be rather surprized if it was just an exhaustive run-through of all the data.

  9. #9
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by Aisthesis View Post
    Thanks very much for your help! My professor also seems to prefer probing to chaining, but I still need to figure out what probing really is.
    Probing sucks to be honest. No matter how you do it, you still end up doing most likely much more work than you do for chaining, and the worst part is that you basically throw away any chance of being able to easily perform deletions from the hash table.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  10. #10
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Truth.

    You will generally, almost always, see better performance with a bad bucket algorithm and a good underlying algorithm for each bucket than a clever bucket algorithm that has to resort to probing.

    Soma

  11. #11
    Registered User
    Join Date
    May 2009
    Posts
    242
    well, i'll see what gets said when we do probing sometime this week.

    as to the underlying algorithm for each bucket: If you set it up (as this one should normally be but not necessarily--e.g., if user made a HashTable with size 20, then proceeded to put thousands of items in it) to have a fairly low average depth in each bucket, would it be worth the overhead to make each bucket into some kind of sorted object or is it generally better just to use fast insertion without worrying about the internal retrieval time inside each bucket.

    i just made each bucket a list (a bucket list, so i see!!), but you then have to walk through the list until you encounter the end or your key, whereas a sorted object would improved searching within the bucket at the expense of adding overhead to insertion.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Help understanding Hash Table
    By boblettoj99 in forum C Programming
    Replies: 3
    Last Post: 12-29-2009, 01:23 PM
  2. Writing array, to file
    By zootreeves in forum C Programming
    Replies: 9
    Last Post: 09-08-2007, 05:06 PM
  3. Group Project Help/Volunteer
    By DarkDot in forum C++ Programming
    Replies: 3
    Last Post: 04-24-2007, 11:36 PM
  4. Hash table creation and insertion
    By tgshah in forum C Programming
    Replies: 1
    Last Post: 01-23-2006, 07:54 PM
  5. Not sure on hash table memory allocations
    By Thumper333 in forum C Programming
    Replies: 3
    Last Post: 09-27-2004, 09:00 PM