I didn't say it was the only solution. I only said it was the only solution I could see so far. That means that I haven't seen any solution yet that would work as well as re-hashing.
Anyway, I'm just testing the waters right now, and my implementation is rather simple (and incomplete). Just for reference:
Code:
#include <cassert>
#include <cstddef>
#include <list>
#include <stdexcept>
#include <utility>
#include <vector>

template<typename Key_t, typename Type_t>
class XHashTable
{
public:
    // LoadFactor is the minimum ratio of buckets to items; when the table
    // drops below it, the bucket array is doubled and everything is re-hashed.
    explicit XHashTable(float LoadFactor = 0.01f)
        : m_Values(100), m_Items(0), m_Collisions(0), m_LoadFactor(LoadFactor)
    {
        assert(LoadFactor > 0);
    }

    void Add(const Key_t& Key, const Type_t& Value)
    {
        if (IFind(Key)) // reject duplicates before touching any counters
            return;
        ++m_Items;
        // Note the float division: the original integer division truncated
        // to 0 as soon as m_Items exceeded the bucket count.
        if (static_cast<float>(m_Values.size()) / m_Items < m_LoadFactor)
        {
            m_Values.resize(m_Values.size() * 2);
            Rehash();
        }
        auto& list = m_Values.at(HashKey(Key));
        if (!list.empty())
            ++m_Collisions;
        list.emplace_back(Key, Value);
    }

    Type_t& Find(const Key_t& Key)
    {
        // Return a reference to the stored value; the original version
        // returned a dangling reference to a local temporary.
        Type_t* Found = IFind(Key);
        if (!Found)
            throw std::runtime_error("Key not found!");
        return *Found;
    }

    const Type_t& Find(const Key_t& Key) const;

    unsigned int GetCollisions() const { return m_Collisions; }
    std::size_t GetStorageSize() const { return m_Values.size(); }
    unsigned int GetMapSize() const { return m_Items; }

protected:
    // Key_t is assumed to be a contiguous, byte-addressable container
    // (e.g. std::string), since SpookyHash takes a raw buffer.
    std::size_t HashKey(const Key_t& Key) const
    {
        return SpookyHash::Hash32(&Key[0], Key.size(), 0) % m_Values.size();
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    Type_t* IFind(const Key_t& Key)
    {
        auto& list = m_Values.at(HashKey(Key));
        for (auto it = list.begin(); it != list.end(); ++it)
            if (it->first == Key)
                return &it->second;
        return nullptr;
    }

    void Rehash()
    {
        for (std::size_t it = 0; it < m_Values.size(); ++it)
        {
            auto& list = m_Values.at(it);
            for (auto list_it = list.begin(); list_it != list.end();)
            {
                // Hash the *key*, not the value -- the original hashed
                // &value, so moved entries could never be found again.
                auto NewHash = HashKey(list_it->first);
                if (NewHash != it)
                {
                    m_Values.at(NewHash).push_back(*list_it);
                    list_it = list.erase(list_it);
                }
                else
                    ++list_it;
            }
        }
    }

    std::vector<std::list<std::pair<Key_t, Type_t>>> m_Values;
    unsigned int m_Items;
    unsigned int m_Collisions;
    float m_LoadFactor;
};
What collision strategy to use is an interesting question, and one I'll think about later. For now, I'm simply using a linked list in each bucket.
SpookyHash is a hashing function that can be found on the web.
But the problem remains: every time I expand the table, the bucket index for the same key changes. How does one fix this without re-hashing? From what I gather, an algorithm that returns the same bucket for every table size is impossible. It can only strive to fulfill it.
All I require is that every insertion can later be retrieved by searching for its key. There can be no misses: if I inserted something under a key, it must be retrievable.
Also, I am aiming for O(log N) lookup time at worst, and O(k*n) memory usage with k as close to 1 as possible.