# Thread: Doing proper hashing in a hash table?

1. I see. So I should consider this equal to an average hash table solution. Not too shabby; not perfect.
Maybe I should try to find out the average chain length.
Right now, I am just giving each element its own bucket; in the perfect scenario, one bucket holds one element.

2. Chaining is only one way of dealing with collisions. Another option is linear probing -- if the hash of the element is H and table[H] is already occupied, then insert it at table[H+1] instead (and if that's filled too, go to the next and the next until an empty spot is found).

To generalize linear probing: if table[H] collides, set H = collide(H) and try again. Linear probing means collide(H) = H+1.
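To make that concrete, here's a minimal linear-probing sketch in C. The byte-sum hash and tiny fixed-size table are toy placeholders of mine, not anything from this thread; a real table would use a better hash and grow as it fills:

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 8  /* tiny fixed-size table, just for illustration */

/* One key per bucket; NULL marks an empty slot. */
static const char *table[NBUCKETS];

/* Toy hash: sum of the key's bytes, mod the table size. */
static size_t hash(const char *key)
{
    size_t h = 0;
    while (*key)
        h += (unsigned char)*key++;
    return h % NBUCKETS;
}

/* Linear probing on insert: collide(H) = H + 1, wrapping around. */
static int insert(const char *key)
{
    size_t h = hash(key);
    for (size_t tries = 0; tries < NBUCKETS; tries++) {
        if (table[h] == NULL) {
            table[h] = key;
            return 1;
        }
        h = (h + 1) % NBUCKETS;  /* bucket taken: try the next one */
    }
    return 0;  /* table is full */
}

/* Lookup retraces the same probe sequence; the first empty slot
   proves the key was never inserted, so the search stops there. */
static const char *lookup(const char *key)
{
    size_t h = hash(key);
    for (size_t tries = 0; tries < NBUCKETS; tries++) {
        if (table[h] == NULL)
            return NULL;
        if (strcmp(table[h], key) == 0)
            return table[h];
        h = (h + 1) % NBUCKETS;
    }
    return NULL;
}
```

Note that lookup stops at the first empty slot it reaches: as long as there are no deletions, an empty slot on the probe path guarantees the key was never inserted.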

Or, multi-hashing: if Hash(element) is H and H collides, then use Hash2(element), and if that collides, go to Hash3(element), for as many hash functions as you've got.
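A sketch of that idea, assuming a toy family of hash functions that differ only in their multiplier (a real implementation would want genuinely independent algorithms):

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 8
#define NHASHES  3  /* Hash, Hash2, Hash3 */

static const char *table[NBUCKETS];

/* Toy family of hashes: same loop, different multiplier each time. */
static size_t hash_n(const char *key, int n)
{
    static const size_t mult[NHASHES] = { 1, 31, 131 };
    size_t h = 0;
    while (*key)
        h = h * mult[n] + (unsigned char)*key++;
    return h % NBUCKETS;
}

/* Try Hash, then Hash2, then Hash3; give up after the last one. */
static int insert(const char *key)
{
    for (int n = 0; n < NHASHES; n++) {
        size_t h = hash_n(key, n);
        if (table[h] == NULL) {
            table[h] = key;
            return 1;
        }
    }
    return 0;  /* every hash collided; caller needs a fallback */
}

/* Lookup retries the same hash functions in the same order. */
static const char *lookup(const char *key)
{
    for (int n = 0; n < NHASHES; n++) {
        size_t h = hash_n(key, n);
        if (table[h] != NULL && strcmp(table[h], key) == 0)
            return table[h];
    }
    return NULL;
}
```

Unlike linear probing, a failed lookup here has to try every hash function before giving up, since the key might sit at any of its candidate buckets.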

3. I know, but it's not like it's always better. It all depends on the circumstances, so I am simply going to keep it simple and go with the chaining method.

4. Originally Posted by brewbuck
Chaining is only one way of dealing with collisions. Another option is linear probing -- if the hash of the element is H and table[H] is already occupied, then insert it at table[H+1] instead (and if that's filled too, go to the next and the next until an empty spot is found).

To generalize linear probing: if table[H] collides, set H = collide(H) and try again. Linear probing means collide(H) = H+1.

Or, multi-hashing: if Hash(element) is H and H collides, then use Hash2(element), and if that collides, go to Hash3(element), for as many hash functions as you've got.
The only hash tables I've implemented used either a linked list or a b-tree in the bucket, so I'm curious: how do you then find the item later? If I insert a key "abracadabra" and it gets multi-hashed, when someone retrieves "abracadabra", do you go through the same process (try hash1; if that doesn't return "abracadabra", try hash2, etc.)? Seems to me this is no advantage over a linked list, but it has a clear disadvantage (the number of multi-hashes must still be finite).

WRT linear probing, isn't it problematic? What if "abracadabra" wasn't actually added and someone asks for it? When do you know when to stop looking? At the end of the table?? So now instead of an O(1) lookup you have an O(n) scan of the array.

I presume there is a clever solution here.

5. Originally Posted by MK27
The only hash tables I've implemented used either a linked list or a b-tree in the bucket, so I'm curious: how do you then find the item later? If I insert a key "abracadabra" and it gets multi-hashed, when someone retrieves "abracadabra", do you go through the same process (try hash1; if that doesn't return "abracadabra", try hash2, etc.)?
Pretty much.

Seems to me this is no advantage over a linked list but has a clear disadvantage (the number of multi-hashes must still be finite).
There's no overhead of inserting into a linked list, and it's potentially faster, depending on the speed of hashing and, of course, on how well it avoids collisions.

WRT linear probing, isn't it problematic? What if "abracadabra" wasn't actually added and someone asks for it? When do you know when to stop looking? At the end of the table?? So now instead of an O(1) lookup you have an O(n) scan of the array.
I assume that's why linear probing is a dumb algorithm only worth studying for academic purposes, like insertion sort.
The same thing happens on insert: it can take forever (theoretically) to find an empty bucket.

6. Linear probing is rarely implemented in such a way as to result in actually linear performance. Expansion (increasing the number of buckets once the table gets too full) is a necessary evil for getting good performance out of simple linear probing, just as it would be for any hypothetical hash table that always expands on collision.

Repeated hashing for hash table implementations mostly comes in two forms.

"Double Hashing" (common) uses two entirely different hash algorithms. If the first search fails, the second hashing algorithm is applied repeatedly for subsequent attempts. If the number of tries equals the number of buckets you get loosely linear behavior, but in practice the number of attempts is significantly less than the number of buckets.
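One common formulation uses the second hash as the probe step rather than rehashing recursively; a sketch, where the two toy polynomial hashes and the prime table size are my assumptions, not anything from this post:

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 11  /* prime, so any nonzero step visits every bucket */

static const char *table[NBUCKETS];

/* Toy primary hash: picks the starting bucket. */
static size_t hash1(const char *key)
{
    size_t h = 0;
    while (*key)
        h = h * 31 + (unsigned char)*key++;
    return h % NBUCKETS;
}

/* Toy secondary hash: picks the probe step; must never be zero. */
static size_t hash2(const char *key)
{
    size_t h = 0;
    while (*key)
        h = h * 131 + (unsigned char)*key++;
    return 1 + h % (NBUCKETS - 1);
}

static int insert(const char *key)
{
    size_t h = hash1(key), step = hash2(key);
    for (size_t i = 0; i < NBUCKETS; i++) {
        if (table[h] == NULL) {
            table[h] = key;
            return 1;
        }
        h = (h + step) % NBUCKETS;  /* step size depends on the key */
    }
    return 0;  /* table is full */
}

static const char *lookup(const char *key)
{
    size_t h = hash1(key), step = hash2(key);
    for (size_t i = 0; i < NBUCKETS; i++) {
        if (table[h] == NULL)
            return NULL;  /* empty slot: the key isn't in the table */
        if (strcmp(table[h], key) == 0)
            return table[h];
        h = (h + step) % NBUCKETS;
    }
    return NULL;
}
```

Because the step size varies with the key, two keys that collide on their first bucket usually diverge immediately afterwards, which is what breaks up the clustering linear probing is prone to.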

"Multi-hashing" (uncommon) uses a hash algorithm that requires the hash to be finalized in some specific way. These implementations compute a specific number of candidate hashes, finalizing each one only if the previous probe doesn't succeed. Generally the values mixed in after processing the actual key are chosen to be uniformly distributed. The number of possibilities is generally fixed by the algorithm rather than by the number of buckets, so this is rarely linear.

In practice, most any hash table using simple probing may increment a bucket counter on rehashing to account for the existence of the key at a different location. Insertion pays a small constant cost on top of the already non-constant collision handling, which lets a search terminate in constant time when the key would never be found. Removal of a pair then requires walking the table, so this strategy isn't really acceptable for a general purpose hash table. It is possible with some repeated hashing strategies (mostly multi-hashing), but problematic or prohibitively expensive depending on the hash algorithms chosen.

That said, the above only applies to fairly simple strategies. If expansion occurs at a given load factor and the appropriate hash algorithms are used, constant-time search and insert can be achieved with linear probing.

In other words, linear probing and repeated hashing aren't the algorithm itself; they are strategies that guide the choice of algorithm. Those algorithms range from exceedingly simple (If a bucket isn't available, try the very next bucket.) to alarmingly complex (Use this family of hashing functions associated with this mutation box for this number of buckets.).

Soma

7. There are also quadratic probing and other forms of probing that address some of the shortcomings of linear probing while introducing their own set of unique issues.
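As a sketch of what the quadratic probe sequence looks like (the function name and fixed table size here are placeholders of mine):

```c
#include <stddef.h>

#define NBUCKETS 8

/* Quadratic probing: the i-th probe lands at (H + i*i) % NBUCKETS.
   Colliding keys fan out instead of piling into adjacent buckets,
   which reduces the primary clustering linear probing suffers from.
   One of its own unique issues: for arbitrary table sizes the
   sequence is not guaranteed to visit every bucket. */
static size_t quadratic_probe(size_t h, size_t i)
{
    return (h + i * i) % NBUCKETS;
}
```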

8. Yeah, I'm aware. Anyway, this project has served my needs now, so I'm archiving it.

9. On a side note, the most important advantage of probing hash maps (as opposed to chains or trees or other data structures rooted in the buckets) is their caching and allocation behavior. Dangling a linked list from a bucket means you have to allocate nodes, and those nodes end up scattered all over memory. With probing, the only allocations you make are for the core array itself.