Thread: hash tables

  1. #1
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72

    hash tables

    I'm looking to implement my own hash table system. All keys and values will be the same type (4 bytes). No shared keys/values; every key must be unique, and each key will have its own value. I was thinking I could make each key/value pair an 8-byte block of memory. The question is, should I make an array or a linked list? If I made a linked list it would have to be either 12 bytes for a single link or 16 bytes for a double link. What do you guys think?

  2. #2
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    I would probably make an array of singly-linked lists, or rather, the array is just of pointers to the list items. In fact, if I know I'm always going to have fewer than 65536 items, then rather than pointers I use 2-byte indexes into a second buffer, used as the linked-list storage area. The links themselves are then only 2 bytes, and the "list" is effectively pool allocated.
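
    Here's a minimal sketch of what that layout might look like, assuming a fixed pool of fewer than 65536 nodes and 0xFFFF reserved as the "null" index (all names and sizes here are illustrative, not a definitive implementation):
    Code:
    #include <stddef.h>
    #include <stdint.h>

    #define NBUCKETS 1024
    #define POOLSIZE 60000            /* must stay below 0xFFFF */
    #define NIL      0xFFFFu          /* 2-byte "null" index */

    struct node {
        void    *key;
        void    *value;
        uint16_t next;                /* 2-byte link: index into the pool */
    };

    static struct node pool[POOLSIZE];
    static uint16_t    heads[NBUCKETS];  /* bucket array of 2-byte indexes */
    static uint16_t    nused;            /* next free pool slot */

    static void table_init(void)
    {
        nused = 0;
        for (size_t i = 0; i < NBUCKETS; i++)
            heads[i] = NIL;           /* every chain starts empty */
    }

    /* push a pair onto the front of a bucket's chain; 0 means pool is full */
    static int bucket_push(size_t bucket, void *key, void *value)
    {
        if (nused >= POOLSIZE)
            return 0;
        pool[nused].key   = key;
        pool[nused].value = value;
        pool[nused].next  = heads[bucket];
        heads[bucket]     = nused;
        nused++;
        return 1;
    }
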
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  3. #3
    pwning noobs Zlatko's Avatar
    Join Date
    Jun 2009
    Location
    The Great White North
    Posts
    132
    I suppose the whole advantage of hash tables is to have constant-time access to your "buckets", so a linked list would not be right. You'd need an array. Just curious, how do you guarantee that multiple values will not hash to the same key?

  4. #4
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Just curious, how do you guarantee that multiple values will not hash to the same key?
    I'll answer that for you: he can't.

    @ITAmember: Before you start implementing your hash table (do it as an array, as iMalc and Zlatko suggested), you need to look into how you will handle hash collisions. The Wikipedia article on hash tables has explanations of some different techniques.

  5. #5
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72
    Yes you can check if it has the same key: loop through and check the keys, and if you find a match, edit the value instead of adding a new key/value pair.

  6. #6
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Except that with hash tables, you index via the hash, not the key. What you are talking about is not a hash table.

  7. #7
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72
    So what is it then?

  8. #8
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Just a customized data structure I guess. You would be using an array similar to a hash table, but obviously you can't call it a hash table if you're not creating hashes.

    Just out of curiosity, are your key values in a known range? One of the main reasons for creating hashes is to guarantee that your keys fall into a known range of 0..n-1, where n is the size of your array. If your key is an arbitrary 4-byte number, then your array would need to have over 4 billion indexes in it (which obviously would not work).

  9. #9
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72
    Thank you for clearing that up, then. The keys are not numbers, they are void pointers. So I need a dynamic array, not a linked list, right?

  10. #10
    pwning noobs Zlatko's Avatar
    Join Date
    Jun 2009
    Location
    The Great White North
    Posts
    132
    Are you doing this as an exercise to learn hash tables, or are you doing it professionally? I ask because you might not need the performance of a fancy data structure unless you have a large number of data elements. How you define large depends on the hardware you're running on. If I were doing it professionally, I'd try the simple linear search first, and if performance was below requirements, then I'd optimize. I'd put my implementation behind an interface so I could change it without breaking the code using it.

    If you're doing it to learn, then at least look at the graphics in the Wikipedia article. As bithub said, you cannot have an array large enough to hold all possible pointers. You need a hash function to map the pointer to some small array, and you handle collisions by hanging linked lists off of the array. For example, take a pointer, made of 4 bytes. Add the bytes together and take the result modulo 256. That is a hash function that maps all your pointers to a range of 0 to 255. You have an array of 256 "buckets" and each bucket has a linked list of key/value pairs. The idea is to find the bucket quickly, and hopefully not have to examine many key/value pairs before finding your data. Changing the hash function and the size of your bucket array will give different performance results.
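
    A minimal sketch of that byte-sum scheme, with the reduction done modulo 256 so the result indexes a 256-entry bucket array (the names are illustrative):
    Code:
    #include <stddef.h>
    #include <stdint.h>

    #define NBUCKETS 256

    /* toy hash: add up the bytes of the pointer value, reduce modulo 256 */
    static unsigned hash_ptr(void *p)
    {
        uintptr_t v = (uintptr_t)p;
        unsigned sum = 0;
        while (v) {
            sum += (unsigned)(v & 0xFF);   /* take one byte */
            v >>= 8;
        }
        return sum % NBUCKETS;             /* bucket index 0..255 */
    }

    struct pair {
        void        *key;
        void        *value;
        struct pair *next;                 /* chain for colliding keys */
    };

    static struct pair *buckets[NBUCKETS];

    /* look up a key: find its bucket, then walk the (hopefully short) chain */
    static void *lookup(void *key)
    {
        for (struct pair *p = buckets[hash_ptr(key)]; p; p = p->next)
            if (p->key == key)
                return p->value;
        return NULL;                       /* not found */
    }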

    Imagine a worst-case scenario: all your keys hash to one bucket. Then all your data is in one linked list. That's back to a linear search.

    When you make your hash table you should build in the ability to monitor the size of the linked lists hanging off the buckets. They should all be of similar length and not too long. If some are very long, and others are empty, your hash function is not good. If they're all long, but uniform, you need a bigger bucket array and a different hash function.
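
    Continuing the sketch above (and reusing its NBUCKETS, buckets, and struct pair), that monitoring can be a simple walk over the buckets:
    Code:
    #include <stdint.h>
    #include <stdio.h>

    /* Walk every bucket and report the shortest, longest, and average
       chain length. A big spread between min and max suggests a bad
       hash function; uniformly long chains suggest too few buckets. */
    static void report_chains(void)
    {
        size_t min = SIZE_MAX, max = 0, total = 0;
        for (size_t i = 0; i < NBUCKETS; i++) {
            size_t len = 0;
            for (struct pair *p = buckets[i]; p; p = p->next)
                len++;
            if (len < min) min = len;
            if (len > max) max = len;
            total += len;
        }
        printf("chains: min=%zu max=%zu avg=%.2f\n",
               min, max, (double)total / NBUCKETS);
    }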

  11. #11
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72
    I see what you mean, I wasn't sure what a hash table was. Anyway, thanks. I think I can take it from here.

  12. #12
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    When you make your hash table you should build in the ability to monitor the size of the linked lists hanging off the buckets. They should all be of similar length and not too long. If some are very long, and others are empty, your hash function is not good. If they're all long, but uniform, you need a bigger bucket array and a different hash function.
    Also, make sure you're not getting clusters of hashes*, since that'll slow insertion/lookup depending on your collision resolution (say, linear probing). Most implementations I've seen dub the hash table "useless" at 0.5-0.75 load; the theory behind it is vital.

    * Since in the grand scheme of things the addresses in your void pointers are probably going to be very close to one another.

    On a side note, you could also look at other alternatives such as tree maps, which offer O(log N) lookups and insertions in comparison to O(1) for hash tables, but they're less wasteful and come with the advantage of being "sorted", if that helps.
    Last edited by zacs7; 06-29-2009 at 10:49 PM.

  13. #13
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    On a side note, you could also look at other alternatives such as tree-maps, which offer O(log N) lookups and insertions in comparison to O(1) for hash tables, but they're less wasteful and come with the advantage of being "sorted" if that helps
    Tree maps only have O(log N) lookups and insertions if a balanced binary tree implementation is used. These are non-trivial to implement, and still work better if the keys are hashed (which removes the advantage of the map being sorted). For a good balanced binary tree implementation (if you want to go this route), I suggest using a red-black tree. This is the same tree implementation used by gcc to implement std::map. Prelude wrote a pretty good tutorial on red-black trees here. In my opinion, the main advantage to using a tree instead of a hash table is that you don't have to worry about growing the table once you hit a certain threshold. Saving some space is always nice too.

  14. #14
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by bithub View Post
    I'll answer that for you: he can't.
    Depends how you define "hash table." Since the keys are 32-bit values, an array of 2^32 hash buckets, with the hash function being the identity function, would suffice.

    Of course, that would require 32 gigabytes of memory, so it may not be the best solution.

    Any good hash function on the integers should be a bijection (which means the hash function is collision free), and should not map adjacent key values to adjacent hash values. (Look here for a good summary of integer hash functions.) The collisions only happen because you limit the number of buckets.
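
    One classic example of such a bijection is multiplicative hashing with an odd constant; this is Knuth's method, shown here as an illustration, and not necessarily the scheme that summary recommends:
    Code:
    #include <stdint.h>

    /* Multiplying by an odd constant is a bijection on 32-bit integers
       (odd constants are invertible mod 2^32), so distinct keys give
       distinct hashes; collisions only appear when the result is folded
       down to a small bucket count. */
    static uint32_t hash_u32(uint32_t key)
    {
        return key * 2654435761u;          /* 2^32 / golden ratio, odd */
    }

    /* Fold into 2^bits buckets (1 <= bits <= 31) by keeping the high
       bits, which are the best-mixed ones for a multiplicative hash. */
    static uint32_t bucket_of(uint32_t key, unsigned bits)
    {
        return hash_u32(key) >> (32 - bits);
    }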

    The two possibilities for resolving bucket collisions are bucket chaining (the linked list which was already mentioned) and probing, where you use some fixed rule to probe the buckets until you either find the key you're looking for, or you find an empty bucket to insert a new value.

    In the case of bucket chaining, you must monitor the length of the chains and re-hash when any single chain becomes too long, where "too long" is an engineering decision.

    In the case of hash probing, the optimal probing strategy depends on the target density of values in the table.

    Chaining has the advantage of quicker insertion, because it takes O(1) time to allocate a node and add it to the chain. As far as lookup, in either case you must search the chain or probe the table to find the key.

    So if insertion time is more important than lookup time (unusual), chaining might be a better choice. On the other hand, chaining adds at least an extra pointer's worth of memory to each hash table entry, as well as any dynamic memory overhead (this can be made zero by designing a custom allocator). It also makes it less likely that the entire hash table will reside in cache during repeated lookups.
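
    For the probing side, here is a minimal sketch of linear probing, the simplest fixed rule; the table size and hash callback are illustrative, and deletion (which needs tombstones) is omitted:
    Code:
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024                /* power of two, illustrative */

    struct slot {
        void *key;                         /* NULL marks an empty slot */
        void *value;
    };

    static struct slot table[TABLE_SIZE];

    /* Start at the hashed slot and step forward until we find the key or
       an empty slot. The caller must keep the load factor well below 1.0,
       otherwise probing for a missing key in a full table never stops. */
    static struct slot *probe(void *key, uint32_t (*hash)(void *))
    {
        size_t i = hash(key) & (TABLE_SIZE - 1);
        while (table[i].key != NULL && table[i].key != key)
            i = (i + 1) & (TABLE_SIZE - 1);
        return &table[i];   /* a match, or the empty slot to insert into */
    }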

    Decisions, decisions. What's the application?
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  15. #15
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by bithub View Post
    For a good balanced binary tree implementation (If you want to go this route), I suggest using a Red-Black tree. This is the same tree implementation used by gcc to implement std::map.
    While it's true that any tree with a constant branching factor can perform insertion/lookup in O(log n), there is still a constant factor hiding there which isn't really negligible in practice. For evenly distributed sparse sets of integers it can be better to use a tree of higher branching factor, e.g. a radix tree or some other.

    EDIT: Eh, a radix tree wasn't exactly what I meant. My point is that a tree can be a good alternative to a hash, but it doesn't necessarily have to be a binary tree.
    Last edited by brewbuck; 06-30-2009 at 10:50 AM.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}
