Thread: hash tables

  1. #1
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72

    hash tables

    I'm looking to implement my own hash table system. All keys and values will be the same type (4 bytes). No shared keys/values; every key must be unique, and each key will have its own value. I was thinking I could make each key/value pair an 8-byte block of memory. The question is, should I make an array or a linked list? If I made a linked list it would have to be either 12 bytes for a single link or 16 bytes for a double link. What do you guys think?

  2. #2
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    I would probably make an array of singly-linked lists, or rather, the array is just of pointers to the list items. In fact, if I know I'm always going to have fewer than 65536 items, then rather than pointers I use 2-byte indexes into a second buffer, used as the linked-list storage area. The links themselves are then only 2 bytes, and the "list" is effectively pool allocated.
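
    Here's a minimal sketch of what that layout might look like, assuming a fixed pool of fewer than 65536 nodes and 0xFFFF reserved as the "null" index (all names and sizes here are illustrative, not a definitive implementation):
    Code:
    #include <stddef.h>
    #include <stdint.h>

    #define NBUCKETS 1024
    #define POOLSIZE 60000            /* must stay below 0xFFFF */
    #define NIL      0xFFFFu          /* 2-byte "null" index */

    struct node {
        void    *key;
        void    *value;
        uint16_t next;                /* 2-byte link: index into the pool */
    };

    static struct node pool[POOLSIZE];
    static uint16_t    heads[NBUCKETS];  /* bucket array of 2-byte indexes */
    static uint16_t    nused;            /* next free pool slot */

    static void table_init(void)
    {
        nused = 0;
        for (size_t i = 0; i < NBUCKETS; i++)
            heads[i] = NIL;           /* every chain starts empty */
    }

    /* push a pair onto the front of a bucket's chain; 0 means pool is full */
    static int bucket_push(size_t bucket, void *key, void *value)
    {
        if (nused >= POOLSIZE)
            return 0;
        pool[nused].key   = key;
        pool[nused].value = value;
        pool[nused].next  = heads[bucket];
        heads[bucket]     = nused;
        nused++;
        return 1;
    }
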
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  3. #3
    pwning noobs Zlatko's Avatar
    Join Date
    Jun 2009
    Location
    The Great White North
    Posts
    132
    I suppose the whole advantage of hash tables is to have constant-time access to your "buckets", so a linked list would not be right. You'd need an array. Just curious, how do you guarantee that multiple values will not hash to the same key?

  4. #4
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Just curious, how do you guarantee that multiple values will not hash to the same key?
    I'll answer that for you: he can't.

    @ITAmember: Before you start implementing your hash table (do it as an array, as iMalc and Zlatko suggested), you need to look into how you will handle hash collisions. The Wikipedia article on hash tables has explanations of some different techniques.

  5. #5
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72
    Yes you can check if it has the same key: loop through and check the keys, and if you find a match, edit the value instead of adding a new key/value pair.

  6. #6
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Except that with hash tables, you index via the hash, not the key. What you are talking about is not a hash table.

  7. #7
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72
    So what is it then?

  8. #8
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Just a customized data structure I guess. You would be using an array similar to a hash table, but obviously you can't call it a hash table if you're not creating hashes.

    Just out of curiosity, are your key values in a known range? One of the main reasons for creating hashes is to guarantee that your keys fall into a known range of 0..n-1, where n is the size of your array. If your key is an arbitrary 4-byte number, then your array would need to have over 4 billion indexes in it (which obviously would not work).

  9. #9
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72
    Thank you for clearing that up, then. The keys are not numbers, they are void pointers. So I need a dynamic array, not a linked list, right?

  10. #10
    pwning noobs Zlatko's Avatar
    Join Date
    Jun 2009
    Location
    The Great White North
    Posts
    132
    Are you doing this as an exercise to learn hash tables, or are you doing it professionally? I ask because you might not need the performance of a fancy data structure unless you have a large number of data elements. How you define large depends on the hardware you're running on. If I were doing it professionally, I'd try the simple linear search first, and if performance was below requirements, then I'd optimize. I'd put my implementation behind an interface so I could change it without breaking the code using it.

    If you're doing it to learn, then at least look at the graphics in the Wikipedia article. As bithub said, you cannot have an array large enough to hold all possible pointers. You need a hash function to map the pointer to some small array, and you handle collisions by hanging linked lists off of the array. For example, take a pointer, made of 4 bytes. Add the bytes together and take the result modulo 256. That is a hash function that maps all your pointers to a range of 0 to 255. You have an array of 256 "buckets" and each bucket has a linked list of key/value pairs. The idea is to find the bucket quickly, and hopefully not have to examine many key/value pairs before finding your data. Changing the hash function and the size of your bucket array will give different performance results.
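
    A minimal sketch of that byte-sum scheme, with the reduction done modulo 256 so the result indexes a 256-entry bucket array (the names are illustrative):
    Code:
    #include <stddef.h>
    #include <stdint.h>

    #define NBUCKETS 256

    /* toy hash: add up the bytes of the pointer value, reduce modulo 256 */
    static unsigned hash_ptr(void *p)
    {
        uintptr_t v = (uintptr_t)p;
        unsigned sum = 0;
        while (v) {
            sum += (unsigned)(v & 0xFF);   /* take one byte */
            v >>= 8;
        }
        return sum % NBUCKETS;             /* bucket index 0..255 */
    }

    struct pair {
        void        *key;
        void        *value;
        struct pair *next;                 /* chain for colliding keys */
    };

    static struct pair *buckets[NBUCKETS];

    /* look up a key: find its bucket, then walk the (hopefully short) chain */
    static void *lookup(void *key)
    {
        for (struct pair *p = buckets[hash_ptr(key)]; p; p = p->next)
            if (p->key == key)
                return p->value;
        return NULL;                       /* not found */
    }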

    Imagine a worst-case scenario: all your keys hash to one bucket. Then all your data is in one linked list. That's back to a linear search.

    When you make your hash table you should build in the ability to monitor the size of the linked lists hanging off the buckets. They should all be of similar length and not too long. If some are very long, and others are empty, your hash function is not good. If they're all long, but uniform, you need a bigger bucket array and a different hash function.
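
    Continuing the sketch above (and reusing its NBUCKETS, buckets, and struct pair), that monitoring can be a simple walk over the buckets:
    Code:
    #include <stdint.h>
    #include <stdio.h>

    /* Walk every bucket and report the shortest, longest, and average
       chain length. A big spread between min and max suggests a bad
       hash function; uniformly long chains suggest too few buckets. */
    static void report_chains(void)
    {
        size_t min = SIZE_MAX, max = 0, total = 0;
        for (size_t i = 0; i < NBUCKETS; i++) {
            size_t len = 0;
            for (struct pair *p = buckets[i]; p; p = p->next)
                len++;
            if (len < min) min = len;
            if (len > max) max = len;
            total += len;
        }
        printf("chains: min=%zu max=%zu avg=%.2f\n",
               min, max, (double)total / NBUCKETS);
    }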

  11. #11
    Registered User ITAmember's Avatar
    Join Date
    Mar 2009
    Location
    Nebraska
    Posts
    72
    I see what you mean, I wasn't sure what a hash table was. Anyway, thanks. I think I can take it from here.

  12. #12
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    When you make your hash table you should build in the ability to monitor the size of the linked lists hanging off the buckets. They should all be of similar length and not too long. If some are very long, and others are empty, your hash function is not good. If they're all long, but uniform, you need a bigger bucket array and a different hash function.
    Also, make sure you're not getting clusters of hashes*, since that'll slow insertion/lookup depending on your collision resolution (say, linear probing). Most implementations I've seen dub the hash table "useless" at 0.5-0.75 load; the theory behind it is vital.

    * Since in the grand scheme of things the addresses in your void pointers are probably going to be very close to one another.

    On a side note, you could also look at other alternatives such as tree maps, which offer O(log N) lookups and insertions in comparison to O(1) for hash tables, but they're less wasteful and come with the advantage of being "sorted", if that helps.
    Last edited by zacs7; 06-29-2009 at 10:49 PM.

  13. #13
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    On a side note, you could also look at other alternatives such as tree-maps, which offer O(log N) lookups and insertions in comparison to O(1) for hash tables, but they're less wasteful and come with the advantage of being "sorted" if that helps
    Tree maps only have O(log N) lookups and insertions if a balanced binary tree implementation is used. These are non-trivial to implement, and still work better if the keys are hashed (which removes the advantage of the map being sorted). For a good balanced binary tree implementation (if you want to go this route), I suggest using a red-black tree. This is the same tree implementation used by gcc to implement std::map. Prelude wrote a pretty good tutorial on red-black trees here. In my opinion, the main advantage to using a tree instead of a hash table is that you don't have to worry about growing the table once you hit a certain threshold. Saving some space is always nice too.

  14. #14
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by bithub View Post
    I'll answer that for you: he can't.
    Depends how you define "hash table." Since the keys are 32-bit values, an array of 2^32 hash buckets, with the hash function being the identity function, would suffice.

    Of course, that would require 32 gigabytes of memory, so it may not be the best solution.

    Any good hash function on the integers should be a bijection (which means the hash function is collision free), and should not map adjacent key values to adjacent hash values. (Look here for a good summary of integer hash functions.) The collisions only happen because you limit the number of buckets.
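
    One classic example of such a bijection is multiplicative hashing with an odd constant; this is Knuth's method, shown here as an illustration, and not necessarily the scheme that summary recommends:
    Code:
    #include <stdint.h>

    /* Multiplying by an odd constant is a bijection on 32-bit integers
       (odd constants are invertible mod 2^32), so distinct keys give
       distinct hashes; collisions only appear when the result is folded
       down to a small bucket count. */
    static uint32_t hash_u32(uint32_t key)
    {
        return key * 2654435761u;          /* 2^32 / golden ratio, odd */
    }

    /* Fold into 2^bits buckets (1 <= bits <= 31) by keeping the high
       bits, which are the best-mixed ones for a multiplicative hash. */
    static uint32_t bucket_of(uint32_t key, unsigned bits)
    {
        return hash_u32(key) >> (32 - bits);
    }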

    The two possibilities for resolving bucket collisions are bucket chaining (the linked list which was already mentioned) and probing, where you use some fixed rule to probe the buckets until you either find the key you're looking for, or you find an empty bucket to insert a new value.

    In the case of bucket chaining, you must monitor the length of the chains and re-hash when any single chain becomes too long, where "too long" is an engineering decision.

    In the case of hash probing, the optimal probing strategy depends on the target density of values in the table.

    Chaining has the advantage of quicker insertion, because it takes O(1) time to allocate a node and add it to the chain. As far as lookup, in either case you must search the chain or probe the table to find the key.

    So if insertion time is more important than lookup time (unusual), chaining might be a better choice. On the other hand, chaining adds at least an extra pointer's worth of memory to each hash table entry, as well as any dynamic memory overhead (this can be made zero by designing a custom allocator). It also makes it less likely that the entire hash table will reside in cache during repeated lookups.
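
    For the probing side, here is a minimal sketch of linear probing, the simplest fixed rule; the table size and hash callback are illustrative, and deletion (which needs tombstones) is omitted:
    Code:
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024                /* power of two, illustrative */

    struct slot {
        void *key;                         /* NULL marks an empty slot */
        void *value;
    };

    static struct slot table[TABLE_SIZE];

    /* Start at the hashed slot and step forward until we find the key or
       an empty slot. The caller must keep the load factor well below 1.0,
       otherwise probing for a missing key in a full table never stops. */
    static struct slot *probe(void *key, uint32_t (*hash)(void *))
    {
        size_t i = hash(key) & (TABLE_SIZE - 1);
        while (table[i].key != NULL && table[i].key != key)
            i = (i + 1) & (TABLE_SIZE - 1);
        return &table[i];   /* a match, or the empty slot to insert into */
    }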

    Decisions, decisions. What's the application?
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  15. #15
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by bithub View Post
    For a good balanced binary tree implementation (If you want to go this route), I suggest using a Red-Black tree. This is the same tree implementation used by gcc to implement std::map.
    While it's true that any tree with a constant branching factor can perform insertion/lookup in O(log n), there is still a constant factor hiding there which isn't really negligible in practice. For evenly distributed sparse sets of integers it can be better to use a tree of higher branching factor, e.g. a radix tree or some other.

    EDIT: Eh, a radix tree wasn't exactly what I meant. My point is that a tree can be a good alternative to a hash, but it doesn't necessarily have to be a binary tree.
    Last edited by brewbuck; 06-30-2009 at 10:50 AM.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}
