Originally Posted by
rcgldr
So each element within the hash table array is some form of linked list structure?
I assume the OP is trying to find examples on how to implement separate chaining. [Ah yes, noted in a later post, after I started writing this one.]
The implementation is straightforward. My preferred one is using a structure with a copy of the hash value (especially when the key is a string), so that values do not need to be rehashed when resizing the hash table. If the hash values are strings, I also include the length of the string, it comes in handy that often. In all, a typical implementation for hashed string might look like
Code:
struct hash_entry {
struct hash_entry *next;
size_t size;
unsigned long hash;
char *data;
};
typedef struct {
size_t size; /* Number of entries in 'entry' array */
size_t used; /* Number of hashed items */
struct hash_entry **entry;
} hash_table_t;
If the hash table size is fixed, and at most one thread at a time removes entries from the table, you can use compiler-provided atomic built-ins for the operations to be lockless. (Multiple concurrent removes must be ordered, so usually you have a mutex that is only taken for removes.)
For single-threaded applications, or when the hash table itself is protected by a mutex (but remember that means all multithreaded accesses will be sequential, no concurrency), you can use the used value to grow the table when necessary.
The index in the hash table is always (hash % table->size). For ASCII data, a fast and simple hash function I like is DJB2 XOR-variant hash,
Code:
unsigned long hash(const void *const data, const size_t size)
{
const unsigned char *p = (const unsigned char *)data;
const unsigned char *const q = (const unsigned char *)data + size;
unsigned long h = 5381UL;
while (p < q)
h = (h * 33UL) ^ (unsigned long)(*(p++));
return h;
}
There are many who complain about it, saying e.g. how one should hash four characters at a time. Sure, but strings are rarely aligned to suitable boundaries with zero paading, and on many architectures, unaligned accesses are slow. A hash that yields good results fast has to be more complicated. The above is simple and good and fast enough for many use cases.
You do not need to save the hash value itself in the hash entry structure, but you can use it to make searches much faster in the cases when there are multiple elements in the chain. For example, ignoring the atomic accesses needed for multithreaded lockless operation, a find operation is just:
Code:
struct hash_entry *find_hashed(const hash_table_t *const table, const void *const data, const size_t size)
{
const unsigned long h = hash(data, size);
struct hash_entry *entry = table->entry[h % table->size];
while (entry)
if (entry->hash != h || entry->size != size ||
memcmp(entry->data, data, size))
entry = entry->next;
else
return entry;
return NULL;
}
DJB2 hash produces small hashes for very short strings, but if the table is relatively small, say 1000 entries, and the strings at least dozen characters (for 64-bit unsigned longs), then it is very unlikely you ever have to compare any non-matching strings, because most entries typically have an unique hash. (Their modulus to current hash table size just happens to be the same, that's why they're in the same hash table entry chain.)
As you can see, the data key is first hashed, and the slot in the hash table probed. If the slot is not empty, it's treated as a chain. The chain is scanned, until an entry that matches the hash, size, and the data key is found. Remember that C uses short-circuit evaluation, so the memcmp() is only done if both hash and size already match.
If you use a high load factor, i.e. relatively small table size compared to number of elements used (or if there are lots of strings of the same length that hash the same, and you for some reason cannot change the hash function), then you can sort the entries in the chain. The simplest way is to just insert new entries in the sorted position in the chain. However, normally this is not worth the effort, as most chains are at most a few entries long.
If elements are not in sorted order, then I personally tend to just prepend new hash entries to their respective chains.
Questions?