# string hashing algorithm

This is a discussion on string hashing algorithm within the Tech Board forums, part of the Community Boards category; I need an algorithm to hash strings (titles of wikipedia articles). How does this look? Any particular weaknesses? (this is ...

1. ## string hashing algorithm

I need an algorithm to hash strings (titles of wikipedia articles). How does this look? Any particular weaknesses?

(this is inspired by the Zobrist hashing algorithm Zobrist hashing - Wikipedia, the free encyclopedia)

Code:
```unsigned int keys[256][MAX_STRING_LENGTH]; //filled with randomly generated numbers

unsigned int hash_string(std::string s) {

unsigned int hash = 0;

for (size_t i = 0; i < s.length(); ++i) {
hash ^= zobrist_keys[s[i]][i];
}

return hash;
}```
There are about 4000000 articles (and hence titles), so I thought 32-bit hash should be good enough?

2. Looks like a decent hash to me. But depending how big MAX_STRING_LENGTH is it might not be very cache efficient.

With a small change:

Code:
```#define MAX_ZOB_SIZE ???

unsigned int keys[256][MAX_ZOB_SIZE];

unsigned int hash_string(std::string s) {

unsigned int hash = 0;

for (size_t i = 0; i < s.length(); ++i) {
hash ^= zobrist_keys[s[i]][i % MAX_ZOB_SIZE];
}

return hash;
}```
You are not limited to a maximum string length, and you can control how big each Zobrist bucket is. You might find that smaller MAX_ZOB_SIZE gives better performance without hurting the hash all that much.

EDIT: Other notes... You're passing s by value, which is gonna hurt. And the compiler might optimize the comparison with s.length(), but maybe not..

3. Ah I see.

Thanks. That makes sense.

I just Googled for some existing code, and found this -
http://www.cse.yorku.ca/~oz/hash.html

djb2
this algorithm (k=33) was first reported by dan bernstein many years ago in comp.lang.c. another version of this algorithm (now favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Code:
```    unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;

while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}```
I don't usually like to use code that I don't understand (how good the hash is), but this one is just so much more convenient... and easy on the cache, too.

4. This is the hashing algorithm I've been using. It is the same as bernstein's except it uses XOR (like the comment in the link suggests).

Code:
```unsigned int
get_hash(const char* s)
{
unsigned int hash = 0;
int c;

while((c = *s++))
{
/* hash = hash * 33 ^ c */
hash = ((hash << 5) + hash) ^ c;
}

return hash;
}```

5. The DJB2 hash (which is what has been posted here) is an excellent hash algorithm.