# string hashing algorithm

• 04-10-2009
cyberfish
string hashing algorithm
I need an algorithm to hash strings (titles of wikipedia articles). How does this look? Any particular weaknesses?

(this is inspired by the Zobrist hashing algorithm Zobrist hashing - Wikipedia, the free encyclopedia)

Code:

```unsigned int keys[256][MAX_STRING_LENGTH]; //filled with randomly generated numbers unsigned int hash_string(std::string s) {                 unsigned int hash = 0;                 for (size_t i = 0; i < s.length(); ++i) {                 hash ^= zobrist_keys[s[i]][i];         }                 return hash; }```
There are about 4000000 articles (and hence titles), so I thought 32-bit hash should be good enough?
• 04-10-2009
brewbuck
Looks like a decent hash to me. But depending how big MAX_STRING_LENGTH is it might not be very cache efficient.

With a small change:

Code:

```#define MAX_ZOB_SIZE ??? unsigned int keys[256][MAX_ZOB_SIZE]; unsigned int hash_string(std::string s) {         unsigned int hash = 0;                 for (size_t i = 0; i < s.length(); ++i) {                 hash ^= zobrist_keys[s[i]][i % MAX_ZOB_SIZE];         }         return hash; }```
You are not limited to a maximum string length, and you can control how big each Zobrist bucket is. You might find that smaller MAX_ZOB_SIZE gives better performance without hurting the hash all that much.

EDIT: Other notes... You're passing s by value, which is gonna hurt. And the compiler might optimize the comparison with s.length(), but maybe not..
• 04-10-2009
cyberfish
Ah I see.

Thanks. That makes sense.

I just Googled for some existing code, and found this -
http://www.cse.yorku.ca/~oz/hash.html

Quote:

djb2
this algorithm (k=33) was first reported by dan bernstein many years ago in comp.lang.c. another version of this algorithm (now favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Code:

```    unsigned long     hash(unsigned char *str)     {         unsigned long hash = 5381;         int c;         while (c = *str++)             hash = ((hash << 5) + hash) + c; /* hash * 33 + c */         return hash;     }```

I don't usually like to use code that I don't understand (how good the hash is), but this one is just so much more convenient... and easy on the cache, too.
• 04-10-2009
bithub
This is the hashing algorithm I've been using. It is the same as bernstein's except it uses XOR (like the comment in the link suggests).

Code:

```unsigned int get_hash(const char* s) {     unsigned int hash = 0;     int c;     while((c = *s++))     {         /* hash = hash * 33 ^ c */         hash = ((hash << 5) + hash) ^ c;     }     return hash; }```
• 04-11-2009
VirtualAce
The DJB2 hash (which is what has been posted here) is an excellent hash algorithm.