Thread: string hashing algorithm

  1. #1
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229

    string hashing algorithm

    I need an algorithm to hash strings (titles of wikipedia articles). How does this look? Any particular weaknesses?

    (this is inspired by the Zobrist hashing algorithm Zobrist hashing - Wikipedia, the free encyclopedia)

    Code:
    unsigned int keys[256][MAX_STRING_LENGTH]; //filled with randomly generated numbers
    
    unsigned int hash_string(std::string s) {
    	
    	unsigned int hash = 0;
    	
    	for (size_t i = 0; i < s.length(); ++i) {
    		hash ^= zobrist_keys[s[i]][i];
    	}
    	
    	return hash;
    }
    There are about 4000000 articles (and hence titles), so I thought 32-bit hash should be good enough?

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Looks like a decent hash to me. But depending how big MAX_STRING_LENGTH is it might not be very cache efficient.

    With a small change:

    Code:
    #define MAX_ZOB_SIZE ???
    
    unsigned int keys[256][MAX_ZOB_SIZE];
    
    unsigned int hash_string(std::string s) {
    
    	unsigned int hash = 0;
    	
    	for (size_t i = 0; i < s.length(); ++i) {
    		hash ^= zobrist_keys[s[i]][i % MAX_ZOB_SIZE];
    	}
    
            return hash;
    }
    You are not limited to a maximum string length, and you can control how big each Zobrist bucket is. You might find that smaller MAX_ZOB_SIZE gives better performance without hurting the hash all that much.

    EDIT: Other notes... You're passing s by value, which is gonna hurt. And the compiler might optimize the comparison with s.length(), but maybe not..
    Last edited by brewbuck; 04-10-2009 at 03:54 PM.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  3. #3
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Ah I see.

    Thanks. That makes sense.

    I just Googled for some existing code, and found this -
    http://www.cse.yorku.ca/~oz/hash.html

    djb2
    this algorithm (k=33) was first reported by dan bernstein many years ago in comp.lang.c. another version of this algorithm (now favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
    Code:
        unsigned long
        hash(unsigned char *str)
        {
            unsigned long hash = 5381;
            int c;
    
            while (c = *str++)
                hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    
            return hash;
        }
    I don't usually like to use code that I don't understand (how good the hash is), but this one is just so much more convenient... and easy on the cache, too.
    Last edited by cyberfish; 04-10-2009 at 04:09 PM.

  4. #4
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    This is the hashing algorithm I've been using. It is the same as bernstein's except it uses XOR (like the comment in the link suggests).

    Code:
    unsigned int
    get_hash(const char* s)
    {
        unsigned int hash = 0;
        int c;
    
        while((c = *s++))
        {
            /* hash = hash * 33 ^ c */
            hash = ((hash << 5) + hash) ^ c;
        }
    
        return hash;
    }

  5. #5
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    The DJB2 hash (which is what has been posted here) is an excellent hash algorithm.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 8
    Last Post: 04-25-2008, 02:45 PM
  2. Linked List Help
    By CJ7Mudrover in forum C Programming
    Replies: 9
    Last Post: 03-10-2004, 10:33 PM
  3. Classes inheretance problem...
    By NANO in forum C++ Programming
    Replies: 12
    Last Post: 12-09-2002, 03:23 PM
  4. creating class, and linking files
    By JCK in forum C++ Programming
    Replies: 12
    Last Post: 12-08-2002, 02:45 PM
  5. Again Character Count, Word Count and String Search
    By client in forum C Programming
    Replies: 2
    Last Post: 05-09-2002, 11:40 AM