String hashing

**rags123** · 04-03-2010

hey guys. i have been trying to implement the dictionary(english-english) using hashing which will ofcourse be having words stored. the problem i am currently facing in the insertion is the hash key collisions.i can generate the random hash key to every subject BUT when retrieving it will be impossible to retrieve a particular word.so if you guys could suggest me the formula to generate the UNIQUE hash key to every new word

Regards

**MK27** · 04-03-2010

You don't. I suppose it is possible but it would probably mean the hash table has to be quite a bit bigger than the number of inputs, unless the inputs are intentionally constructed to contain a sequential series of unique values with no gaps in the series. $1 million says the dictionary cannot be reduced to such a series.

Hash key collisions are not something to avoid; they are something to reasonably minimize but also to recognize and deal with. One effective, common way to do this is to use a linked list for each bucket/key: all words with the same key get put into a linked list together. Having to iterate thru a list of 100 words is a significant improvement over iterating thru a list of 100000, esp considering you can keep the list sorted. So the hash table has done it's job. With a specific dictionary, you could also do some analysis and arrange B-trees instead of linked lists, etc, or even make the hash table an array of hash tables of linked lists, where the inner tables use a different modulus than the main table.

Here's part of the distribution of words in a %500 key has table using an english dictionary of 80,000 words (using a sum of ascii values):

Code:

So you can see at the high end here we have key 357 which contains 243 words and at the low end 319 which contains 4 words. The biggest lists in this table have more than 600 words, but there are still a few dozen keys with 0 words, right in the middle of the table.

See the problem this points to? The greater the size of the table, the sparser the distribution. Also, using the sum of ascii values (actually this one is ascii-32, since it was generically intended for keys that could contain spaces and punctuation), there is no point using a table with much more than 1000 keys. At 1000 keys, the last 350 of them are empty and there are still lists with over 600 words.

But sorting and search a 600 word list is trivial anyway. Like I said, you could refine this and customize it to your dictionary with a two level hash and b-trees, I have no idea of the distribution you'd end up with but if were, eg, two modulus ops to a b-tree with <20 members that is effectively instantaneous.

You should also be able to taylor the hash function to the dictionary beyond just a sum of ascii values, eg, you could weight each position to create a greater possible array of keys, but don't aim for anything beyond 5% of the input size IMO. Even 5% is a lot. Hashing is a divide, then conquer scheme, not divide == conquer.

Thread: String hashing

Thread Tools

Search Thread

Display

String hashing

Similar Threads

char Handling, probably typical newbie stuff

String editor for a sentence inputted by a user - any suggestions or ideas?

Classes inheretance problem...

creating class, and linking files

Warnings, warnings, warnings?