Thread: String hashing

  1. #1
    Registered User
    Join Date
    Mar 2010
    Posts
    6

    String hashing

    hey guys. i have been trying to implement the dictionary(english-english) using hashing which will ofcourse be having words stored. the problem i am currently facing in the insertion is the hash key collisions.i can generate the random hash key to every subject BUT when retrieving it will be impossible to retrieve a particular word.so if you guys could suggest me the formula to generate the UNIQUE hash key to every new word

    Regards

  2. #2
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    You don't. I suppose it is possible but it would probably mean the hash table has to be quite a bit bigger than the number of inputs, unless the inputs are intentionally constructed to contain a sequential series of unique values with no gaps in the series. $1 million says the dictionary cannot be reduced to such a series.

    Hash key collisions are not something to avoid; they are something to reasonably minimize but also to recognize and deal with. One effective, common way to do this is to use a linked list for each bucket/key: all words with the same key get put into a linked list together. Having to iterate thru a list of 100 words is a significant improvement over iterating thru a list of 100000, esp considering you can keep the list sorted. So the hash table has done it's job. With a specific dictionary, you could also do some analysis and arrange B-trees instead of linked lists, etc, or even make the hash table an array of hash tables of linked lists, where the inner tables use a different modulus than the main table.

    Here's part of the distribution of words in a %500 key has table using an english dictionary of 80,000 words (using a sum of ascii values):
    Code:
    318 8
    319 4
    320 5
    321 10
    322 10
    323 15
    324 11
    325 18
    326 15
    327 30
    328 36
    329 33
    330 34
    331 57
    332 43
    333 65
    334 67
    335 92
    336 85
    337 82
    338 101
    339 107
    340 93
    341 116
    342 127
    343 128
    344 143
    345 166
    346 163
    347 187
    348 165
    349 182
    350 192
    351 215
    352 197
    353 220
    354 214
    355 226
    356 199
    357 243
    So you can see at the high end here we have key 357 which contains 243 words and at the low end 319 which contains 4 words. The biggest lists in this table have more than 600 words, but there are still a few dozen keys with 0 words, right in the middle of the table.

    See the problem this points to? The greater the size of the table, the sparser the distribution. Also, using the sum of ascii values (actually this one is ascii-32, since it was generically intended for keys that could contain spaces and punctuation), there is no point using a table with much more than 1000 keys. At 1000 keys, the last 350 of them are empty and there are still lists with over 600 words.

    But sorting and search a 600 word list is trivial anyway. Like I said, you could refine this and customize it to your dictionary with a two level hash and b-trees, I have no idea of the distribution you'd end up with but if were, eg, two modulus ops to a b-tree with <20 members that is effectively instantaneous.

    You should also be able to taylor the hash function to the dictionary beyond just a sum of ascii values, eg, you could weight each position to create a greater possible array of keys, but don't aim for anything beyond 5% of the input size IMO. Even 5% is a lot. Hashing is a divide, then conquer scheme, not divide == conquer.
    Last edited by MK27; 04-03-2010 at 07:52 AM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. char Handling, probably typical newbie stuff
    By Neolyth in forum C Programming
    Replies: 16
    Last Post: 06-21-2009, 04:05 AM
  2. Replies: 4
    Last Post: 03-03-2006, 02:11 AM
  3. Classes inheretance problem...
    By NANO in forum C++ Programming
    Replies: 12
    Last Post: 12-09-2002, 03:23 PM
  4. creating class, and linking files
    By JCK in forum C++ Programming
    Replies: 12
    Last Post: 12-08-2002, 02:45 PM
  5. Warnings, warnings, warnings?
    By spentdome in forum C Programming
    Replies: 25
    Last Post: 05-27-2002, 06:49 PM