# Question about Hash (once more)

• 01-12-2009
audinue
Anyone know how important Hash is? Hash table, etc.

I've read in many (Compiler/Interpreter Construction)/Parsing books, including The Dragon Book, that to match series of string with one string, we must use hashes.

For instance, matching the current token with symbols in symbol table.

Because of my stupid-ness and I don't know how those hashing algorithm works.
The only one that can I do is just TEST them. And that spend 3 days (3 x 24 hours) to test those algorithm. Fortunately I have 3 computers in my house with 2 cores each (so I can do multithreading).

Somebody has made a program like that, but with random strings and I absolutely don't trust to those random things. I just need the exact one that guarantee anything.

And what is the result?
RS, JS, PJW, ELF, BKDR, SDBM, DJB, DJB2, DEK and AP hash algorithms still have collisions even with C variable naming character set: A..Z, a..z, 0..9, _, \$

What's wrong with linear-strings searching?

• 01-12-2009
CornedBee
Hashing is a very important technique in many problems.

However, unless you have a perfect hasher (and finding an formulaic perfect hasher is nearly impossible), hash comparison can only be used as a fast negative chooser! A search in a hash table first looks up candidates by hash, and then does a normal comparison to find a definite match.

This means that a few collisions don't matter. Unless it's a cryptographic hash, but that's a completely different area.

Comparison of a series of known strings with one unknown string means that you have precomputed the hashes of the known strings. So you compute the hash of the new string and then can do just a single comparison with every string of the series. If you do normal string comparison, you have to compare the full strings every time. This is a lot slower.
Or you do a hash lookup. Since hashes are numbers, you can modulo them to an array index and quickly find possible candidates for matching. Thus, you avoid comparing against every string in the set. If you use normal string comparison, the best you can do is a tree of the strings.

So, for M strings and a string length of N (for simplicity, we assume that all strings are the same length), the complexity of completely naive search is M*N. For naive comparison in a map, it's log(M) * N. For hashing with naive search, assuming no collisions in the particular string set you have (which is likely), the cost is M+N (N for calculating the hash of the search string, M for comparing it against every hash, and N again (but that doesn't matter) for doing the validation comparison). For a hash map, the cost is N, for calculating the hash of the search string and doing the validation comparison.

So as you can easily see, the hash map has by far the best complexity.

You're just hung up on the details.
• 01-12-2009
audinue
Quote:

However, unless you have a perfect hasher (and finding an formulaic perfect hasher is nearly impossible), hash comparison can only be used as a fast negative chooser! A search in a hash table first looks up candidates by hash, and then does a normal comparison to find a definite match.
I got it! Thanks.

Quote:

You're just hung up on the details.
Thank you for remind me, CornedBee.

By the way, do you have any suggestion for choosing hash algorithm?

I quiet don't understand about them...
• 01-12-2009
VirtualAce
Hash collisions are no big deal. If one occurs you can decide what to do but it is my opinion that instead of re-hashing or doing something else that it is the user of the hash table, not the hash table object itself, to use a different string. This works in most applications and is far simpler to code.

Hashes are extremely important when it comes to strings. Sure you could use a map but you are looking up a string value every time you access the map. With a hash you just hash the string once and return it to the caller. Now the caller accesses the object via the hash and you get O(1) access as opposed to O(log N) with a map.

I have made hash tables that will compute a random string and re-hash and all in all it really does not take long to create random strings to resolve collisions. In either case the user of the hash table will receive it's ID back from the hash table object just the same. There is another approach where you can add the object to a list at the hashed index. This gives O(1) to find the index of the object and O(n) to resolve collisions.

There are some excellent hash algorithms available on the internet.
• 01-13-2009
CornedBee
It depends on the situation. When the strings you add to the table come from some external source, you cannot well ask it to use a different string, so you'll go with the link bucket hash map.
• 01-13-2009
audinue
Quote: