Quote:

"It's all about the distribution of entries in the table. The idea is that

you want to spread them out throughout the table as best as you can, so that

you minimize collisions. If the multiplier and the table size are relatively

prime, then you end up distributing entries across every "slot" in the table,

assuming "sufficiently random" data.

The easiest way to get the multiplier and hash table size relatively prime,

meaning they share no common factors, or that if n = the table size is a

positive integer greater than 1 and m = the multiplier is likewise a

positive integer greater than 1, then gcd(n, m) = 1, where gcd(n, m) is the

positive greatest common divisor of n and m, is to make one or both prime

numbers. Recall that a mod n = r (a, n positive integers) means that for

some integer s, a = ns + r. We assume further that n > m.

To show why this relatively prime business is important, suppose we are

doing open hashing as in my earlier example with some other method for

collision resolution (e.g., chaining). Now, let d = gcd(n, m) >= 1.

Given an input datum consisting of the string a_1 a_2 a_3 ... a_k, where

0 <= a_k <= B for some constant positive integer B, the hash becomes

the nonnegative integer,

k

---

r = ( > m a ) mod n

--- i

i=1

Let u be the sum over i = 1 to k of a_i; then the above becomes:

r = (mu) mod n,

since m is constant with respect to the sum, and thus can be factored out.

But, since m has d as a factor, we know there exists some positive integer q

such that m = dq and thus the hash is equivalent to r = (dqu) mod n which

implies that dqu = sn + r => dqu - r = sn => s | (dqu - r), where ``a | b''

stands for "a divides b" (or b has a as a factor), and s is as before. But,

recalling that n also has d as a factor, we see that there exists some

positive integer t such that n = dt. Then, dqu = sdt + r which implies that

dqu - r = sdt => sd | (dqu - r) => there exists some integer v such that

v = (dqu - r) / sd => vs = (dqu - r) / d is also an integer (by closure of

multiplication in the integers). Then, vs = (dqu - r) / d = dqu/d - r/d.

But, clearly d | dqu and dqu/d = qu is clearly an integer by closure, and

since vs is an integer, it follows that r/d is also an integer and hence

d | r => every hash value must be a multiple of d. It follows that every

slot in the hash table that is not an integer multiple of d is unused.

If d > 1 => n, m are not relatively prime, then only every d'th slot in

the table will be used. Consider the case where n = 8 and m = 4 and then

gcd(n, m) = 4 implies that only the 0'th and 4'th slots in the table will

be used, and the hash table is only as effective as one with two slots.

On the other hand, if n and m are relatively prime, then by definition,

gcd(n, m) = 1 and every slot in the table will be used.

Of course, this still all assumes that the data we're hashing cooperates and

doesn't give us some weird values. For instance, if the data were always

some multiple of the modulus when shifted and summed, you'd end up in the

same boat. You can try and ameliorate this case by careful choice of

collision resolution techniques, using double hashing with hash algorithms,

etc. It gets into some real black magic.

In practice, however, for most string data the function I posted is quite

good.

"