"It's all about the distribution of entries in the table. The idea is that
you want to spread them out throughout the table as best as you can, so that
you minimize collisions. If the multiplier and the table size are relatively
prime, then you end up distributing entries across every "slot" in the table,
assuming "sufficiently random" data.
The easiest way to get the multiplier and hash table size relatively prime,
meaning they share no common factors, or that if n = the table size is a
positive integer greater than 1 and m = the multiplier is likewise a
positive integer greater than 1, then gcd(n, m) = 1, where gcd(n, m) is the
positive greatest common divisor of n and m, is to make one or both prime
numbers. Recall that a mod n = r (a, n positive integers) means that for
some integer s, a = ns + r. We assume further that n > m.
To show why this relatively prime business is important, suppose we are
doing open hashing as in my earlier example with some other method for
collision resolution (e.g., chaining). Now, let d = gcd(n, m) >= 1.
Given an input datum consisting of the string a_1 a_2 a_3 ... a_k, where
0 <= a_k <= B for some constant positive integer B, the hash becomes
the nonnegative integer,
k
---
r = ( > m a ) mod n
--- i
i=1
Let u be the sum over i = 1 to k of a_i; then the above becomes:
r = (mu) mod n,
since m is constant with respect to the sum, and thus can be factored out.
But, since m has d as a factor, we know there exists some positive integer q
such that m = dq and thus the hash is equivalent to r = (dqu) mod n which
implies that dqu = sn + r => dqu - r = sn => s | (dqu - r), where ``a | b''
stands for "a divides b" (or b has a as a factor), and s is as before. But,
recalling that n also has d as a factor, we see that there exists some
positive integer t such that n = dt. Then, dqu = sdt + r which implies that
dqu - r = sdt => sd | (dqu - r) => there exists some integer v such that
v = (dqu - r) / sd => vs = (dqu - r) / d is also an integer (by closure of
multiplication in the integers). Then, vs = (dqu - r) / d = dqu/d - r/d.
But, clearly d | dqu and dqu/d = qu is clearly an integer by closure, and
since vs is an integer, it follows that r/d is also an integer and hence
d | r => every hash value must be a multiple of d. It follows that every
slot in the hash table that is not an integer multiple of d is unused.
If d > 1 => n, m are not relatively prime, then only every d'th slot in
the table will be used. Consider the case where n = 8 and m = 4 and then
gcd(n, m) = 4 implies that only the 0'th and 4'th slots in the table will
be used, and the hash table is only as effective as one with two slots.
On the other hand, if n and m are relatively prime, then by definition,
gcd(n, m) = 1 and every slot in the table will be used.
Of course, this still all assumes that the data we're hashing cooperates and
doesn't give us some weird values. For instance, if the data were always
some multiple of the modulus when shifted and summed, you'd end up in the
same boat. You can try and ameliorate this case by careful choice of
collision resolution techniques, using double hashing with hash algorithms,
etc. It gets into some real black magic.
In practice, however, for most string data the function I posted is quite
good.
"