
Originally Posted by Alogon
This is more an algorithm question than a C language question at this stage. Can anyone refer me to article(s) that a non-mathematician can understand on the technique of using hashes to search a set of text records? If this technique has a name, I don't know what it is.
What I'm trying exploits the fact that the cube root of 65536 is just over 40. This means that a 2-byte unsigned integer J can uniquely represent any case-insensitive sequence of three alphanumeric characters (36*36*36 = 46656 codes), and the alphabet could even be extended by up to four more symbols if desired (40*40*40 = 64000 still fits). For each text in the database, we then build a hash record of N bits (the size predetermined and customizable for the application): for each overlapping triplet, the bit corresponding to J % N is set in the hash record. The string "ABCD", for example, calls for two bits to be set, one for the triplet "ABC" and one for "BCD" (or just one bit, if the two remainders happen to coincide).
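To make the encoding step concrete, here is a simplified sketch of how it can look; the helper names and the choice to skip any triplet containing a non-alphanumeric character are just illustrative assumptions, not necessarily what my actual program does:

[code]
#include <stdint.h>
#include <string.h>
#include <ctype.h>

#define HASH_BITS  256                /* N: signature size in bits      */
#define HASH_BYTES (HASH_BITS / 8)    /* 32 bytes, as described above   */

/* Map a character to 0..35 (case-insensitive alphanumeric), else -1. */
static int sym(int c)
{
    if (isdigit(c)) return c - '0';
    if (isalpha(c)) return 10 + (tolower(c) - 'a');
    return -1;
}

/* Build the N-bit hash record of a text from its overlapping triplets. */
void make_sig(const char *text, uint8_t sig[HASH_BYTES])
{
    memset(sig, 0, HASH_BYTES);
    for (size_t i = 0; text[i] && text[i+1] && text[i+2]; i++) {
        int a = sym((unsigned char)text[i]);
        int b = sym((unsigned char)text[i+1]);
        int c = sym((unsigned char)text[i+2]);
        if (a < 0 || b < 0 || c < 0)
            continue;                 /* one possible policy: skip such triplets */
        uint16_t j = (uint16_t)((a * 36 + b) * 36 + c);   /* 36^3 < 65536 */
        unsigned bit = j % HASH_BITS;
        sig[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}
[/code]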
The effect is that similar plain texts produce similar hashes; in particular, the same words appearing in whatever order yield much the same result. The hash of a query can then be compared against an array or file of hashes representing the database to retrieve the records most likely to be of interest.
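The comparison itself amounts to counting shared bits: AND the query's hash with each stored hash and count the surviving bits, then keep the highest-scoring records. A sketch of that step (again simplified, reusing HASH_BYTES from the block above):

[code]
/* Score one stored hash against the query: the number of bits they
   share. More shared bits means more shared triplets, i.e. a likelier
   match. */
int sig_score(const uint8_t q[HASH_BYTES], const uint8_t r[HASH_BYTES])
{
    int score = 0;
    for (int i = 0; i < HASH_BYTES; i++) {
        uint8_t both = q[i] & r[i];
        while (both) {            /* Kernighan: clear the lowest set bit */
            both &= both - 1;
            score++;
        }
    }
    return score;
}
[/code]

Ranking the whole database means calling this once per stored hash, which is the linear scan I call inherently inefficient below; one could instead demand that every query bit be present in the record ((q[i] & r[i]) == q[i] for all i) as a stricter filter.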
I've now done this much. With a database of some 200,000 records whose text fields average about 35 characters, the corresponding 32-byte hashes (200,000 * 32 bytes is only about 6.4 MB) can be kept in memory. I am delighted with the results of queries for my purposes, both their precision and the program's performance, even though the search itself is inherently inefficient. A hash contains a median of ca. 14 bits set (less than 6 percent of 256), so the hash size could probably be reduced, maybe even halved, without introducing too many false positives. I also have in mind a way (with another layer of complexity) that should make it practical to support a much larger database.
I'd welcome your thoughts on the several central C functions in the program (fairly simple, actually), but first I would like to look at a few existing discussions of such procedures, if there are any that I can understand. Perhaps I am missing simple refinements, or even making errors.
The cube root of 65536 (the two-byte range) is the same number as the sixth root of the four-byte range: both are 2^(16/3), since 65536 = 2^16 and the four-byte range is 2^32 = 65536^2. That I said "amazing!" rather than "of course!" on noticing this gives you some idea of my current math level, after probably forgetting half of what I ever learned.
Surely none of this is original. For all I know, Google uses some elaborate form of it. But it does differ from many applications of hashing in that we want similar hashes for similar plain texts, whereas usually the goal is the opposite: hashes close to random, so that inverting them is as difficult as possible. So most discussions of hashing are not very relevant.