Thread: Searching texts via hashing

  1. #1
    Registered User
    Join Date
    Apr 2020
    Location
    Greater Philadelphia
    Posts
    26

    Searching texts via hashing

    This is more an algorithm question than a C language question at this stage. Can anyone refer me to article(s) that a non-mathematician can understand on the technique of using hashes to search a set of text records? If this technique has a name, I don't know what it is.

    What I'm trying to do exploits the fact that the cube root of 65536 is just over 40. This means that a 2-byte unsigned integer J can uniquely represent any case-insensitive sequence of three alphanumeric characters (36*36*36 = 46656 codes), with room for up to four additional characters in the alphabet if desired (40*40*40 would still fit). For each text in a database, we then make a hash record of N bits (the size predetermined and customizable for the application): for each overlapping triplet in the text, the bit at position J%N is set in the hash record. The string "ABCD", for example, would call for two bits to be set, for the triplets "ABC" and "BCD" (unless the two remainders happen to coincide, in which case only one bit is set).
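    In rough C, the whole scheme looks something like this (a simplified sketch, not my actual code; the constant names, the helper function, and the choice to reset the window at non-alphanumeric characters are all just illustrative):

    Code:
    #include <ctype.h>
    #include <stdint.h>
    #include <string.h>

    #define HASH_BYTES 32                      /* 256-bit hash record */
    #define HASH_BITS  (HASH_BYTES * 8)

    /* Map a character to 0..35 (case-insensitive alphanumerics);
       return -1 for anything else. */
    static int sym(int c)
    {
        if (isdigit(c)) return c - '0';
        if (isalpha(c)) return 10 + (tolower(c) - 'a');
        return -1;
    }

    /* Build the hash record: one bit per overlapping triplet,
       at position J % HASH_BITS. */
    static void make_hash(const char *text, uint8_t hash[HASH_BYTES])
    {
        int s[3] = {0, 0, 0}, run = 0;

        memset(hash, 0, HASH_BYTES);
        for (; *text; text++) {
            int v = sym((unsigned char)*text);
            if (v < 0) { run = 0; continue; }  /* non-alphanumerics reset the window */
            s[0] = s[1]; s[1] = s[2]; s[2] = v;
            if (++run >= 3) {
                /* J = s0*36^2 + s1*36 + s2, at most 46655, fits in 16 bits */
                uint16_t j = (uint16_t)((s[0] * 36 + s[1]) * 36 + s[2]);
                hash[(j % HASH_BITS) / 8] |= 1u << ((j % HASH_BITS) % 8);
            }
        }
    }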

    The effect is that similar plain texts produce similar hashes. In particular, the same words appearing in whatever order would produce similar hashes. The hash of a query can be compared against an array or file of hashes representing a database to retrieve the records most likely to be of interest.
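    The comparison itself can be as crude as counting the bits the two hashes have in common. Again only a sketch, reusing HASH_BYTES from above (my actual scoring is a bit more elaborate):

    Code:
    /* Count the bits set in both a and b (a crude similarity score). */
    static int shared_bits(const uint8_t *a, const uint8_t *b, size_t n)
    {
        int count = 0;
        for (size_t i = 0; i < n; i++) {
            uint8_t x = a[i] & b[i];
            while (x) { x &= x - 1; count++; }  /* Kernighan's bit-count */
        }
        return count;
    }

    /* Linear scan: return the record whose hash shares the most bits
       with the query hash. */
    static long best_match(const uint8_t (*db)[HASH_BYTES], long nrecords,
                           const uint8_t query[HASH_BYTES])
    {
        long best = -1;
        int best_score = -1;

        for (long i = 0; i < nrecords; i++) {
            int s = shared_bits(db[i], query, HASH_BYTES);
            if (s > best_score) { best_score = s; best = i; }
        }
        return best;
    }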

    I've now done this much. With a database of some 200,000 records whose text fields average about 35 characters, the corresponding hashes of 32 bytes can be kept in memory. I am delighted with the results of queries for my purposes-- both their precision and the program's performance, even though the linear scan is inherently inefficient. A hash has a median of about 14 bits set (less than 6 percent of the 256 available). The hash size could probably be reduced, maybe even halved, without introducing too many false positives. I also have in mind a way (with another layer of complexity) that should make it practical to support a much larger database.

    I'd welcome your thoughts on the handful of central C functions in the program (fairly simple, actually), but first I would like to look at a few existing discussions of such procedures, if there are any that I can understand. Perhaps I am missing simple refinements, or have outright errors.

    The cube root of 65536 (the two-byte range) is the same number as the sixth root of the four-byte range-- both come to 2^(16/3). That I said "amazing!" rather than "of course!" when noticing this gives you some idea of my current math level, after probably forgetting half of what I ever learned.

    Surely none of this is original. For all I know, Google uses some elaborate form of it. But it does differ from many applications of hashing in that here we want similar hashes for similar plain texts, whereas usually we want completely different, near-random hashes, making the original text as hard to recover as possible. So most discussions of hashing are not very relevant.
    Last edited by Alogon; 04-18-2020 at 12:54 AM.

  2. #2
    Registered User
    Join Date
    Dec 2017
    Posts
    1,633
    A little inaccuracy saves tons of explanation. - H.H. Munro

  3. #3
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by Alogon View Post
    This is more an algorithm question than a C language question at this stage. Can anyone refer me to article(s) that a non-mathematician can understand on the technique of using hashes to search a set of text records? If this technique has a name, I don't know what it is.
    [...]
    Sure, let's see what you've got. Have you had a chance to look into Bloom filters (https://en.wikipedia.org/wiki/Bloom_filter) yet? The lookup time-complexity is more or less O(1).

  4. #4
    Registered User
    Join Date
    Apr 2020
    Location
    Greater Philadelphia
    Posts
    26
    Quote Originally Posted by Sir Galahad View Post
    Sure, let's see what you've got. Have you had a chance to look into Bloom filters (https://en.wikipedia.org/wiki/Bloom_filter) yet? The lookup time-complexity is more or less O(1).
    Very interesting idea. But the article says a query returns either "possibly in set" or "definitely not in set". Are Bloom filters most appropriate for applications that seek only exact matches? As developed so far, my program obtains a score from comparing hashes, keeps references to the top ten scores on a running basis, and finally displays the top-scoring record along with any others that share the highest score. The user may then ask to see all of the top ten, ordered by score. Possibly there is more than one exact match (that is, the normalized query text is identical to the entire normalized text of a database record), or there is no exact match but an approximate match (one or more matching substrings) provides what the user is looking for.

    But could a Bloom filter be useful as a preliminary step to prevent any further examination of negatives, even if it does not suffice by itself? Would this work: (1) hash the records as I am doing; (2) make a Bloom filter on the hashes; (3) hash a query in the same manner as step 1; (4) query the Bloom filter for each bit set in the query hash and score the results; (5) examine the top-scoring original hashes, which should probably be many more than ten, but still a fraction of the whole.

    But this might be redundant with another idea I had to accommodate a larger database: store the hashes not in the way they were originally produced (file A), but conceptually rotated 90 degrees (B)-- the same data arranged differently. B would contain one record per bit available in a record of A, and the bits of B would represent record numbers in the database. Each hash record in B would be much larger, but only those records in B corresponding to a bit set in the query hash would need to be consulted. Perhaps this technique is known as an inverted index.
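    Roughly, in its in-memory form (a sketch under my naming, reusing HASH_BITS and HASH_BYTES from my first post; column[j] plays the role of one record of B, a bitmap over all record numbers):

    Code:
    /* B: one bitmap per hash-bit position; bit r of column[j] is set
       exactly when bit j is set in the hash of record r. */
    #define NRECORDS  200000
    #define ROW_BYTES ((NRECORDS + 7) / 8)

    static uint8_t column[HASH_BITS][ROW_BYTES];

    /* Score every record against a query hash by visiting only the
       columns for bits set in the query (typically ~14 of 256). */
    static void score_all(const uint8_t query[HASH_BYTES],
                          uint16_t score[NRECORDS])
    {
        memset(score, 0, NRECORDS * sizeof score[0]);
        for (int j = 0; j < HASH_BITS; j++) {
            if (!(query[j / 8] & (1u << (j % 8))))
                continue;                       /* bit j not in query: skip column */
            for (long r = 0; r < NRECORDS; r++)
                if (column[j][r / 8] & (1u << (r % 8)))
                    score[r]++;                 /* record r shares bit j */
        }
    }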


    How the final score is calculated is an interesting topic in itself, probably involving measures such as Hamming distance or the Jaccard coefficient (both easy to compute on bit vectors; see the sketch below). I must eventually play with these to see what gives the best results for my needs.
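    For reference, the two measures on bit vectors (again just a sketch; n is the hash size in bytes):

    Code:
    /* Hamming distance: number of bit positions where a and b differ. */
    static int hamming(const uint8_t *a, const uint8_t *b, size_t n)
    {
        int d = 0;
        for (size_t i = 0; i < n; i++) {
            uint8_t x = a[i] ^ b[i];
            while (x) { x &= x - 1; d++; }
        }
        return d;
    }

    /* Jaccard coefficient: |A AND B| / |A OR B|, between 0.0 (no bits
       in common) and 1.0 (identical bit sets). */
    static double jaccard(const uint8_t *a, const uint8_t *b, size_t n)
    {
        int inter = 0, uni = 0;

        for (size_t i = 0; i < n; i++) {
            uint8_t x = a[i] & b[i], y = a[i] | b[i];
            while (x) { x &= x - 1; inter++; }
            while (y) { y &= y - 1; uni++; }
        }
        return uni ? (double)inter / uni : 1.0;
    }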

    Thanks for your interest! I want to post some code soon but first should test a couple of ideas to make it more efficient.
    Last edited by Alogon; 04-20-2020 at 10:27 PM.

  5. #5
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by Alogon View Post
    Very interesting idea. But the article says a query returns either "possibly in set" or "definitely not in set". Are Bloom filters most appropriate for applications that seek only exact matches?
    Well "possibly in set" necessarily implies inexact matching, so no. They just help speed up the whole process by skipping patterns which are "definitely not in set". And also relatively easy on memory.

    Quote Originally Posted by Alogon View Post
    But could a Bloom filter be useful as a preliminary step to prevent any further examination of negatives, even if it does not suffice by itself? Would this work: (1) hash the records as I am doing; (2) make a Bloom filter on the hashes; (3) hash a query in the same manner as step 1; (4) query the Bloom filter for each bit set in the query hash and score the results;
    Sounds about right so far...

    Quote Originally Posted by Alogon View Post
    (5) examine the top-scoring original hashes, which should probably be many more than ten, but still a fraction of the whole.
    [...]
    How the final score is calculated is an interesting topic in itself, probably involving measures such as Hamming distance or Jaccard coefficient.
    Sorry, you lost me there. What exactly do you mean by scoring hashes?

    Quote Originally Posted by Alogon View Post
    But this might be redundant with another idea I had to accommodate a larger database: store the hashes not in the way they were originally produced (file A), but conceptually rotated 90 degrees (B)-- the same data arranged differently. B would contain one record per bit available in a record of A, and the bits of B would represent record numbers in the database. Each hash record in B would be much larger, but only those records in B corresponding to a bit set in A would need to be consulted. Perhaps this technique is known as an inverted index.
    That I would not know. What's the point of the rotation?
