Thread: Searching texts via hashing

  1. #1
    Registered User
    Join Date
    Apr 2020
    Location
    Greater Philadelphia
    Posts
    26

    Searching texts via hashing

    This is more an algorithm question than a C language question at this stage. Can anyone refer me to article(s) that a non-mathematician can understand on the technique of using hashes to search a set of text records? If this technique has a name, I don't know what it is.

    What I'm trying to do exploits the fact that the cube root of 65536 is just over 40. This means that a 2-byte unsigned integer J can uniquely represent any case-insensitive sequence of three alphanumeric characters (36*36*36 = 46656 codes), with room for up to four additional characters in the alphabet if desired (40*40*40 would still fit). For each text in a database, we then make a hash record of N bits (the size predetermined and customizable for the application): for each overlapping triplet in the text, the bit at position J%N is set in the hash record. The string "ABCD", for example, would call for two bits to be set, for the triplets "ABC" and "BCD" (unless the two remainders happen to coincide, in which case only one bit is set).
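    In rough C, the whole scheme looks something like this (a simplified sketch, not my actual code; the constant names, the helper function, and the choice to reset the window at non-alphanumeric characters are all just illustrative):

    Code:
    #include <ctype.h>
    #include <stdint.h>
    #include <string.h>

    #define HASH_BYTES 32                      /* 256-bit hash record */
    #define HASH_BITS  (HASH_BYTES * 8)

    /* Map a character to 0..35 (case-insensitive alphanumerics);
       return -1 for anything else. */
    static int sym(int c)
    {
        if (isdigit(c)) return c - '0';
        if (isalpha(c)) return 10 + (tolower(c) - 'a');
        return -1;
    }

    /* Build the hash record: one bit per overlapping triplet,
       at position J % HASH_BITS. */
    static void make_hash(const char *text, uint8_t hash[HASH_BYTES])
    {
        int s[3] = {0, 0, 0}, run = 0;

        memset(hash, 0, HASH_BYTES);
        for (; *text; text++) {
            int v = sym((unsigned char)*text);
            if (v < 0) { run = 0; continue; }  /* non-alphanumerics reset the window */
            s[0] = s[1]; s[1] = s[2]; s[2] = v;
            if (++run >= 3) {
                /* J = s0*36^2 + s1*36 + s2, at most 46655, fits in 16 bits */
                uint16_t j = (uint16_t)((s[0] * 36 + s[1]) * 36 + s[2]);
                hash[(j % HASH_BITS) / 8] |= 1u << ((j % HASH_BITS) % 8);
            }
        }
    }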

    The effect is that similar plain texts produce similar hashes. In particular, the same words appearing in whatever order would produce similar hashes. The hash of a query can be compared against an array or file of hashes representing a database to retrieve the records most likely to be of interest.
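    The comparison itself can be as crude as counting the bits the two hashes have in common. Again only a sketch, reusing HASH_BYTES from above (my actual scoring is a bit more elaborate):

    Code:
    /* Count the bits set in both a and b (a crude similarity score). */
    static int shared_bits(const uint8_t *a, const uint8_t *b, size_t n)
    {
        int count = 0;
        for (size_t i = 0; i < n; i++) {
            uint8_t x = a[i] & b[i];
            while (x) { x &= x - 1; count++; }  /* Kernighan's bit-count */
        }
        return count;
    }

    /* Linear scan: return the record whose hash shares the most bits
       with the query hash. */
    static long best_match(const uint8_t (*db)[HASH_BYTES], long nrecords,
                           const uint8_t query[HASH_BYTES])
    {
        long best = -1;
        int best_score = -1;

        for (long i = 0; i < nrecords; i++) {
            int s = shared_bits(db[i], query, HASH_BYTES);
            if (s > best_score) { best_score = s; best = i; }
        }
        return best;
    }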

    I've now done this much. With a database of some 200,000 records whose text fields average about 35 characters, the corresponding hashes of 32 bytes can be kept in memory. I am delighted with the results of queries for my purposes-- both their precision and the program's performance, even though the linear scan is inherently inefficient. A hash has a median of about 14 bits set (less than 6 percent of the 256 available). The hash size could probably be reduced, maybe even halved, without introducing too many false positives. I also have in mind a way (with another layer of complexity) that should make it practical to support a much larger database.

    I'd welcome your thoughts on the handful of central C functions in the program (fairly simple, actually), but first I would like to look at a few existing discussions of such procedures, if there are any that I can understand. Perhaps I am missing simple refinements, or have outright errors.

    The cube root of 65536 (the two-byte range) is the same number as the sixth root of the four-byte range-- both come to 2^(16/3). That I said "amazing!" rather than "of course!" when noticing this gives you some idea of my current math level, after probably forgetting half of what I ever learned.

    Surely none of this is original. For all I know, Google uses some elaborate form of it. But it does differ from many applications of hashing in that here we want similar hashes for similar plain texts, whereas usually we want completely different, near-random hashes, making the original text as hard to recover as possible. So most discussions of hashing are not very relevant.
    Last edited by Alogon; 04-18-2020 at 12:54 AM.

  2. #2
    Registered User
    Join Date
    Dec 2017
    Posts
    1,633
    A little inaccuracy saves tons of explanation. - H.H. Munro

  3. #3
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by Alogon View Post
    This is more an algorithm question than a C language question at this stage. Can anyone refer me to article(s) that a non-mathematician can understand on the technique of using hashes to search a set of text records? If this technique has a name, I don't know what it is.
    [...]
    Sure, let's see what you've got. Have you had a chance to look into Bloom filters (https://en.wikipedia.org/wiki/Bloom_filter) yet? The lookup time-complexity is more or less O(1).

  4. #4
    Registered User
    Join Date
    Apr 2020
    Location
    Greater Philadelphia
    Posts
    26
    Quote Originally Posted by Sir Galahad View Post
    Sure, let's see what you've got. Have you had a chance to look into Bloom filters (https://en.wikipedia.org/wiki/Bloom_filter) yet? The lookup time-complexity is more or less O(1).
    Very interesting idea. But the article says a query returns either "possibly in set" or "definitely not in set". Are Bloom filters most appropriate for applications that seek only exact matches? As developed so far, my program obtains a score from comparing hashes, keeps references to the top ten scores on a running basis, and finally displays the top-scoring record along with any others that share the highest score. The user may then ask to see all of the top ten, ordered by score. Possibly there is more than one exact match (that is, the normalized query text is identical to the entire normalized text of a database record), or there is no exact match but an approximate match (one or more matching substrings) provides what the user is looking for.

    But could a Bloom filter be useful as a preliminary step to prevent any further examination of negatives, even if it does not suffice by itself? Would this work: (1) hash the records as I am doing; (2) make a Bloom filter on the hashes; (3) hash a query in the same manner as step 1; (4) query the Bloom filter for each bit set in the query hash and score the results; (5) examine the top-scoring original hashes, which should probably be many more than ten, but still a fraction of the whole.

    But this might be redundant with another idea I had to accommodate a larger database: store the hashes not in the way they were originally produced (file A), but conceptually rotated 90 degrees (B)-- the same data arranged differently. B would contain one record per bit available in a record of A, and the bits of B would represent record numbers in the database. Each hash record in B would be much larger, but only those records in B corresponding to a bit set in the query hash would need to be consulted. Perhaps this technique is known as an inverted index.
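    Roughly, in its in-memory form (a sketch under my naming, reusing HASH_BITS and HASH_BYTES from my first post; column[j] plays the role of one record of B, a bitmap over all record numbers):

    Code:
    /* B: one bitmap per hash-bit position; bit r of column[j] is set
       exactly when bit j is set in the hash of record r. */
    #define NRECORDS  200000
    #define ROW_BYTES ((NRECORDS + 7) / 8)

    static uint8_t column[HASH_BITS][ROW_BYTES];

    /* Score every record against a query hash by visiting only the
       columns for bits set in the query (typically ~14 of 256). */
    static void score_all(const uint8_t query[HASH_BYTES],
                          uint16_t score[NRECORDS])
    {
        memset(score, 0, NRECORDS * sizeof score[0]);
        for (int j = 0; j < HASH_BITS; j++) {
            if (!(query[j / 8] & (1u << (j % 8))))
                continue;                       /* bit j not in query: skip column */
            for (long r = 0; r < NRECORDS; r++)
                if (column[j][r / 8] & (1u << (r % 8)))
                    score[r]++;                 /* record r shares bit j */
        }
    }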


    How the final score is calculated is an interesting topic in itself, probably involving measures such as Hamming distance or the Jaccard coefficient (both easy to compute on bit vectors; see the sketch below). I must eventually play with these to see what gives the best results for my needs.
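    For reference, the two measures on bit vectors (again just a sketch; n is the hash size in bytes):

    Code:
    /* Hamming distance: number of bit positions where a and b differ. */
    static int hamming(const uint8_t *a, const uint8_t *b, size_t n)
    {
        int d = 0;
        for (size_t i = 0; i < n; i++) {
            uint8_t x = a[i] ^ b[i];
            while (x) { x &= x - 1; d++; }
        }
        return d;
    }

    /* Jaccard coefficient: |A AND B| / |A OR B|, between 0.0 (no bits
       in common) and 1.0 (identical bit sets). */
    static double jaccard(const uint8_t *a, const uint8_t *b, size_t n)
    {
        int inter = 0, uni = 0;

        for (size_t i = 0; i < n; i++) {
            uint8_t x = a[i] & b[i], y = a[i] | b[i];
            while (x) { x &= x - 1; inter++; }
            while (y) { y &= y - 1; uni++; }
        }
        return uni ? (double)inter / uni : 1.0;
    }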

    Thanks for your interest! I want to post some code soon but first should test a couple of ideas to make it more efficient.
    Last edited by Alogon; 04-20-2020 at 10:27 PM.

  5. #5
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by Alogon View Post
    Very interesting idea. But the article says a query returns either "possibly in set" or "definitely not in set". Are Bloom filters most appropriate for applications that seek only exact matches?
    Well "possibly in set" necessarily implies inexact matching, so no. They just help speed up the whole process by skipping patterns which are "definitely not in set". And also relatively easy on memory.

    Quote Originally Posted by Alogon View Post
    But could a Bloom filter be useful as a preliminary step to prevent any further examination of negatives, even if it does not suffice by itself? Would this work: (1) hash the records as I am doing; (2) make a Bloom filter on the hashes; (3) hash a query in the same manner as step 1; (4) query the Bloom filter for each bit set in the query hash and score the results;
    Sounds about right so far...

    Quote Originally Posted by Alogon View Post
    (5) examine the top-scoring original hashes, which should probably be many more than ten, but still a fraction of the whole.
    [...]
    How the final score is calculated is an interesting topic in itself, probably involving measures such as Hamming distance or Jaccard coefficient.
    Sorry, you lost me there. What exactly do you mean by scoring hashes?

    Quote Originally Posted by Alogon View Post
    But this might be redundant with another idea I had to accommodate a larger database: store the hashes not in the way they were originally produced (file A), but conceptually rotated 90 degrees (B)-- the same data arranged differently. B would contain one record per bit available in a record of A, and the bits of B would represent record numbers in the database. Each hash record in B would be much larger, but only those records in B corresponding to a bit set in A would need to be consulted. Perhaps this technique is known as an inverted index.
    That I would not know. What's the point of the rotation?
