# Rabin-Karp algorithm

• 09-24-2010
Aisthesis
Rabin-Karp algorithm
This is an algorithm for finding a pattern substring in a text string. I got it from Skiena, The Algorithm Design Manual, 2008, pp. 91f. It works by hashing the substring, and it starts getting nice results when the substring is fairly long, because it reuses the hash value derived for the previous test window: it just chops the first character's contribution off the hash and adds on one new character.

My questions are how to optimize performance by choosing an appropriate modulus for hashing (big modulus should mean fewer false positives but i would expect it to make multiplication and modding slower--?) and exactly when it should start being faster than a character-by-character search. My initial experiments suggest that this implementation isn't as fast as the find() method in <string>, but it's pretty fast, even on a string of length 10M with the pattern substring of length 100K tacked on at the end.

Code:

```
#ifndef STRING_SEARCH_H
#define STRING_SEARCH_H

#include <cstddef>        // for size_t
#include <string>

namespace hash_search {

        const size_t BASE = 128; // small ASCII

        /**
        @param s String for which to return a hash value
        @param modulus number (preferably prime) to use for hashing
        @return Hash value
        */
        int hasher(const std::string &s, const unsigned &modulus);

        /**
        There should be some kind of "golden value" for modulus.
        The bigger modulus is, the more accurate the hasher,
        but the hash test will presumably also tend to get slower.
        @param text String to be tested for instances of pat
        @param pat Pattern string for which to test
        @param modulus for choosing the method by which to hash before exhaustive comparisons
        @return Index of the first instance of pat found in text;
                npos if pat not found;
                0 if pat is empty string
        */
        size_t find(const std::string &text, const std::string &pat, const unsigned &modulus);

        unsigned prime_generator(unsigned min);        // will generate the next prime at least as big as min
}

#endif
```
and here's the implementation (not very proud of my prime generator, but it works, and that's not what i'm trying to optimize)

Code:

```
#include "string_search.h"

#include <cassert>
#include <cmath>
#include <string>

int hash_search::hasher(const std::string &s, const unsigned &modulus) {
        int result = 0;
        size_t len = s.size();
        for (size_t i = 0; i < len; ++i) {
                result = (result * hash_search::BASE + s.at(i)) % modulus;
        }
        return result;
}

size_t hash_search::find(const std::string &text, const std::string &pat, const unsigned &modulus) {
        assert(modulus > 1);
        size_t pat_len = pat.size();
        if (!pat_len) return 0;                                // pat is empty string
        if (text.size() < pat_len) return std::string::npos;   // no need to search if text is smaller than pat
        size_t last_text_index = text.size() - pat.size();
        int h_val = hash_search::hasher(pat, modulus);         // value we will be searching for
        int tmp = hash_search::hasher(text.substr(0, pat_len), modulus);
        size_t i = 0;
        // a number we'll need for applying the Rabin-Karp algorithm: alpha ^ (m - 1) in Skiena, ADM, p. 92
        size_t base_power = 1;
        for (size_t j = 0; j + 1 < pat_len; ++j) {             // size_t avoids a signed/unsigned mismatch
                base_power = (base_power * BASE) % modulus;
        }
        // main testing loop
        while (true) {
                // do an exhaustive comparison only AFTER checking for equality of hash values
                if (tmp == h_val && !text.substr(i, pat_len).compare(pat)) return i;
                /**
                The following is the decisive algorithm--cf. Skiena, ADM, p. 92.
                It provides a short-cut for calculating the hash.
                Otherwise we're saving no time at all vs. exhaustive comparison.
                */
                if (i == last_text_index) break;
                tmp += (modulus - base_power) * text.at(i);
                tmp %= modulus;
                tmp *= BASE;
                tmp %= modulus;
                tmp += text.at(i + pat_len);
                tmp %= modulus;
                ++i;
        }
        return std::string::npos;                              // not found
}

unsigned hash_search::prime_generator(unsigned min) {
        assert(min > 1);
        unsigned max_factor, i;
        while (true) {
                max_factor = std::sqrt((double)min);
                for (i = 2; i <= max_factor; ++i) {
                        if (min % i == 0) break;
                }
                if (i > max_factor) return min;                // result is prime
                ++min;
        }
}
```
• 09-24-2010
iMalc
So your main focus is of course the main testing loop.
Basically the multiplications and modding take the same amount of time regardless of the operand values, sorry. In practice your plan is just to find a large value that works well and then hard-code that value in, right? Then you can drop the inefficient prime generator.

The main issue here is that you need three mods per character, and divides or mods are quite slow.
This kind of thing will only be a win when the text to search through contains several bits that are a prefix of the pattern being searched for (the longer, the more pronounced).

I expect Boyer-Moore to be much faster than this one in practice.
Alternatively, even using something as simple as an XOR hash with this technique could improve the speed over the naive algorithm, whilst adding very little overhead.

Also, don't pass modulus by const-reference. For built-in types, passing by value is preferred and can be more efficient.
• 09-24-2010
Aisthesis
just by running it on some big strings with different kinds of structures, i came up with an empirically "optimal" modulus of something like 1511, although on my minimal sample sets i saw very little difference using primes above 700 and up to 8000-ish (i only even tested at values very far away from powers of 2, probably being too picky about that, but i don't think it matters).

i gather that in addition to boyer moore, knuth morris pratt is in the running, although they do say that a variant of rabin karp works for something like plagiarism detection--where you're searching for a certain pattern type rather than an exact substring match.

could you perhaps elaborate a little on the xor suggestion? i'm not seeing how you could use xor hashing in the rabin-karp, since it isn't a homomorphism, as modulus is (?).
• 09-25-2010
iMalc
You start with the XOR of the first pat_len characters. Then, while this hash does not match the hash of the pattern, you XOR the hash with the first byte currently contributing to it, effectively removing that byte's effect on the hash, and also XOR in the next unseen byte. After doing this once, you move from a hash of bytes 0...pat_len-1 to a hash of bytes 1...pat_len.

I.e. let H0 be the hash of bytes X0 to Xn-1. H1 (the hash of bytes X1 to Xn) can be obtained from H0 ^ X0 ^ Xn,
or in general Hm can be obtained from Hm-1 ^ Xm-1 ^ Xn+m-1. Sorry about the lack of subscripts there.
We're just exploiting the fact that XOR is associative, commutative, and its own inverse; nothing fancy.

Admittedly that's a very poor hash. It only detects that the window possibly contains the same characters, which could be in a totally different order.
It could be made better by, for example, rotating each byte by its index modulo 8 when both XORing it in and XORing it out, which would make "abc" hash differently from "cba".
• 09-25-2010
phantomotap
That isn't the only problem with a simple `xor' approach.

For example, searching for "banana" would effectively search for "b".

Using any modulo anywhere could be detrimental on some hardware.

While reading the original post, I had a thought that might extend the usefulness of a simple `xor' approach by using prime keys and a schedule that depends on the value of the preceding hash. After thinking it over for a few moments, I decided that such a scheme would make recovery difficult (read: expensive or impossible). So I tried for a simpler scheme that depends only on the values being added to and removed from the existing hash.

I wrote this code in place by ripping off of the source in the original post. It may have a few syntax bugs, but I figured a little source would get my thoughts out a lot faster than a lot of text. I didn't try to optimize anything; a complete rewrite and some explicit caching would probably get a little better performance.

Let me know how it handles if you test it out. All you need to do is throw some primes at the `keys' array.

Soma

Code:

```
static unsigned int keys[256 + 256];

unsigned int hash_search::hasher(const std::string &s, const unsigned &modulus) {
        size_t len = s.size();
        unsigned int result = 0;
        unsigned int offset = s.at(0);
        for (size_t i = 0; i < len; ++i) {
                result ^= keys[offset + s.at(i)];
                offset = s.at(i);
        }
        return result;
}

unsigned int hash_search::hasher_update(unsigned int old_hash, const char new_origin, const char new_finale,
                unsigned char origin, unsigned char finale) {
        old_hash ^= keys[origin + origin];
        old_hash ^= keys[origin + new_origin];
        old_hash ^= keys[new_origin + new_origin];
        old_hash ^= keys[finale + new_finale];
        return old_hash;
}

size_t hash_search::find(std::string text, const std::string &pat, const unsigned &modulus) {
        assert(modulus > 1);
        size_t pat_len = pat.size();
        if (!pat_len) return 0;                                // pat is empty string
        if (text.size() < pat_len) return std::string::npos;   // no need to search if text is smaller than pat
        size_t last_text_index = text.size() - pat.size();
        int h_val = hash_search::hasher(pat, modulus);         // value we will be searching for
        int tmp = hash_search::hasher(text.substr(0, pat_len), modulus);
        size_t i = 0;
        // main testing loop
        while (true) {
                // do an exhaustive comparison only AFTER checking for equality of hash values
                if (tmp == h_val && !text.substr(i, pat_len).compare(pat)) return i;
                if (i == last_text_index) break;
                tmp = hash_search::hasher_update(tmp, text.at(i + 1), text.at(i + pat_len), text.at(i), text.at(i + pat_len - 1));
                ++i;
        }
        return std::string::npos;                              // not found
}
```
• 09-25-2010
phantomotap
Oh, and there is an easy way to speed up the `hasher_update' function by pulling out two of the `xor' lines if you change the way the actual search function updates a little. I was just too lazy for that.

Soma
• 09-25-2010
Aisthesis
very interesting. if we restrict ourselves to small ASCII, keys could also be only 128 + 128 entries, right?

and the primes in keys all just need to be bigger than 256, correct? and maybe also a minimum distance of 5 from a power of two?

and your hasher then basically keeps XOR-filtering the result through primes, locating them quickly by treating each pair of adjacent characters in the string as an index into the keys array.

and you use the rabin-karp idea in that you un-xor (by xoring) the contribution at the front of the old window and perform the xor required for the new character at the end of it.

i'll try to digest and test this one sometime in the next couple of days. the xor-ing has to run way faster than mod-ing.
• 09-25-2010
phantomotap
Quote:

and the primes in keys all just need to be bigger than 256, correct?
No. The best potential for collision resistance is in the largest set of primes that fit in 32 bits, are spaced evenly (mostly) around the "dial", and have at least seven bits that aren't in common with any neighbor. You don't want mathematically similar strings hashing to the same value because of a quirk of your keys.

If you restrict yourself to the first 256 primes over 256, you are effectively guaranteeing collisions, because the "distance" (in bits) between any two words may be accounted for by any other character. The greater this "distance", the better your hash will be, since it is a simple "xor" hash. If you added rotation (which can't be recovered from as easily) it would not be a real concern, because more bits from the relevant components get better mixed.

Soma
• 09-25-2010
iMalc
Quote:

Originally Posted by phantomotap
That isn't the only problem with a simple `xor' approach.

Oh certainly not.

Quote:

For example, searching for "banana" would effectively search for "b".
I think you mean "ba"; however, that isn't directly a problem, because your hash is always over 6 characters (in this case). So you'd need 6 characters where some cancel out, e.g. "cdbcad" would do it (if that were a word) because the two c's and two d's cancel out.

Quote:

Originally Posted by phantomotap
Using any modulo anywhere could be detrimental on some hardware.

agreed.

XOR hashing combined with a table for randomly distributing the bits of the input would probably overcome most of the limitations of straight XOR hashing, at very little speed penalty.
• 09-26-2010
Aisthesis
is random bit distribution the only reason that we want keys[i] to be prime?

how's this as method for generating keys[]:
1) for the highest order bits, use a deterministic process to define slots for the primes s.t. the probability distribution of the highest order 8 bits is ideal (half have first bit set, half don't, half of each group of those has second bit set, etc.). then just pick a prime in the range of each slot.
2) hash those into the keys array using quadratic probing or double hashing and whatever good hash function might come to mind (i might use multiplication method, since that's one i haven't implemented yet)

sound ok?
• 09-26-2010
iMalc
You could simply fill the array of size n with 0 to n-1 and then perform a std::random_shuffle on it.
• 09-26-2010
Aisthesis
that makes sense--i might still do it the long way just for practice with various hashing methods.

do the primes i was suggesting sound reasonable for the contents of the array prior to shuffling?
• 09-26-2010
Aisthesis
first ignoring the (easy) issue of randomizing the order of the keys[] array, here's the way i'm initially populating it:

i implement it as a simple class:
Code:

```
class PrimeArray {
private:
        unsigned *arr;
        size_t size;
public:
        PrimeArray(size_t _size = BASE + BASE);
        ~PrimeArray() { delete [] arr; }
        unsigned at(const size_t &i) const;
};
```
then i modify my prime_generator() and define a simple lg() function:
Code:

```
/**
Generates a prime in proximity to num.
When upward is set to true (default), the returned prime is >= num.
When upward is set to false, the returned prime is < num.
@param num
@param upward
@return a prime number
*/
unsigned prime_generator(unsigned num, bool upward = true);

unsigned lg(unsigned num);        // finds log base 2, rounding down
```
then i implement the class constructor like this (so far, still needs to get randomized order, and, as iMalc points out, that's a 1-liner using std::random_shuffle, although i'm going to make it more complicated just for practice with various hashing techniques):
Code:

```
rabin_karp::PrimeArray::PrimeArray(size_t _size) {
        assert(_size > 0);
        size = _size;
        arr = new unsigned[_size];
        // find the number of higher-order bits to worry about
        unsigned high_bits = rabin_karp::lg(_size);
        unsigned tmp = ~0U;        // all bits set
        unsigned decrement = 1U << (32 - high_bits);
        tmp -= decrement >> 1;
        size_t i;
        // populate the array with approximate (non-prime) values
        for (i = 0; i < _size; ++i) {
                arr[i] = tmp;
                tmp -= decrement;
        }
        // replace them with appropriate prime values
        for (i = 0; i < _size; ++i) {
                arr[i] = prime_generator(arr[i], (i + 1) % 2);
                if (i && arr[i] == arr[i - 1]) {
                        arr[i] = prime_generator(arr[i], false);
                }
        }
}
```
• 09-26-2010
Aisthesis
just looking at a few entries in it, the distribution looks reasonable to me.
• 09-26-2010
Aisthesis
oddly enough, phantom, the xor implementation seems slower than the division method by almost a factor of 3. On my laptop, I get somewhere around 1.8 seconds with division and 4.5 with XOR on the first dataset, and 3.9 with division and 10.9 with XOR on the second dataset.

Here's my implementation of the XOR (mainly copied from your code but changing a couple of things and deleting irrelevant lines):

Code:

```
size_t rabin_karp::xor_hasher(const std::string &s, const rabin_karp::PrimeArray &arr) {
        size_t len = s.size();
        assert(len > 0);
        unsigned int result = 0U;
        unsigned int offset = s.at(0);
        for (size_t i = 0; i < len; ++i) {
                result ^= arr.at(offset + s.at(i));
                offset = s.at(i);
        }
        return result;
}

size_t rabin_karp::xor_hasher_update(unsigned old_hash, const char &new_origin, const char &new_finale,
                const char &old_origin, const char &old_finale, const PrimeArray &arr) {
        old_hash ^= arr.at(old_origin + old_origin);
        old_hash ^= arr.at(old_origin + new_origin);
        old_hash ^= arr.at(new_origin + new_origin);
        old_hash ^= arr.at(old_finale + new_finale);
        return old_hash;
}

size_t rabin_karp::xor_find(std::string text, const std::string &pat, const rabin_karp::PrimeArray &arr) {
        size_t pat_len = pat.size();
        if (!pat_len) return 0;                                // pat is empty string
        if (text.size() < pat_len) return std::string::npos;   // no need to search if text is smaller than pat
        size_t last_text_index = text.size() - pat.size();
        int h_val = rabin_karp::xor_hasher(pat, arr);          // value we will be searching for
        int tmp = rabin_karp::xor_hasher(text.substr(0, pat_len), arr);
        size_t i = 0;
        // main testing loop
        while (true) {
                // do an exhaustive comparison only AFTER checking for equality of hash values
                if (tmp == h_val && !text.substr(i, pat_len).compare(pat)) return i;
                if (i == last_text_index) break;
                tmp = xor_hasher_update(tmp, text.at(i + 1), text.at(i + pat_len), text.at(i), text.at(i + pat_len - 1), arr);
                ++i;
        }
        return std::string::npos;                              // not found
}
```
And here's the code i used for testing:
Code:

```
inline double cpu_time() { return double(clock()) / CLOCKS_PER_SEC; }

int main() {
        srand(time(NULL));
        rabin_karp::PrimeArray arr;
        size_t modulus = rabin_karp::prime_generator(1500);
        double start, runtime;
        string str1 = random_string(10000000);
        string str2 = random_string(100000);
        str1.append(str2);

        cout << "Searching for pattern using Rabin-Karp with division method\n";
        start = cpu_time();
        size_t index = rabin_karp::div_find(str1, str2, modulus);
        runtime = cpu_time() - start;
        if (index != string::npos) cout << "index is " << index;
        else cout << "pattern not found";
        cout << endl;
        cout << "runtime with modulus " << modulus << ": " << runtime << endl;

        cout << "Searching for pattern using Rabin-Karp with XOR method\n";
        start = cpu_time();
        index = rabin_karp::xor_find(str1, str2, arr);
        runtime = cpu_time() - start;
        if (index != string::npos) cout << "index is " << index;
        else cout << "pattern not found";
        cout << endl;
        cout << "runtime with XOR method: " << runtime << endl;

        cout << "Searching for pattern using built-in find()\n";
        start = cpu_time();
        index = str1.find(str2);
        runtime = cpu_time() - start;
        if (index != string::npos) cout << "index is " << index;
        else cout << "pattern not found";
        cout << endl;
        cout << "runtime: " << runtime << endl;

        cout << "Creating a string on which hash searching should perform well\n";
        str1 = "";
        for (int i = 0; i < 500000; ++i) {
                str1.append(str2.substr(0, 50));
        }
        str1.append(str2);

        cout << "Searching for pattern using Rabin-Karp division method\n";
        start = cpu_time();
        index = rabin_karp::div_find(str1, str2, modulus);
        runtime = cpu_time() - start;
        if (index != string::npos) cout << "index is " << index;
        else cout << "pattern not found";
        cout << endl;
        cout << "runtime with modulus " << modulus << ": " << runtime << endl;

        cout << "Searching for pattern using Rabin-Karp with XOR method\n";
        start = cpu_time();
        index = rabin_karp::xor_find(str1, str2, arr);
        runtime = cpu_time() - start;
        if (index != string::npos) cout << "index is " << index;
        else cout << "pattern not found";
        cout << endl;
        cout << "runtime with XOR method: " << runtime << endl;

        cout << "Searching for pattern using built-in find()\n";
        start = cpu_time();
        index = str1.find(str2);
        runtime = cpu_time() - start;
        if (index != string::npos) cout << "index is " << index;
        else cout << "pattern not found";
        cout << endl;
        cout << "runtime: " << runtime << endl;

        return 0;
}
```