This is an algorithm for finding a pattern substring in a text string. I got it from Skiena, Design Algorithm Manual, 2008, pp. 91f. It works by hashing the substring and starts getting nice results when the substring is fairly long because it uses the hash value derived for the previous test substring by just chopping off the first character of the hash and adding on 1 new character.

My questions are how to optimize performance by choosing an appropriate modulus for hashing (big modulus should mean fewer false positives but i would expect it to make multiplication and modding slower--?) and exactly when it should start being faster than a character-by-character search. My initial experiments suggest that this implementation isn't as fast as the find() method in <string>, but it's pretty fast, even on a string of length 10M with the pattern substring of length 100K tacked on at the end.

Anyhow, here's my header file:

and here's the implementation (not very proud of my prime generator, but it works, and that's no what i'm trying to optimize)Code:#ifndef STRING_SEARCH_H #define STRING_SEARCH_H #include <string> namespace hash_search { const size_t BASE = 128; // small ASCII /** @param s String for which to return a hash value @param modulus number (preferably prime) to use for hashing @return Hash value */ int hasher(const std::string &s, const unsigned &modulus); /** There should be some kind of "golden value" for modulus The bigger modulus is, the more accurate the hasher, but the hash test will presumably also tend to get slower. @param text String to be tested for instances of pat @param pat Pattern string for which to test @param modulus for choosing the method by which to hash before exhaustive comparisons @return Index of the first instance of pat found in text npos if pat not found 0 if pat is empty string */ size_t find(const std::string &text, const std::string &pat, const unsigned &modulus); unsigned prime_generator(unsigned min); // will generate the next prime at least as big as min } #endif

Code:#include "string_search.h" #include <string> #include <cmath> #include <cassert> int hash_search::hasher(const std::string &s, const unsigned &modulus) { int result = 0; size_t len = s.size(); for (size_t i = 0; i < len; ++i) { result = (result * hash_search::BASE + s.at(i)) % modulus; } return result; } size_t hash_search::find(const std::string &text, const std::string &pat, const unsigned &modulus) { assert(modulus > 1); size_t pat_len = pat.size(); if (!pat_len) return 0; // pat is empty string if (text.size() < pat_len) return std::string::npos; // no need to search if text is smaller than pat size_t last_text_index = text.size() - pat.size(); int h_val = hash_search::hasher(pat, modulus); // value we will be searching for int tmp = hash_search::hasher(text.substr(0, pat_len), modulus); size_t i = 0; // a number we'll need for applying the Rabin-Karp algorithm: alpha ^ (m - 1) in Skiena, AMD, p. 92 size_t base_power = 1; for (int j = 0; j < pat_len - 1; ++j) { base_power = (base_power * BASE) % modulus; } // main testing loop while (true) { // do an exhaustive comparison only AFTER checking for equality of hash values if (tmp == h_val && !text.substr(i, pat_len).compare(pat)) return i; /** The following is the decisive algorithm--cf. Skiena, ADM, p. 92 It provides a short-cut for calculating the hash. Otherwise we're saving no time at all vs. exhaustive comparison */ if (i == last_text_index) break; tmp += (modulus - base_power) * text.at(i); tmp %= modulus; tmp *= BASE; tmp %= modulus; tmp += text.at(i + pat_len); tmp %= modulus; ++i; } return std::string::npos; // not found } unsigned hash_search::prime_generator(unsigned min) { assert(min > 1); unsigned max_factor, i; while (true) { max_factor = std::sqrt((double)min); for (i = 2; i <= max_factor; ++i) { if (min % i == 0) break; } if (i > max_factor) return min; // result is prime ++min; } }