Rabin-Karp algorithm

This is a discussion on the Rabin-Karp algorithm within the C++ Programming forums, part of the General Programming Boards category.

  1. #1 Registered User

    Rabin-Karp algorithm

    This is an algorithm for finding a pattern substring in a text string. I got it from Skiena, The Algorithm Design Manual, 2008, pp. 91f. It works by hashing the substring, and it starts getting nice results when the substring is fairly long, because each window's hash is derived from the previous one: chop the first character's contribution off the old hash and fold in one new character.

    My questions are how to optimize performance by choosing an appropriate modulus for hashing (a big modulus should mean fewer false positives, but I would expect it to make the multiplication and modding slower?) and exactly when it should start being faster than a character-by-character search. My initial experiments suggest that this implementation isn't as fast as the find() method in <string>, but it's pretty fast, even on a string of length 10M with a pattern substring of length 100K tacked on at the end.
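    The rolling update described above can be isolated into a small sketch; the base B, modulus m, and window length below are arbitrary illustrative values (and the arithmetic assumes a small modulus like the ones discussed here, so the intermediate products fit in an unsigned int):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Direct Rabin-Karp hash: (s[0]*B^(n-1) + ... + s[n-1]) mod m
    unsigned direct_hash(const std::string &s, unsigned B, unsigned m) {
        unsigned h = 0;
        for (std::size_t i = 0; i < s.size(); ++i)
            h = (h * B + static_cast<unsigned char>(s[i])) % m;
        return h;
    }

    // One rolling step: remove c_out (weighted base_power = B^(len-1) mod m),
    // shift everything up by B, and append c_in.
    unsigned roll(unsigned h, unsigned c_out, unsigned c_in,
                  unsigned base_power, unsigned B, unsigned m) {
        h = (h + (m - base_power) * c_out) % m; // add m - x rather than subtracting x, to stay non-negative
        return (h * B + c_in) % m;
    }
    ```

    Each step does the same work regardless of window length, which is where the win over rehashing from scratch comes from.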

    Anyhow, here's my header file:
    Code:
    #ifndef STRING_SEARCH_H
    #define STRING_SEARCH_H
    
    #include <string>
    
    namespace hash_search {
    
    	const size_t BASE = 128; // small ASCII
    	/**
    	@param s String for which to return a hash value
    	@param modulus number (preferably prime) to use for hashing
    	@return Hash value
    	*/
    	int hasher(const std::string &s, const unsigned &modulus);
    
    	/**
    	There should be some kind of "golden value" for modulus
    	The bigger modulus is, the more accurate the hasher, 
    	but the hash test will presumably also tend to get slower.
    	@param text String to be tested for instances of pat
    	@param pat Pattern string for which to test
    	@param modulus Modulus (preferably prime) used for hashing before exhaustive comparisons
    	@return Index of the first instance of pat found in text
    	npos if pat not found
    	0 if pat is empty string
    	*/
    	size_t find(const std::string &text, const std::string &pat, const unsigned &modulus); 
    
    	unsigned prime_generator(unsigned min);	// will generate the next prime at least as big as min
    }
    
    #endif
    and here's the implementation (not very proud of my prime generator, but it works, and that's not what I'm trying to optimize):

    Code:
    #include "string_search.h"
    #include <string>
    #include <cmath>
    #include <cassert>
    
    int hash_search::hasher(const std::string &s, const unsigned &modulus) {
    	int result = 0;
    	size_t len = s.size();
    
    	for (size_t i = 0; i < len; ++i) {
    		result = (result * hash_search::BASE + s.at(i)) % modulus;
    	}
    	return result;
    }
    
    size_t hash_search::find(const std::string &text, const std::string &pat, const unsigned &modulus) {
    	assert(modulus > 1);
    	size_t pat_len = pat.size();
    	if (!pat_len) return 0;		// pat is empty string
    	if (text.size() < pat_len) return std::string::npos;	// no need to search if text is smaller than pat
    
    	size_t last_text_index = text.size() - pat.size();
    	int h_val = hash_search::hasher(pat, modulus);		// value we will be searching for
    	int tmp = hash_search::hasher(text.substr(0, pat_len), modulus);
    	size_t i = 0;
    
    	// a number we'll need for applying the Rabin-Karp algorithm: alpha ^ (m - 1) in Skiena, ADM, p. 92
    	size_t base_power = 1;
    	for (size_t j = 0; j + 1 < pat_len; ++j) {
    		base_power = (base_power * BASE) % modulus;
    	}
    
    	// main testing loop
    	while (true) {
    		// do an exhaustive comparison only AFTER checking for equality of hash values
    		if (tmp == h_val && text.compare(i, pat_len, pat) == 0) return i;	// compare in place, avoiding a substr copy
    		/**
    		The following is the decisive algorithm--cf. Skiena, ADM, p. 92
    		It provides a short-cut for calculating the hash.
    		Otherwise we're saving no time at all vs. exhaustive comparison
    		*/
    		if (i == last_text_index) break;
    		tmp += (modulus - base_power) * text.at(i);
    		tmp %= modulus;
    		tmp *= BASE;
    		tmp %= modulus;
    		tmp += text.at(i + pat_len);
    		tmp %= modulus;
    		++i;
    	}
    	
    	return std::string::npos;	// not found
    }
    
    unsigned hash_search::prime_generator(unsigned min) {
    	assert(min > 1);
    	unsigned max_factor, i;
    
    	while (true) {
    		max_factor = std::sqrt((double)min);
    		for (i = 2; i <= max_factor; ++i) {
    			if (min % i == 0) break;
    		}
    		if (i > max_factor) return min;	// result is prime
    		++min;
    	}
    }

  2. #2 iMalc (Algorithm Dissector)
    So your main focus is of course the main testing loop.
    Basically the multiplications and modding take the same amount of time regardless of the operand values, sorry. In practice your plan is to just find a large value that works well and then hard-code this value, right? Then you can drop the inefficient prime generator.

    The main issue here is that you need three mods per character, and divides or mods are quite slow.
    This kind of thing will only be a win when the text to search through contains several stretches that are a prefix of the pattern being searched for (the longer, the more pronounced).

    I expect Boyer-Moore to be much faster than this one in practice.
    Alternatively, even using something as simple as an XOR hash with this technique could improve the speed over the naive algorithm, whilst adding very little overhead.

    Also, don't pass modulus by const-reference. For built-in types, passing by value is preferred and can be more efficient.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  3. #3 Registered User
    Just by running it on some big strings with different kinds of structure, I came up with an empirically "optimal" modulus of something like 1511, although on my minimal sample sets I saw very little difference using primes from about 700 up to 8000-ish (I only tested values far away from powers of 2, probably being too picky about that, but I don't think it matters).

    I gather that in addition to Boyer-Moore, Knuth-Morris-Pratt is in the running, although they do say that a variant of Rabin-Karp works for something like plagiarism detection, where you're searching for a certain pattern type rather than an exact substring match.

    Could you perhaps elaborate a little on the XOR suggestion? I'm not seeing how you could use XOR hashing in Rabin-Karp, since it isn't a homomorphism the way the modulus is (?).

  4. #4 iMalc
    You start with the XOR of pat_len characters. Then, while this hash does not match the hash of the search string, you XOR the hash with the first byte that is currently part of it, effectively removing that byte's effect on the hash, and also XOR in the next unseen byte. After doing this once you move from a hash of the bytes 0...pat_len-1 to a hash of the bytes 1...pat_len.

    I.e. let H0 be the hash of bytes X0 to Xn-1. H1 (the hash of bytes X1 to Xn) can be obtained from H0 ^ X0 ^ Xn,
    or in general Hm can be obtained from Hm-1 ^ Xm-1 ^ Xn+m-1. Sorry about the lack of subscripts there.
    We're just exploiting the fact that XOR is commutative and its own inverse here, nothing fancy.
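    That recurrence can be sketched as a complete (if collision-prone) search; everything here follows the update above, with the names being my own:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Rabin-Karp-shaped search using a plain rolling XOR hash:
    // H_m = H_{m-1} ^ X_{m-1} ^ X_{m+n-1}
    std::size_t xor_find(const std::string &text, const std::string &pat) {
        std::size_t n = pat.size();
        if (!n) return 0;
        if (text.size() < n) return std::string::npos;

        unsigned char hp = 0, ht = 0;
        for (std::size_t k = 0; k < n; ++k) {
            hp ^= static_cast<unsigned char>(pat[k]);   // hash of the pattern
            ht ^= static_cast<unsigned char>(text[k]);  // hash of the first window
        }

        std::size_t last = text.size() - n;
        for (std::size_t i = 0; ; ++i) {
            // exhaustive comparison only after the cheap hash check
            if (ht == hp && text.compare(i, n, pat) == 0) return i;
            if (i == last) break;
            ht ^= static_cast<unsigned char>(text[i]);      // byte leaving the window
            ht ^= static_cast<unsigned char>(text[i + n]);  // byte entering it
        }
        return std::string::npos;
    }
    ```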

    Admittedly that's a very poor hash. It only finds windows that possibly contain the same characters, though they could be in a totally different order.
    It could be made better by rotating each byte by its index modulo 8, for example, when both XORing it in and XORing it out, which would make "abc" hash differently from "cba".

  5. #5 phantomotap (Master Apprentice)
    That isn't the only problem with a simple `xor' approach.

    For example, searching for "banana" would effectively search for "b".

    Using any modulo anywhere could be detrimental on some hardware.

    While reading the original post, I had a thought that might extend the usefulness of a simple `xor' approach by using prime keys and a schedule that depends on the value of the preceding hash. After thinking it over for a few moments, I decided that such a scheme would make recovery difficult (read: expensive or impossible). I decided to try for a simpler scheme that depends only on the values that are being added to and removed from the existing hash.

    I wrote this code in place by ripping off of the source in the original post. It may have a few syntax bugs, but I figured a little source would get my thoughts out a lot faster than a lot of text. I didn't try to optimize anything; a complete rewrite and some explicit caching would probably get a little better performance.

    Let me know how it handles if you test it out. All you need to do is throw some primes at the `keys' array.

    Soma

    Code:
    static unsigned int keys [256 + 256];
    
    unsigned int hash_search::hasher(const std::string &s, const unsigned &modulus) {
    	size_t len = s.size();
    	unsigned int result = 0;
    	unsigned int offset = static_cast<unsigned char>(s.at(0));	// cast: plain char may be signed
    	for (size_t i = 0; i < len; ++i) {
    		result ^= keys[offset + static_cast<unsigned char>(s.at(i))];
    		offset = static_cast<unsigned char>(s.at(i));
    	}
    	return result;
    }
    
    unsigned int hash_search::hasher_update(unsigned int old_hash, unsigned char new_origin, unsigned char new_finale, unsigned char origin, unsigned char finale) {
    	old_hash ^= keys[origin + origin];		// remove the old window's leading self-pair
    	old_hash ^= keys[origin + new_origin];		// remove the pair leaving the window
    	old_hash ^= keys[new_origin + new_origin];	// add the new window's leading self-pair
    	old_hash ^= keys[finale + new_finale];		// add the pair entering the window
    	return old_hash;
    }
    
    size_t hash_search::find(const std::string &text, const std::string &pat, const unsigned &modulus) {
    	assert(modulus > 1);
    	size_t pat_len = pat.size();
    	if (!pat_len) return 0;		// pat is empty string
    	if (text.size() < pat_len) return std::string::npos;	// no need to search if text is smaller than pat
    
    	size_t last_text_index = text.size() - pat.size();
    	unsigned int h_val = hash_search::hasher(pat, modulus);		// value we will be searching for
    	unsigned int tmp = hash_search::hasher(text.substr(0, pat_len), modulus);
    	size_t i = 0;
    
    	// main testing loop (the base_power setup from the original is not needed for this hash)
    	while (true) {
    		// do an exhaustive comparison only AFTER checking for equality of hash values
    		if (tmp == h_val && text.compare(i, pat_len, pat) == 0) return i;
    		if (i == last_text_index) break;
    		tmp = hash_search::hasher_update(tmp, text.at(i + 1), text.at(i + pat_len), text.at(i), text.at(i + pat_len - 1));
    		++i;
    	}
    
    	return std::string::npos;	// not found
    }

  6. #6 phantomotap
    Oh, and there is an easy way to speed up the `hasher_update' function, by pulling two of the `xor' lines out, if you change the way the actual search function updates a little. I was just too lazy for that.

    Soma
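    One reading of that hint, offered as an assumption rather than as Soma's actual intent: if the hash is defined over adjacent pairs only (dropping the self-pair terms the original hasher starts and shifts through), each one-character shift costs exactly two table XORs instead of four. A sketch under that assumption (the keys contents and the pat_len >= 2 restriction are also assumptions):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Hypothetical key table, as in phantomotap's post; filled by the caller.
    static unsigned keys[256 + 256];

    // Pairwise-only hash: XOR of keys[s[i] + s[i+1]] over the window,
    // with no leading self-pair term.  Assumes len >= 2.
    unsigned pair_hash(const std::string &s, std::size_t pos, std::size_t len) {
        unsigned h = 0;
        for (std::size_t k = 1; k < len; ++k)
            h ^= keys[static_cast<unsigned char>(s[pos + k - 1]) +
                      static_cast<unsigned char>(s[pos + k])];
        return h;
    }

    // Shift the window [i, i+len) to [i+1, i+1+len): only two XORs.
    unsigned pair_update(unsigned h, const std::string &s, std::size_t i, std::size_t len) {
        h ^= keys[static_cast<unsigned char>(s[i]) +
                  static_cast<unsigned char>(s[i + 1])];        // pair leaving the window
        h ^= keys[static_cast<unsigned char>(s[i + len - 1]) +
                  static_cast<unsigned char>(s[i + len])];      // pair entering it
        return h;
    }
    ```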

  7. #7 Registered User
    Very interesting. If we restrict ourselves to small ASCII, keys could also be only 128 + 128, right?

    And the primes in keys all just need to be bigger than 256, correct? And maybe also keep a minimum distance of 5 from any power of two?

    And your hasher then basically keeps XOR-filtering result through primes, located quickly just by viewing each new character in the string as an index into the keys array.

    And you use the Rabin-Karp idea in that you un-XOR (by XORing) at the front of the new string and perform the XOR required at the end of it.

    I'll try to digest and test this one sometime in the next couple of days. The XOR-ing has to run way faster than mod-ing.

  8. #8 phantomotap
    and the primes in keys all just need to be bigger than 256, correct?
    No. The best potential for collision resistance is in the largest set of primes that fit in 32 bits, are spaced evenly (mostly) around the "dial", and have at least seven bits that aren't in common with any neighbor. You don't want mathematically similar strings hashing to the same value because of a quirk of your keys.

    If you restrict yourself to the first 256 primes over 256, you are effectively guaranteeing collisions, because the "distance" (in bits) between any two words may be accounted for by any other character. The greater this "distance", the better your hash will be, since it is a simple "xor" hash. If you added rotation (which can't be recovered from as easily), it would not be as real a concern, because more bits from the relevant components are better mixed.

    Soma
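    Building such a set could be done mechanically; a sketch under assumptions (the naive primality test, the >= 7-bit threshold from the post above, and the starting point are all just placeholders):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Naive trial-division primality test; fine for small candidates.
    bool is_prime(unsigned n) {
        if (n < 2) return false;
        for (unsigned d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    // Number of differing bits between two candidate keys.
    int bit_distance(unsigned a, unsigned b) {
        unsigned x = a ^ b;
        int c = 0;
        while (x) { c += x & 1u; x >>= 1; }
        return c;
    }

    // Collect primes starting at `start`, keeping only those at least
    // `min_bits` bits away from every key kept so far.
    std::vector<unsigned> spaced_primes(unsigned start, std::size_t count, int min_bits) {
        std::vector<unsigned> out;
        for (unsigned n = start; out.size() < count; ++n) {
            if (!is_prime(n)) continue;
            bool ok = true;
            for (std::size_t i = 0; i < out.size() && ok; ++i)
                ok = bit_distance(out[i], n) >= min_bits;
            if (ok) out.push_back(n);
        }
        return out;
    }
    ```

    The greedy filter is quadratic in the number of keys kept, but for a table of a few hundred entries built once at startup that hardly matters.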

  9. #9 iMalc
    Quote Originally Posted by phantomotap View Post
    That isn't the only problem with a simple `xor' approach.
    Oh certainly not.

    For example, searching for "banana" would effectively search for "b".
    I think you mean "ba"; however, that isn't directly a problem, because your hash is always over 6 characters (in this case). So you'd need 6 characters where some cancel out. E.g. "cdbcad" would do it (if that were a word), because the two c's and two d's cancel out.

    Quote Originally Posted by phantomotap View Post
    Using any modulo anywhere could be detrimental on some hardware.
    Agreed.

    XOR hashing combined with a table for randomly distributing the bits of the input would probably overcome most of the limitations of straight XOR hashing, at very little speed penalty.
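    That combination can be sketched as a table-driven rolling hash; the xorshift fill for the table below is an assumption, just something with well-spread bits:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Table mapping each input byte to a well-mixed 32-bit value.
    static unsigned table[256];

    void init_table() {
        unsigned x = 2463534242u;   // fixed seed; xorshift32 as a placeholder mixer
        for (int i = 0; i < 256; ++i) {
            x ^= x << 13; x ^= x >> 17; x ^= x << 5;
            table[i] = x;
        }
    }

    // XOR of table entries over a window; rolls with two XORs per shift.
    unsigned table_hash(const std::string &s, std::size_t pos, std::size_t len) {
        unsigned h = 0;
        for (std::size_t k = 0; k < len; ++k)
            h ^= table[static_cast<unsigned char>(s[pos + k])];
        return h;
    }

    unsigned table_roll(unsigned h, unsigned char out, unsigned char in) {
        return h ^ table[out] ^ table[in];  // drop the leaving byte, add the entering one
    }
    ```

    Note that this fixes only the bit distribution; it is still order-insensitive ("abc" and "cba" collide), so rotation or pairwise keys would still be needed on top.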

  10. #10 Registered User
    Is random bit distribution the only reason we want keys[i] to be prime?

    How's this as a method for generating keys[]:
    1) For the highest-order bits, use a deterministic process to define slots for the primes s.t. the probability distribution of the highest-order 8 bits is ideal (half have the first bit set, half don't, half of each of those groups has the second bit set, etc.). Then just pick a prime in the range of each slot.
    2) Hash those into the keys array using quadratic probing or double hashing and whatever good hash function comes to mind (I might use the multiplication method, since that's one I haven't implemented yet).

    Sound OK?

  11. #11 iMalc
    You could simply fill the array of size n with 0 to n-1 and then perform a std::random_shuffle on it.
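    Spelled out as a sketch; std::shuffle with an explicit engine stands in here for std::random_shuffle (the call iMalc names, which has since been deprecated and removed):

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Fill a table with 0..n-1, then shuffle it into a random permutation.
    std::vector<unsigned> shuffled_keys(std::size_t n, unsigned seed) {
        std::vector<unsigned> keys(n);
        for (std::size_t i = 0; i < n; ++i)
            keys[i] = static_cast<unsigned>(i);
        std::mt19937 rng(seed);                       // fixed seed for reproducibility
        std::shuffle(keys.begin(), keys.end(), rng);
        return keys;
    }
    ```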

  12. #12 Registered User
    That makes sense. I might still do it the long way, just for practice with various hashing methods.

    Do the primes I was suggesting sound reasonable for the contents of the array prior to shuffling?

  13. #13 Registered User
    First ignoring the (easy) issue of randomizing the order of the keys[] array, here's how I'm initially populating it.

    I implement it as a simple class:
    Code:
    class PrimeArray {
    	private:
    		unsigned *arr;
    		size_t size;
    		// copying is unsupported: a default copy would double-delete arr
    		PrimeArray(const PrimeArray &);
    		PrimeArray &operator=(const PrimeArray &);
    	public:
    		PrimeArray(size_t _size = BASE + BASE);
    		~PrimeArray() { delete [] arr; }
    
    		unsigned at(const size_t &i) const;
    };
    then I modify my prime_generator() and define a simple lg() function:
    Code:
    /**
    	Generates a prime in proximity to num
    	When upward is set to true (default), the returned prime is >= num
    	When upward is set to false, the returned prime is < num
    	@param num Starting point for the search
    	@param upward Search direction
    	@return a prime number
    	*/
    	unsigned prime_generator(unsigned num, bool upward = true);
    	unsigned lg(unsigned num);	// finds log base 2, rounding down
    then I implement the class constructor like this (so far; it still needs to get randomized order, and, as iMalc points out, that's a one-liner using std::random_shuffle, although I'm going to make it more complicated just for practice with various hashing techniques):
    Code:
    rabin_karp::PrimeArray::PrimeArray(size_t _size) {
    	assert(_size > 0);
    	size = _size;
    	arr = new unsigned[_size];
    
    	// find the number of higher order bits to worry about
    	unsigned high_bits = rabin_karp::lg(_size);
    
    	unsigned tmp = ~ 0U;	// all bits set
    	unsigned decrement = 1U << (32 - high_bits);
    	tmp -= decrement >> 1;
    	size_t i;
    
    	// populate the array with approximate (non-prime) values
    	for (i = 0; i < _size; ++i) {
    		arr[i] = tmp;
    		tmp -= decrement;
    	}
    
    	// populate the array with appropriate prime values
    	
    	for (i = 0; i < _size; ++i) {
    		arr[i] = prime_generator(arr[i], (i + 1) % 2);
    		if (i && arr[i] == arr[i - 1]) {
    			arr[i] = prime_generator(arr[i], false);
    		}
    	}
    	
    }

  14. #14 Registered User
    Just looking at a few entries in it, the distribution looks reasonable to me.

  15. #15 Registered User
    Oddly enough, phantom, the XOR implementation seems slower than the division method by almost a factor of 3. On my laptop I get somewhere around 1.8 s with division and 4.5 s with XOR on the first dataset, and 3.9 s with division and 10.9 s with XOR on the second dataset.

    Here's my implementation of the XOR version (mainly copied from your code, but changing a couple of things and deleting irrelevant lines):

    Code:
    unsigned rabin_karp::xor_hasher(const std::string &s, const rabin_karp::PrimeArray &arr) {
    	size_t len = s.size();
    	assert(len > 0);
    
    	unsigned result = 0U;
    	unsigned offset = static_cast<unsigned char>(s.at(0));	// cast: plain char may be signed
    	for (size_t i = 0; i < len; ++i) {
    		result ^= arr.at(offset + static_cast<unsigned char>(s.at(i)));
    		offset = static_cast<unsigned char>(s.at(i));
    	}
    
    	return result;
    }
    
    unsigned rabin_karp::xor_hasher_update(unsigned old_hash, unsigned char new_origin, unsigned char new_finale, 
    		unsigned char old_origin, unsigned char old_finale, const PrimeArray &arr) {
    	old_hash ^= arr.at(old_origin + old_origin);
    	old_hash ^= arr.at(old_origin + new_origin);
    	old_hash ^= arr.at(new_origin + new_origin);
    	old_hash ^= arr.at(old_finale + new_finale);
    	return old_hash;
    }
    
    size_t rabin_karp::xor_find(const std::string &text, const std::string &pat, const rabin_karp::PrimeArray &arr) {
    	size_t pat_len = pat.size();
    	if (!pat_len) return 0;		// pat is empty string
    	if (text.size() < pat_len) return std::string::npos;	// no need to search if text is smaller than pat
    
    	size_t last_text_index = text.size() - pat.size();
    	unsigned h_val = rabin_karp::xor_hasher(pat, arr);		// value we will be searching for
    	unsigned tmp = rabin_karp::xor_hasher(text.substr(0, pat_len), arr);
    	size_t i = 0;
    
    	// main testing loop
    	while (true) {
    		// do an exhaustive comparison only AFTER checking for equality of hash values
    		if (tmp == h_val && text.compare(i, pat_len, pat) == 0) return i;
    		if (i == last_text_index) break;
    		tmp = xor_hasher_update(tmp, text.at(i + 1), text.at(i + pat_len), text.at(i), text.at(i + pat_len - 1), arr);
    		++i;
    	}
    	
    	return std::string::npos;	// not found
    }
    And here's the code i used for testing:
    Code:
    inline double cpu_time() { return double(clock()) / CLOCKS_PER_SEC; }
    
    int main() {
    	srand(time(NULL));
    
    	rabin_karp::PrimeArray arr;
    
    	size_t modulus = rabin_karp::prime_generator(1500);
    	double start, runtime;
    
    	string str1 = random_string(10000000);
    	string str2 = random_string(100000);
    	str1.append(str2);
    	cout << "Searching for pattern using Rabin-Karp with division method\n";
    
    	start = cpu_time();
    	size_t index = rabin_karp::div_find(str1, str2, modulus);
    	runtime = cpu_time() - start;
    
    	if (index != string::npos) cout << "index is " << index;
    	else cout << "pattern not found";
    	cout << endl;
    
    	cout << "runtime with modulus " << modulus << ": " << runtime << endl;
    
    	cout << "Searching for pattern using Rabin-Karp with XOR method\n";
    
    	start = cpu_time();
    	index = rabin_karp::xor_find(str1, str2, arr);
    	runtime = cpu_time() - start;
    
    	if (index != string::npos) cout << "index is " << index;
    	else cout << "pattern not found";
    	cout << endl;
    
    	cout << "runtime with XOR method: " << runtime << endl;
    
    	cout << "Searching for pattern using built-in find()\n";
    	start = cpu_time();
    	index = str1.find(str2);
    	runtime = cpu_time() - start;
    
    	if (index != string::npos) cout << "index is " << index;
    	else cout << "pattern not found";
    	cout << endl;
    	cout << "runtime: " << runtime << endl;
    
    	cout << "Creating a string on which hash searching should perform well\n";
    	str1 = "";
    	for (int i = 0; i < 500000; ++i) {
    		str1.append(str2.substr(0, 50));
    	}
    	str1.append(str2);
    
    	cout << "Searching for pattern using Rabin-Karp division method\n";
    
    	start = cpu_time();
    	index = rabin_karp::div_find(str1, str2, modulus);
    	runtime = cpu_time() - start;
    
    	if (index != string::npos) cout << "index is " << index;
    	else cout << "pattern not found";
    	cout << endl;
    	cout << "runtime with modulus " << modulus << ": " << runtime << endl;
    
    	cout << "Searching for pattern using Rabin-Karp with XOR method\n";
    
    	start = cpu_time();
    	index = rabin_karp::xor_find(str1, str2, arr);
    	runtime = cpu_time() - start;
    
    	if (index != string::npos) cout << "index is " << index;
    	else cout << "pattern not found";
    	cout << endl;
    
    	cout << "runtime with XOR method: " << runtime << endl;
    	cout << "Searching for pattern using built-in find()\n";
    	start = cpu_time();
    	index = str1.find(str2);
    	runtime = cpu_time() - start;
    
    	if (index != string::npos) cout << "index is " << index;
    	else cout << "pattern not found";
    	cout << endl;
    	cout << "runtime: " << runtime << endl;
    
    	return 0;
    }

