Thread: Rabin-Karp algorithm

  1. #16
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    the xor implementation seems slower by almost a factor of 3 than the division method
    That is so far from the realm of possibility that it isn't worth discussing.

    Here's my implementation of the XOR
    And it is garbage. Look at the massive overhead you've imposed!

    mainly copied from your code but changing a couple of things
    Don't dare imply that filth came from me!

    The only thing you've kept from my code is the idea.



    The algorithm is always an "O(n)" job at best. What did you expect from a different implementation? The only easy improvement is a slight one, from reducing instructions or using "cheaper" instructions. You've done neither of those.
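
    (For context, the usual modulus-based rolling step looks roughly like this. This is a sketch with assumed BASE and PRIME values, not code posted in this thread; it just shows that each shift of the window is a fixed handful of instructions, which is why only instruction-level tuning is on the table.)

    Code:
    const unsigned BASE  = 256;      // alphabet size (assumed)
    const unsigned PRIME = 1000003;  // hash modulus (assumed)
    
    // high == BASE^(m-1) % PRIME for a pattern of length m, precomputed once
    unsigned roll(unsigned hash, char leaving, char entering, unsigned high)
    {
    	hash = (hash + PRIME - (unsigned char)leaving * high % PRIME) % PRIME; // drop the old character
    	hash = (hash * BASE + (unsigned char)entering) % PRIME;                // bring in the new one
    	return hash;
    }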

    Soma

  2. #17
    Registered User
    Join Date
    May 2009
    Posts
    242
    uh... phantom, aside from some names and deleting unused code, the only thing i really changed was making the keys array into a class so that i could use the constructor to make it fit the desired criteria.

    passing a pointer to the object into the functions surely can't add anything substantial to runtime.

    i really think the difference is that you're xor-ing full 32-bit numbers, whereas with the division method you're mod-ing only relatively small numbers.

    but if you think that making the array into a class (which adds almost no extra instructions at the points that would influence the runtime of the find() operation) is what's slowing it down, feel free to implement and compare... i might try it myself if i have the time, as it should be easy to construct from the code i already have. i'm just not seeing why the PrimeArray class would slow it down much at all. the only complicated thing it does is set up the various primes, and that happens before i start measuring the runtime.
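
    (A minimal sketch of the kind of wrapper being described here; the class and member names are made up, not the actual PrimeArray from this thread. The point it illustrates is that all the expensive setup happens once in the constructor, and access during find() is just an inlined operator[].)

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <vector>
    
    class KeyTable {
    public:
    	explicit KeyTable(std::size_t count)
    		: keys_(count)
    	{
    		// placeholder fill: the real table would hold primes chosen to meet
    		// the criteria discussed in this thread
    		for (std::size_t i = 0; i < count; ++i)
    			keys_[i] = static_cast<unsigned>(2 * i + 3);
    		std::random_shuffle(keys_.begin(), keys_.end());  // setup cost, paid before timing
    	}
    
    	unsigned operator[](std::size_t i) const { return keys_[i]; }
    
    private:
    	std::vector<unsigned> keys_;
    };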

  3. #18
    Registered User
    Join Date
    May 2009
    Posts
    242
    phantom, using your exact code (deleting superfluous lines from my original that have no relevance for xor, and running it with a global array rather than the PrimeArray class), it improves slightly (about as i expected) but is still slower than the division method. Sample run:

    On the first random string and pattern:
    Div method: 1.691
    xor method with PrimeArray class: 4.472
    your code: 3.32

    On the second random string and pattern:
    Div method: 3.866
    xor method with PrimeArray class: 10.899
    your code: 7.485

    so, your code takes roughly twice as long as the division method as implemented. also: no need to get all upset about it. i expected it to be faster myself, and the only explanation i could come up with for it not being faster is the full 32-bit xor vs. mod-ing and multiplying relatively small (at most 9- to 10-bit) numbers.

    i have no doubt that xor is faster than division on a level playing field.

  4. #19
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    I'll make it easy for you; either your tests or your results are flawed. You either did not test the source I posted or you did not fill the key array as per my instructions.

    [Edit]
    I'll even go one step further and explain how you have failed to follow my instructions, assuming those results are accurate.

    Your chosen keys in combination are producing many more false positives than the "modulus" method produces. The combination of values is requiring the "xor" implementation to try the linear comparison far more often than is necessary.
    [/Edit]
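
    (For reference, a self-contained toy, not code from this thread, that makes the point measurable: every window whose hash collides with the pattern hash forces the linear comparison, so weak keys translate directly into extra work. It uses a deliberately weak hash, a plain byte sum, so that collisions actually show up on a short string.)

    Code:
    #include <cstddef>
    #include <iostream>
    #include <string>
    
    int main()
    {
    	const std::string text = "abracadabra abracadabra";
    	const std::string pat  = "cada";
    	const std::size_t m = pat.size();
    
    	// hash of the pattern and of the first window (here: sum of bytes)
    	unsigned pat_hash = 0, win_hash = 0;
    	for (std::size_t i = 0; i < m; ++i) {
    		pat_hash += (unsigned char)pat[i];
    		win_hash += (unsigned char)text[i];
    	}
    
    	std::size_t verifies = 0, matches = 0;
    	for (std::size_t i = 0; i + m <= text.size(); ++i) {
    		if (win_hash == pat_hash) {           // hash hit: must do the linear compare
    			++verifies;
    			if (text.compare(i, m, pat) == 0)
    				++matches;
    		}
    		if (i + m < text.size())              // slide the window by one character
    			win_hash += (unsigned char)text[i + m] - (unsigned char)text[i];
    	}
    
    	std::cout << (verifies - matches) << " false positives, "
    	          << matches << " real matches\n";
    	return 0;
    }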

    I can imagine four dependent uses of "xor" being insignificantly slower than three dependent uses of "modulus". Integer division is very fast on desktop processors these days.

    I can't imagine four dependent uses of "xor" being significantly slower than three dependent uses of "modulus". It doesn't happen. It is easy and cheap to implement a very fast "xor".

    no need to get all upset about it.
    I'm not. I'm upset that you attributed that parody of design to me.

    I couldn't care less if the "xor" approach was slower. The use of the array has a small chance of padding the difference between the two implementations. With that in mind, the overhead of the pointless design matters that much more when comparing them.

    Soma

  5. #20
    Registered User
    Join Date
    May 2009
    Posts
    242
    Quote Originally Posted by phantomotap View Post
    No. The best potential for collision resistance is in the largest set of primes that fit in 32 bits, are spaced evenly (mostly) around the "dial", and have at least seven bits that aren't in common with any neighbor. You don't want mathematically similar strings hashing to the same value because of a quirk of your keys.

    If you restrict yourself to the first 256 primes over 256, you are effectively guaranteeing collisions because the "distance" (in bits) between any two words may be accounted for by any other character. The greater this "distance", the better your hash will be, since it is a simple "xor" hash. If you added rotation (which can't be recovered from as easily) it would not be a real concern, because more bits from the relevant components are better mixed.

    Soma
    the only criterion i didn't test for was the one about having at least 7 bits not in common with any neighbor. i did get a uniform distribution, and i just used std::random_shuffle() after all and left it at that.
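
    (a sketch of what that missing check might look like, under one reading of the criterion: adjacent keys should differ in at least seven bit positions, i.e. the popcount of their xor is at least 7. the helper names here are made up, not from the actual code.)

    Code:
    #include <cstddef>
    #include <vector>
    
    // number of bit positions in which a and b differ
    static int bit_distance(unsigned a, unsigned b)
    {
    	unsigned x = a ^ b;
    	int count = 0;
    	while (x) {
    		x &= x - 1;   // clear the lowest set bit
    		++count;
    	}
    	return count;
    }
    
    // true if every pair of neighbouring keys differs in at least 7 bits
    static bool neighbours_ok(const std::vector<unsigned> &keys)
    {
    	for (std::size_t i = 1; i < keys.size(); ++i)
    		if (bit_distance(keys[i - 1], keys[i]) < 7)
    			return false;
    	return true;
    }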

    however, as for false positives (which i'm glad you brought up, because i'd been meaning to test for that), here's the breakdown, again including runtimes on a sample run:

    first string and pattern:
    division method: 6561 false positives (string is 10M+ characters long here), time 1.671
    xor with PrimeArray class: 619 false positives, time 4.425
    xor with global keys array: 619 false positives (the array contents are the same as the contents of the array in the PrimeArray class), time 3.068

    second string and pattern:
    division method: 65 false positives, time 3.847
    xor with PrimeArray class: 12 false positives, time 10.871
    xor with keys array: 12 false positives, time 7.584

    the false positives don't seem to be holding things up much.

    as for attributing the code to you: except for the keys array, which you didn't code, i really just copy-pasted, then changed some names so that it wouldn't conflict with code i already had. i also very seriously doubt that quirks in the keys array are slowing things down in any significant way (a different random ordering of the same primes comes up every time i run the program, but the results have always been pretty similar).

  6. #21
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Oh, then if I'm reading this right, the XOR method is giving fewer false positives. That means it's doing a better job than the method with the mods in it. I would have thought that the only reason the XOR method could be slower would be if the false positive rate were significantly higher, meaning it does more work overall. If that's not the case, then I agree that something is very odd with those numbers. You should be able to perform on the order of 20 XORs in the time it takes for just one mod operation.

    I've noticed you make extensive use of .at() rather than simply array indexing. Whilst .at() can be safer when there's some chance that the index might be out of bounds, don't go overboard with defensive programming. Almost any use of .at() I see in code says to me that the person who wrote it was unsure of what they were doing and didn't know whether they had their boundary conditions etc. correct. It's better to know you have them correct and then safely just use array indexing.
    Excessive use of .at() is like walking around wrapped in a dozen layers of bubble-wrap all day in case you fall over, instead of fixing the step with that loose board.

    Also, please get out of the habit of passing built-in types by const reference. Doing that for a char is just ridiculous and potentially wasteful.
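
    (A tiny illustration of both points, using hypothetical functions rather than the posted code: plain indexing where the bounds are already known to be correct, and a char taken by value.)

    Code:
    #include <cstddef>
    #include <string>
    
    // take the char by value; copying a char is as cheap as it gets
    unsigned hash_step(unsigned h, char c)
    {
    	return h ^ static_cast<unsigned char>(c);
    }
    
    unsigned hash_all(const std::string &s)
    {
    	unsigned h = 0;
    	for (std::size_t i = 0; i < s.size(); ++i)
    		h = hash_step(h, s[i]);   // s[i] rather than s.at(i): i is known to be in range
    	return h;
    }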
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  7. #22
    Registered User
    Join Date
    May 2009
    Posts
    242
    the .at() is already eliminated in the global-array version (which is what brings it down to double the time of the div method), at least for the key array. all the other instances (the ones indexing into the string) are identical across all 3 versions. so, while removing those should speed things up a little, it will speed all of them up uniformly.

  8. #23
    Registered User
    Join Date
    May 2009
    Posts
    242
    substituting operator[] for at() in all character retrievals from the string, here's a sample run:

    first string and pattern:
    div method: 1.679
    xor method with global array: 3.087

    second string and pattern:
    div method: 3.89
    xor method with global array: 7.528

  9. #24
    Registered User
    Join Date
    May 2009
    Posts
    242
    here's my current code for hasher() and hasher_update():

    Code:
    // keys[] here is the global table of shuffled primes from the earlier posts
    size_t phantom::xor_hasher(const std::string &s) {
    	size_t len = s.size();
    	assert (len > 0);
    
    	unsigned int result = 0U;
    
    	// pairwise mixing: each step xors in the key for (previous char + current char);
    	// the very first step pairs s[0] with itself
    	unsigned int offset = s[0];
    	for (size_t i = 0; i < len; ++i) {
    		result ^= keys[offset + s[i]];
    		offset = s[i];
    	}
    
    	return result;
    }
    
    size_t phantom::xor_hasher_update(unsigned old_hash, const char &new_origin, const char &new_finale, 
    		const char &old_origin, const char &old_finale) {
    	old_hash ^= keys[old_origin + old_origin];   // drop the self-pair of the character leaving the window
    	old_hash ^= keys[old_origin + new_origin];   // drop the pair it formed with its successor
    	old_hash ^= keys[new_origin + new_origin];   // add the self-pair of the new first character
    	old_hash ^= keys[old_finale + new_finale];   // add the pair formed by the character entering the window
    	return old_hash;
    }
    if we want to improve speed here, why not make the keys array only half the current size (BASE rather than BASE + BASE) and change hasher to this:

    Code:
    size_t phantom::xor_hasher(const std::string &s) {
    	size_t len = s.size();
    	assert (len > 0);
    
    	unsigned int result = 0U;
    	
    	// unsigned int offset = s[0];
    	for (size_t i = 0; i < len; ++i) {
    		result ^= keys[s[i]];   // offset eliminated
    		// offset = s[i];
    	}
    	
    	return result;
    }
    then hasher_update() gets much shorter and faster.
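
    roughly, that shorter update would look something like this (just a sketch of what's described above, not code that was actually posted; it indexes the now BASE-sized keys table directly by character):

    Code:
    size_t phantom::xor_hasher_update(unsigned old_hash, char old_origin, char new_finale) {
    	old_hash ^= keys[old_origin];   // remove the character leaving the window
    	old_hash ^= keys[new_finale];   // add the character entering the window
    	return old_hash;
    }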

    ah, ok, that won't work (although i'm leaving the post up anyway), because then permutations of the same string get hashed to the same value.

    we could maybe cut off 1 line by setting offset = 0 for the first character, but that's only going to be a small speed improvement.
