searching vector using maps

**MutantJohn** · 09-21-2014

I feel like the biggest challenge in this problem is searching the keys for substrings or partial matches of what the user enters into a search.

If the code was supposed to take exact and literal text then this problem would be quite simple but as it stands now, I think the biggest obstacle to overcome is how we search for the proper key.

A char is any one of 256 values. There's a very simple hash method I learned where you take an int and reverse the bits, adding everything up. This is good because it's degenerate in the sense that a + b + c will hash to the same value as c + a + b or any permutation thereof.

Adding up the number of characters in the string can easily be stored in an unsigned type or an unsigned long or even the long long!

Where am I going with all this? Well, my mind is a bit scattered atm but I read this paper on hashing on the GPU and I think the algorithm is simple enough that we can apply it to this situation.

Here's the paper : http://idav.ucdavis.edu/~dfalcant//d...ssertation.pdf

The section of interest is the chaining section because of how it works. The idea here is to use buckets. Each key denotes a bucket by a unique ID (hence hashing the keys themselves). We hash the keys to make a bucket and then we fill the bucket with the proper contents.

I think if we used the simple hash function I posted above, we would be able to find some good partial matches. I'm going to test right now and post back with results. The idea is, for a given key, we hash it and compare it to the bucket IDs and for any buckets in the general vicinity, we check for how well it matches.

**laserlight** · 09-21-2014

Originally Posted by MutantJohn

I feel like the biggest challenge in this problem is searching the keys for substrings or partial matches of what the user enters into a search.

Err... yeah. That is precisely what Vincent Dotts was asking for help on.

Originally Posted by MutantJohn

There's a very simple hash method I learned where you take an int and reverse the bits, adding everything up.

I am not sure what you mean by "reverse the bits, adding everything up". Did you mean "extract the bits, adding everything up"?

Originally Posted by MutantJohn

A char is any one of 256 values. (...) This is good because it's degenerate in the sense that a + b + c will hash to the same value as c + a + b or any permutation thereof.

Adding up the number of characters in the string can easily be stored in an unsigned type or an unsigned long or even the long long!
(...)
I think if we used the simple hash function I posted above, we would be able to find some good partial matches.

You have the same problem in that you have to compute the hash when you insert into the hash table, but you are searching based on prefix after the insertions. So, if you add the value of the characters to get your hash, you then only get matches for those strings that are permutations of the given prefix entered. So, for "abcde", you would also need to compute the hashes for "a", "ab", "abc", "abcd", all mapping to "abcde". Otherwise, if the user enters "abc", you can find various permutations of "abc", but not "abcde".

Originally Posted by MutantJohn

The section of interest is the chaining section because of how it works. (...) Each key denotes a bucket by a unique ID (hence hashing the keys themselves). We hash the keys to make a bucket and then we fill the bucket with the proper contents.

Err... this idea is basically one of the standard collision resolution strategies for hash tables. What makes it worthy of inclusion in that dissertation is that the author investigated how to avoid typical linked list implementations in order to have "Efficient Hash Tables on GPU".

**MutantJohn** · 09-21-2014

Well, I thought the implementation was simple enough it could easily be adapted to a CPU. And it'd be relatively quick too because it's all contiguous in memory. This is one of those rare algorithms where I think it'd run pretty quickly on the CPU as well (avoids branching, plays to cache lines).

And I tried to make a small implementation but I think it failed. The results I get are pretty random and you're right about needing to make hashes for all possible substrings which means the algorithm isn't well-designed. But such is life.

Edit : Again, I only suggested the paper because the algorithm is incredibly simple and performant. It's a good mix of things that would take like a figurative 5 minutes to adapt to one's needs. It was just an idea, is all.

**iMalc** · 09-22-2014

Perhaps you're after the Levenshtein Distance algorithm.
That's the algorithm used to determine suggestions for misspelled words. It gives you a score on how different two strings are.

Perhaps you also want to be able to search for substrings or nearly-substrings, and that complicates it a bit. It would be really cool to modify it such that the strings:
abcdefghijklmnop
and
efhij
gave almost as close a match as
abcdefihjklmnop
That would allow for a very powerful search facility!

**Vincent Dotts** · 09-22-2014

Im guessing your a 49er fan @daved?

but thank you low_bound was better for my program then find but i still have the problem with lower and upper case. if i type in Car it will find Carr but if i type in car it will say not in directory

**Vincent Dotts** · 09-22-2014

@daved Im guessing your a 49er fan?

but thank you low_bound was better for my program then find but i still have the problem with lower and upper case. if i type in Car it will find Carr but if i type in car it will say not in directory

**Daved** · 09-22-2014

>> i still have the problem with lower and upper case
The solution that works best for me is to convert case before storage and before lookup. Making the comparison work by ignoring case is more difficult than just making everything use the same case at the beginning and using the default comparison.

>> Im guessing your a 49er fan?
After the last two weeks I'm reconsidering...

**Vincent Dotts** · 09-22-2014

but for some reason if i just tell it to search M it only pulls up one of them not all three?

But ya they seem like they get penalized every other play.

**Daved** · 09-22-2014

>> but for some reason if i just tell it to search M it only pulls up one of them not all three?
You've got to change the actual strings you're storing in the map, and then convert the user's input before you use it with lower_bound to search the map.

**Daved** · 09-22-2014

>> For your sanity you might want to apply a different approach for some searches

What would the benefit of that approach be? The more general solution works with the same inputs that the code above does. More unnecessary code to maintain seems like it might not be good for one's sanity.

Thread: searching vector using maps

Thread Tools

Search Thread

Display

Thank you

Alright im going to try that @daved

Similar Threads

Maps - referencing individual members numerically (as with a vector)

Iterating through a vector of maps and a map of vectors...

Searching for an int in a vector

searching an element of a vector

Searching the contents of a vector

Tags for this Thread