Thread: searching vector using maps

  1. #16
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    I feel like the biggest challenge in this problem is searching the keys for substrings or partial matches of what the user enters into a search.

    If the code was supposed to take exact and literal text then this problem would be quite simple but as it stands now, I think the biggest obstacle to overcome is how we search for the proper key.

    A char is any one of 256 values. There's a very simple hash method I learned where you take an int and reverse the bits, adding everything up. This is good because it's degenerate in the sense that a + b + c will hash to the same value as c + a + b or any permutation thereof.

    Adding up the number of characters in the string can easily be stored in an unsigned type or an unsigned long or even the long long!

    Where am I going with all this? Well, my mind is a bit scattered atm but I read this paper on hashing on the GPU and I think the algorithm is simple enough that we can apply it to this situation.

    Here's the paper : http://idav.ucdavis.edu/~dfalcant//d...ssertation.pdf

    The section of interest is the chaining section because of how it works. The idea here is to use buckets. Each key denotes a bucket by a unique ID (hence hashing the keys themselves). We hash the keys to make a bucket and then we fill the bucket with the proper contents.

    I think if we used the simple hash function I posted above, we would be able to find some good partial matches. I'm going to test right now and post back with results. The idea is, for a given key, we hash it and compare it to the bucket IDs and for any buckets in the general vicinity, we check for how well it matches.

  2. #17
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    27,503
    Quote Originally Posted by MutantJohn
    I feel like the biggest challenge in this problem is searching the keys for substrings or partial matches of what the user enters into a search.
    Err... yeah. That is precisely what Vincent Dotts was asking for help on.

    Quote Originally Posted by MutantJohn
    There's a very simple hash method I learned where you take an int and reverse the bits, adding everything up.
    I am not sure what you mean by "reverse the bits, adding everything up". Did you mean "extract the bits, adding everything up"?

    Quote Originally Posted by MutantJohn
    A char is any one of 256 values. (...) This is good because it's degenerate in the sense that a + b + c will hash to the same value as c + a + b or any permutation thereof.

    Adding up the number of characters in the string can easily be stored in an unsigned type or an unsigned long or even the long long!
    (...)
    I think if we used the simple hash function I posted above, we would be able to find some good partial matches.
    You have the same problem in that you have to compute the hash when you insert into the hash table, but you are searching based on prefix after the insertions. So, if you add the value of the characters to get your hash, you then only get matches for those strings that are permutations of the given prefix entered. So, for "abcde", you would also need to compute the hashes for "a", "ab", "abc", "abcd", all mapping to "abcde". Otherwise, if the user enters "abc", you can find various permutations of "abc", but not "abcde".

    Quote Originally Posted by MutantJohn
    The section of interest is the chaining section because of how it works. (...) Each key denotes a bucket by a unique ID (hence hashing the keys themselves). We hash the keys to make a bucket and then we fill the bucket with the proper contents.
    Err... this idea is basically one of the standard collision resolution strategies for hash tables. What makes it worthy of inclusion in that dissertation is that the author investigated how to avoid typical linked list implementations in order to have "Efficient Hash Tables on GPU".
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #18
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Well, I thought the implementation was simple enough it could easily be adapted to a CPU. And it'd be relatively quick too because it's all contiguous in memory. This is one of those rare algorithms where I think it'd run pretty quickly on the CPU as well (avoids branching, plays to cache lines).

    And I tried to make a small implementation but I think it failed. The results I get are pretty random and you're right about needing to make hashes for all possible substrings which means the algorithm isn't well-designed. But such is life.

    Edit : Again, I only suggested the paper because the algorithm is incredibly simple and performant. It's a good mix of things that would take like a figurative 5 minutes to adapt to one's needs. It was just an idea, is all.
    Last edited by MutantJohn; 09-21-2014 at 02:19 PM.

  4. #19
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Perhaps you're after the Levenshtein Distance algorithm.
    That's the algorithm used to determine suggestions for misspelled words. It gives you a score on how different two strings are.

    Perhaps you also want to be able to search for substrings or nearly-substrings, and that complicates it a bit. It would be really cool to modify it such that the strings:
    abcdefghijklmnop
    and
    efhij
    gave almost as close a match as
    abcdefihjklmnop
    That would allow for a very powerful search facility!
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  5. #20
    Registered User
    Join Date
    Sep 2014
    Location
    sacramento, CA
    Posts
    6

    Thank you

    Im guessing your a 49er fan @daved?

    but thank you low_bound was better for my program then find but i still have the problem with lower and upper case. if i type in Car it will find Carr but if i type in car it will say not in directory
    Last edited by Vincent Dotts; 09-22-2014 at 03:12 PM.

  6. #21
    Registered User
    Join Date
    Sep 2014
    Location
    sacramento, CA
    Posts
    6
    @daved Im guessing your a 49er fan?

    but thank you low_bound was better for my program then find but i still have the problem with lower and upper case. if i type in Car it will find Carr but if i type in car it will say not in directory

  7. #22
    Registered User
    Join Date
    Jan 2005
    Posts
    7,364
    >> i still have the problem with lower and upper case
    The solution that works best for me is to convert case before storage and before lookup. Making the comparison work by ignoring case is more difficult than just making everything use the same case at the beginning and using the default comparison.

    >> Im guessing your a 49er fan?
    After the last two weeks I'm reconsidering...

  8. #23
    Registered User
    Join Date
    Sep 2014
    Location
    sacramento, CA
    Posts
    6

    Alright im going to try that @daved

    but for some reason if i just tell it to search M it only pulls up one of them not all three?

    But ya they seem like they get penalized every other play.

  9. #24
    Registered User
    Join Date
    Jan 2005
    Posts
    7,364
    >> but for some reason if i just tell it to search M it only pulls up one of them not all three?
    You've got to change the actual strings you're storing in the map, and then convert the user's input before you use it with lower_bound to search the map.

  10. #25
    Registered User
    Join Date
    Jan 2005
    Posts
    7,364
    >> For your sanity you might want to apply a different approach for some searches

    What would the benefit of that approach be? The more general solution works with the same inputs that the code above does. More unnecessary code to maintain seems like it might not be good for one's sanity.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 4
    Last Post: 02-10-2014, 11:07 AM
  2. Iterating through a vector of maps and a map of vectors...
    By alaa_137 in forum C++ Programming
    Replies: 7
    Last Post: 08-18-2012, 08:46 AM
  3. Searching for an int in a vector
    By go_loco in forum C++ Programming
    Replies: 14
    Last Post: 04-22-2010, 06:11 PM
  4. searching an element of a vector
    By strickey in forum C++ Programming
    Replies: 4
    Last Post: 02-15-2005, 08:29 AM
  5. Searching the contents of a vector
    By Sparkle1984 in forum C++ Programming
    Replies: 4
    Last Post: 10-16-2003, 07:14 AM

Tags for this Thread