Thread: string indexing..

  1. #1
    Registered User
    Join Date
    Jan 2005
    Posts
    108

    string indexing..

    Well, since there's no forums for algorithms questions..

    Actually, I'm in the middle of coding it, but I thought someone might be able to tell me a better way to do this.

    Basically, I have a lot of short strings that consist of words such as "adneosine", or "6-hydroxyflavone". I need a system which can index these words so a substring search (let's assume case-insensitive) can yield all the matching words quickly..

    What I'm doing is that, for every word, I will generate an int, which is a summary of the letters of the alphabet that is present in that word. For example, if the word is "abed", the int will be "1101100000000....". Then, as a preliminary check on whether a substring a exists in string b, I will check if (a.intsummary & b.intsummary == a.intsummary), which will help me prune out negatives without having to do a lot of calculations.

    So yeah, that's the plan (which I'm implementing right now). Will probably be using a 64bit integer so it can also keep numbers and most punctuations.. but yeah. What is this algorithm called (I'm assuming someone already came up with this)? Any better ways?

  2. #2
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Great idea.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  3. #3
    Registered User slingerland3g's Avatar
    Join Date
    Jan 2008
    Location
    Seattle
    Posts
    603
    When you say 'summary' is this a true representation of your word you are converting to int? When you mask bits 'a.intsummary & b.intsummary' will never equal a.intsummary unless the ints are exact match.

    I am thinking that if you are really needing to search up a family of nucleosides or chemical compounds that relate, your best bet is to make use of substr(). Why do all this conversion?


    To add, there is also strtok().

    Forums here go in to great detail on their usages.
    Last edited by slingerland3g; 12-16-2009 at 10:29 AM.

  4. #4
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by slingerland3g View Post
    When you say 'summary' is this a true representation of your word you are converting to int? When you mask bits 'a.intsummary & b.intsummary' will never equal a.intsummary unless the ints are exact match.
    Which is what makes it a great idea. If there are many many strings to check against, you will have reduced the number of expensive string matching function calls by a whole lot. So rather than calling strstr on all of them, you just have a set of ints representing the words in that string. If none of them match, the word cannot be in the string, and you do not have to bother with strstr at all.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    If it's really critically time sensitive, and will be accessed more than 10,000 times, I'd use a hash. A bother to code, but you can't beat it for speed.

    Hash function - Wikipedia, the free encyclopedia

    For something simpler, I'd sort the words using C's qsort(), and then either do a binary search for any word you are looking for, on that array (or file), or I'd code up an index function (faster than binary searching, but also more work).

    All three of the above ways of searching are fast. No slow pokes in the bunch:

    1) Fastest - Hash
    2) Faster - Index
    3) Fast - Binary Search

    I'm unsure what the characteristics of your own search would be. It sounds like a homegrown hash technique, which I'd not recommend for a beginner. All that depends on your own skill, and patience in proving it's accurate.

  6. #6
    Registered User slingerland3g's Avatar
    Join Date
    Jan 2008
    Location
    Seattle
    Posts
    603
    Quote Originally Posted by Adak View Post
    If it's really critically time sensitive, and will be accessed more than 10,000 times, I'd use a hash. A bother to code, but you can't beat it for speed.

    Hash function - Wikipedia, the free encyclopedia

    For something simpler, I'd sort the words using C's qsort(), and then either do a binary search for any word you are looking for, on that array (or file), or I'd code up an index function (faster than binary searching, but also more work).

    All three of the above ways of searching are fast. No slow pokes in the bunch:

    1) Fastest - Hash
    2) Faster - Index
    3) Fast - Binary Search

    I'm unsure what the characteristics of your own search would be. It sounds like a homegrown hash technique, which I'd not recommend for a beginner. All that depends on your own skill, and patience in proving it's accurate.

    Yes! I was somewhat thinking along those lines but was uncertain. Though comparing strings and then searching for substrings within each will be the most time consuming. Using strcmp() for this is ugly!

    Possible function to study for this very thing using hash techniques. (Partitioned version of qsort I presume)
    Rabinā€“Karp string search algorithm - Wikipedia, the free encyclopedia

  7. #7
    Registered User jeffcobb's Avatar
    Join Date
    Dec 2009
    Location
    Henderson, NV
    Posts
    875

    Me too post

    Sorry to sound like an echo but this is what a hash table is for IMHO. The only thing you need to watch for are collisions. snippets.org has some sample simple hashing routines to get you started.
    C/C++ Environment: GNU CC/Emacs
    Make system: CMake
    Debuggers: Valgrind/GDB

  8. #8
    Registered User slingerland3g's Avatar
    Join Date
    Jan 2008
    Location
    Seattle
    Posts
    603
    Quote Originally Posted by jeffcobb View Post
    Sorry to sound like an echo but this is what a hash table is for IMHO. The only thing you need to watch for are collisions. snippets.org has some sample simple hashing routines to get you started.

    Yeah, I have not dealt much with hashing algorithms, but have dealt with many sort and sort partitioning algorithms in my studies. So using hashing was eluding me as a suggestion. My direction was to just sort the list of strings, perhaps using a partitioning sort method if you know what substring matches within a particular key, this will speed up the search quite a bit.

  9. #9
    Registered User
    Join Date
    Jan 2005
    Posts
    108
    Wait, you mean there's a way to search for "romoolus" when you enter a search string "moo" using hash tables? Sorry, I should have made it clear that I'm looking for indexing for preparing for substring search..

  10. #10
    Registered User jeffcobb's Avatar
    Join Date
    Dec 2009
    Location
    Henderson, NV
    Posts
    875

    (at least) my mistake

    Quote Originally Posted by underthesun View Post
    Wait, you mean there's a way to search for "romoolus" when you enter a search string "moo" using hash tables? Sorry, I should have made it clear that I'm looking for indexing for preparing for substring search..
    I saw "fast string lookup" and jumped to the textbook solution; sorry, seriously. So if you wanted a substring lookup, an iterative substring (Boyer Moore maybe) is all the answer I can come up with. Now if you want something faster (the above is one step above a brute-force search) then maybe you can store the search key as a soundex and then search for a subset of *that* by making a soundex out of the search key if possible. You are still more or less brute-force searching but you will have a more defined subset to search for. You can find a simple soundex algo in the famous snippets.org...

    Sorry for the misunderstanding; I am really curious to see what the more experienced folks come up with.

    As a side note, I have never heard of the need for this kind of thing in an actual application; may I ask for a but more information on the problem domain?
    Last edited by jeffcobb; 12-16-2009 at 07:23 PM. Reason: better defined solution
    C/C++ Environment: GNU CC/Emacs
    Make system: CMake
    Debuggers: Valgrind/GDB

  11. #11
    Registered User
    Join Date
    Jan 2005
    Posts
    108
    As a side note, I have never heard of the need for this kind of thing in an actual application; may I ask for a but more information on the problem domain?
    I'm building an interactive visual query system for proteomic data (in layman's terms, biological data), and a lot of the data has things that can be searched for. Sometimes the words that can be searched are simple (adenosine, guanine, blah), but sometimes they are complicated (e.g 6-guaninonic-3-pyruberatin-adipose or something like that).

    Searching something that starts with pyru- and then being able to refine that search results with further checks is what I'm looking for.

    Anyhow, thanks for that boyer-moore info, I'm gonna take a better look

  12. #12
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    Or if you have many strings you may want to use a suffix trie... Suffix tree - Wikipedia, the free encyclopedia

    Plus you'll be saving memory if you compress it . All your requirements point to a suffix trie, for "matching" it will just depend on which path you take through the trie. Matching xyz- and -xyz is easy, among other things.

    Hint: Build the suffix trie, and then compress it. Compressing from the start is rather challenging.
    Last edited by zacs7; 12-17-2009 at 12:48 AM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 8
    Last Post: 04-25-2008, 02:45 PM
  2. We Got _DEBUG Errors
    By Tonto in forum Windows Programming
    Replies: 5
    Last Post: 12-22-2006, 05:45 PM
  3. Something is wrong with this menu...
    By DarkViper in forum Windows Programming
    Replies: 2
    Last Post: 12-14-2002, 11:06 PM
  4. Classes inheretance problem...
    By NANO in forum C++ Programming
    Replies: 12
    Last Post: 12-09-2002, 03:23 PM
  5. Warnings, warnings, warnings?
    By spentdome in forum C Programming
    Replies: 25
    Last Post: 05-27-2002, 06:49 PM