string indexing..

**underthesun** · 12-16-2009

Well, since there's no forums for algorithms questions..

Actually, I'm in the middle of coding it, but I thought someone might be able to tell me a better way to do this.

Basically, I have a lot of short strings that consist of words such as "adenosine" or "6-hydroxyflavone". I need a system that can index these words so that a substring search (let's assume case-insensitive) yields all the matching words quickly.

What I'm doing is, for every word, generating an int that summarizes which letters of the alphabet are present in that word. For example, if the word is "abed", the int will be "1101100000000....". Then, as a preliminary check on whether a substring a exists in string b, I will check if ((a.intsummary & b.intsummary) == a.intsummary), which will help me prune out negatives without having to do a lot of calculations.

So yeah, that's the plan (which I'm implementing right now). I'll probably be using a 64-bit integer so it can also cover numbers and most punctuation. What is this algorithm called (I'm assuming someone already came up with this)? Any better ways?
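
A minimal C sketch of the letter-summary check described in this post. The bit layout (letters case-folded into bits 0-25, digits in bits 26-35, one shared bit for everything else) and the names `char_summary` and `may_contain` are assumptions for illustration, not something the post specifies. Bit 0 stands for 'a' here, so 'a' is the least significant bit rather than the leftmost digit.

```c
#include <stdint.h>

/* Map a character to a bit position (assumes an ASCII character set). */
static int summary_bit(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        c = (unsigned char)(c - 'A' + 'a');   /* fold case */
    if (c >= 'a' && c <= 'z')
        return c - 'a';                       /* bits 0..25 */
    if (c >= '0' && c <= '9')
        return 26 + (c - '0');                /* bits 26..35 */
    return 36;                                /* all other characters share one bit */
}

/* Build the 64-bit "which characters occur" summary of a string. */
uint64_t char_summary(const char *s)
{
    uint64_t bits = 0;
    for (; *s; ++s)
        bits |= (uint64_t)1 << summary_bit((unsigned char)*s);
    return bits;
}

/* Returns 0 when the needle definitely is not a substring of the haystack,
   nonzero when it might be. Note the parentheses: in C, == binds tighter
   than &, so they are required. */
int may_contain(uint64_t hay_bits, uint64_t needle_bits)
{
    return (hay_bits & needle_bits) == needle_bits;
}
```

The check only says "maybe": it ignores letter order and repetition, so "deb" passes against "abed" even though it is not a substring. A real substring comparison is still needed for every word that survives the filter.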

**MK27** · 12-16-2009

Great idea.

**slingerland3g** · 12-16-2009

When you say 'summary', is this a true representation of the word you are converting to an int? When you mask the bits, 'a.intsummary & b.intsummary' will never equal a.intsummary unless the ints are an exact match.

I am thinking that if you really need to look up a family of related nucleosides or chemical compounds, your best bet is to make use of strstr(). Why do all this conversion?

To add, there is also strtok().

The forums here go into great detail on their usage.

**MK27** · 12-16-2009

Originally Posted by slingerland3g

When you say 'summary', is this a true representation of the word you are converting to an int? When you mask the bits, 'a.intsummary & b.intsummary' will never equal a.intsummary unless the ints are an exact match.

Which is what makes it a great idea. If there are many many strings to check against, you will have reduced the number of expensive string matching function calls by a whole lot. So rather than calling strstr on all of them, you just have a set of ints representing the words in that string. If none of them match, the word cannot be in the string, and you do not have to bother with strstr at all.
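
To make the "skip strstr when the ints don't match" idea concrete, here is a rough C sketch of the scan loop. The `indexed_word` struct, the function names, and the bit layout are all hypothetical, and it assumes the words and the query have already been lowercased, since strstr() is case-sensitive.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *text;   /* the word itself (assumed lowercase) */
    uint64_t    bits;   /* precomputed character summary */
} indexed_word;

/* One bit per letter, one per digit, one shared bit for the rest (ASCII). */
static uint64_t summary(const char *s)
{
    uint64_t bits = 0;
    for (; *s; ++s) {
        unsigned char c = (unsigned char)*s;
        int bit = (c >= 'a' && c <= 'z') ? c - 'a'
                : (c >= '0' && c <= '9') ? 26 + (c - '0')
                : 36;
        bits |= (uint64_t)1 << bit;
    }
    return bits;
}

/* Precompute the summaries once, when the word list is loaded. */
void index_words(indexed_word *out, const char *const *words, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i].text = words[i];
        out[i].bits = summary(words[i]);
    }
}

/* Writes the indices of matching words into hits (room for n entries) and
   returns how many matched. *strstr_calls reports how many words survived
   the cheap filter and needed the expensive comparison. */
size_t search(const indexed_word *idx, size_t n, const char *query,
              size_t *hits, size_t *strstr_calls)
{
    uint64_t q = summary(query);
    size_t found = 0;
    *strstr_calls = 0;
    for (size_t i = 0; i < n; ++i) {
        if ((idx[i].bits & q) != q)
            continue;               /* a query character is missing: skip */
        ++*strstr_calls;
        if (strstr(idx[i].text, query) != NULL)
            hits[found++] = i;
    }
    return found;
}
```

The filter still visits every word, so the scan stays linear in the number of words. What it saves is most of the strstr() calls, and how many depends on how varied the character sets of the words are.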

**Adak** · 12-16-2009

If it's really critically time sensitive, and will be accessed more than 10,000 times, I'd use a hash. A bother to code, but you can't beat it for speed.

Hash function - Wikipedia, the free encyclopedia

For something simpler, I'd sort the words using C's qsort(), and then either do a binary search for any word you are looking for, on that array (or file), or I'd code up an index function (faster than binary searching, but also more work).

All three of the above ways of searching are fast. No slow pokes in the bunch:

1) Fastest - Hash
2) Faster - Index
3) Fast - Binary Search

I'm unsure what the characteristics of your own search would be. It sounds like a homegrown hash technique, which I'd not recommend for a beginner. All that depends on your own skill, and patience in proving it's accurate.
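
A small C sketch of the simpler qsort() plus binary search route, using the standard bsearch(). The helper names `sort_words` and `find_word` are made up for this example. Note that this finds whole words only; it does not answer substring queries.

```c
#include <stdlib.h>
#include <string.h>

/* Comparator for arrays of C strings: qsort and bsearch pass pointers
   to the array elements, i.e. pointers to (const char *). */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the word list in place, once, before any lookups. */
void sort_words(const char **words, size_t n)
{
    qsort(words, n, sizeof *words, cmp_str);
}

/* Returns the index of word in the sorted array, or -1 if it is absent. */
long find_word(const char *const *words, size_t n, const char *word)
{
    const char *const *hit =
        bsearch(&word, words, n, sizeof *words, cmp_str);
    return hit ? (long)(hit - words) : -1;
}
```

Sorting costs O(n log n) once, and each lookup afterwards takes O(log n) string comparisons.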

**slingerland3g** · 12-16-2009

Yes! I was thinking somewhat along those lines but was uncertain. Comparing the strings and then searching for substrings within each will still be the most time-consuming part, though. Using strcmp() for this is ugly!

Possible function to study for this very thing using hash techniques. (Partitioned version of qsort I presume)
Rabin–Karp string search algorithm - Wikipedia, the free encyclopedia
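
For reference, a compact C sketch of Rabin–Karp with a rolling polynomial hash. The base, the modulus, and the function name `rabin_karp` are arbitrary choices for this example, not taken from the linked article. A hash hit is always confirmed with memcmp(), because different windows can share a hash value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset of the first occurrence of pat in text, or -1. */
long rabin_karp(const char *text, const char *pat)
{
    const uint64_t base = 257, mod = 1000000007ULL;
    size_t n = strlen(text), m = strlen(pat);
    if (m == 0) return 0;
    if (m > n) return -1;

    /* high = base^(m-1) mod mod: the weight of the window's first char */
    uint64_t high = 1;
    for (size_t i = 1; i < m; ++i)
        high = high * base % mod;

    /* hash the pattern and the first window of the text */
    uint64_t hp = 0, ht = 0;
    for (size_t i = 0; i < m; ++i) {
        hp = (hp * base + (unsigned char)pat[i]) % mod;
        ht = (ht * base + (unsigned char)text[i]) % mod;
    }

    for (size_t i = 0; ; ++i) {
        /* only compare bytes when the hashes agree */
        if (hp == ht && memcmp(text + i, pat, m) == 0)
            return (long)i;
        if (i + m >= n)
            break;                  /* no further window fits */
        /* slide the window: drop text[i], append text[i + m] */
        ht = (ht + mod - (unsigned char)text[i] * high % mod) % mod;
        ht = (ht * base + (unsigned char)text[i + m]) % mod;
    }
    return -1;
}
```

This searches one string at a time, so on its own it does not replace an index over many words. It is mainly useful when the same pattern, or several patterns of the same length, must be checked against long texts.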

**jeffcobb** · 12-16-2009

Sorry to sound like an echo but this is what a hash table is for IMHO. The only thing you need to watch for are collisions. snippets.org has some sample simple hashing routines to get you started.

**slingerland3g** · 12-16-2009

Yeah, I have not dealt much with hashing algorithms, but I have dealt with many sorting and partitioning algorithms in my studies, so hashing was eluding me as a suggestion. My direction was to just sort the list of strings, perhaps using a partitioning sort method. If you know which substring matches within a particular key, this will speed up the search quite a bit.

**underthesun** · 12-16-2009

Wait, you mean there's a way to search for "romoolus" when you enter a search string "moo" using hash tables? Sorry, I should have made it clear that I'm looking for an index that prepares for substring search.

**jeffcobb** · 12-16-2009

I saw "fast string lookup" and jumped to the textbook solution; sorry, seriously. So if you want a substring lookup, an iterative substring search (Boyer-Moore maybe) is all the answer I can come up with. Now if you want something faster (the above is one step above a brute-force search), then maybe you can store the search key as a soundex and then search for a subset of *that* by making a soundex out of the search key, if possible. You are still more or less brute-force searching, but you will have a more defined subset to search for. You can find a simple soundex algorithm in the famous snippets.org...
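
As a rough illustration of the Boyer-Moore idea, here is a C sketch of the simplified Horspool variant, which keeps only the bad-character shift table. It is not full Boyer-Moore, and the function name `horspool` is just a label for this example.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first occurrence of pat in text, or -1. */
long horspool(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t shift[UCHAR_MAX + 1];
    if (m == 0) return 0;
    if (m > n) return -1;

    /* Default shift: a character that is not in the pattern lets the
       window jump by the whole pattern length. */
    for (size_t c = 0; c <= UCHAR_MAX; ++c)
        shift[c] = m;
    /* For characters in the pattern (except its last position), shift so
       that their rightmost occurrence lines up with the text. */
    for (size_t i = 0; i + 1 < m; ++i)
        shift[(unsigned char)pat[i]] = m - 1 - i;

    size_t pos = 0;
    while (pos + m <= n) {
        if (memcmp(text + pos, pat, m) == 0)
            return (long)pos;
        /* shift is decided by the text character under the pattern's end */
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return -1;
}
```

The shift table depends only on the pattern, so it could be built once per query and reused across every word in the list.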

Sorry for the misunderstanding; I am really curious to see what the more experienced folks come up with.

As a side note, I have never heard of the need for this kind of thing in an actual application; may I ask for a bit more information on the problem domain?

**underthesun** · 12-17-2009

As a side note, I have never heard of the need for this kind of thing in an actual application; may I ask for a bit more information on the problem domain?

I'm building an interactive visual query system for proteomic data (in layman's terms, biological data), and a lot of the data has things that can be searched for. Sometimes the words that can be searched are simple (adenosine, guanine, blah), but sometimes they are complicated (e.g., 6-guaninonic-3-pyruberatin-adipose or something like that).

Searching something that starts with pyru- and then being able to refine that search results with further checks is what I'm looking for.

Anyhow, thanks for that Boyer-Moore info, I'm gonna take a better look.

**zacs7** · 12-17-2009

Or if you have many strings you may want to use a suffix trie... Suffix tree - Wikipedia, the free encyclopedia

Plus you'll be saving memory if you compress it. All your requirements point to a suffix trie; for "matching" it will just depend on which path you take through the trie. Matching xyz- and -xyz is easy, among other things.

Hint: Build the suffix trie, and then compress it. Compressing from the start is rather challenging.
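
A bare-bones, uncompressed suffix trie in C, to show the shape of the idea. Everything here is an assumption made for the sketch: the names, the 7-bit ASCII alphabet, and above all the limit of 64 words, which comes from storing the set of matching words as bits in a `uint64_t` on each node. A real index would need a different word-set representation plus the compression step mentioned above.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

#define TRIE_ALPHABET 128            /* assumes 7-bit ASCII input */

typedef struct trie_node {
    struct trie_node *child[TRIE_ALPHABET];
    uint64_t words;                  /* bit i set: word i contains this path */
} trie_node;

/* Case-fold so that lookups are case-insensitive. */
static int trie_slot(char c)
{
    return tolower((unsigned char)c) & 0x7F;
}

/* calloc zero-fills the node; this relies on all-bits-zero being a null
   pointer, which holds on mainstream platforms. */
trie_node *trie_new(void)
{
    return calloc(1, sizeof(trie_node));
}

/* Insert every suffix of word. word_id must be below 64 in this sketch.
   Returns 0 on success, -1 if an allocation fails. */
int trie_add_word(trie_node *root, const char *word, unsigned word_id)
{
    for (const char *suffix = word; *suffix; ++suffix) {
        trie_node *node = root;
        for (const char *p = suffix; *p; ++p) {
            int slot = trie_slot(*p);
            if (!node->child[slot]) {
                node->child[slot] = trie_new();
                if (!node->child[slot])
                    return -1;
            }
            node = node->child[slot];
            node->words |= (uint64_t)1 << word_id;
        }
    }
    return 0;
}

/* Returns the bitset of word ids that contain query as a substring.
   An empty query returns 0, since the root carries no word bits. */
uint64_t trie_find(const trie_node *root, const char *query)
{
    const trie_node *node = root;
    for (; *query && node; ++query)
        node = node->child[trie_slot(*query)];
    return node ? node->words : 0;
}

void trie_free(trie_node *node)
{
    if (!node)
        return;
    for (int i = 0; i < TRIE_ALPHABET; ++i)
        trie_free(node->child[i]);
    free(node);
}
```

A lookup costs time proportional to the query length, regardless of how many words are indexed. The price is memory: an uncompressed trie with a 128-pointer array per node grows quickly, which is why the compression hint matters.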
