Thread: Search question

  1. #1
    Widdle Coding Peon Aerie's Avatar
    Join Date
    Dec 2004
    Posts
    115

    Search question

    I was thinking about search functions, and how they work, and while I imagine this has been done before, I want to know if it's a good idea:
    Code:
    int search(const char target[], const char strtofind[], int stfsiz)
    {
        int i = 0;
        int dectr;

        /* stop once the search string can no longer fit in what's left */
        while ((i + stfsiz) <= MAXTARGETSIZE && target[i] != '\0') {
            if (target[i] == strtofind[0]) {
                /* candidate found: compare from the last character backward */
                dectr = stfsiz - 1;
                while (dectr >= 0 && target[i + dectr] == strtofind[dectr])
                    dectr--;
                if (dectr < 0)
                    return i;    /* every character matched */
            }
            i++;
        }
        return -1;
    }

    The logic here is that if the string you're searching for is very long, it's probably a good idea to start at the end of it and 'search back' from where its last character would fall, assuming i is the start of a match. This rests on the assumption that a non-match is more likely to diverge from the search string far from the beginning than near it.

    I.e., while searching for, say, "mushroom" in an arbitrary string of letters, I'm assuming I'm less likely to encounter an m, then an o, then another o while comparing backward from the i+7th character than I am to find a u, s, and h counting forward.

    I know there are better ways to implement this, but hopefully it demonstrates the idea without me having to work too hard on something I'm not even sure is valid.
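
    For example, with MAXTARGETSIZE #defined as the size of the target buffer, calling it would go something like this (just a quick throwaway test, nothing rigorous):
    Code:
    #include <stdio.h>
    #include <string.h>

    #define MAXTARGETSIZE 64

    int search(const char target[], const char strtofind[], int stfsiz);

    int main(void)
    {
        const char text[MAXTARGETSIZE] = "a field full of mushrooms";

        /* prints "found at index 16" */
        int pos = search(text, "mushroom", (int)strlen("mushroom"));

        if (pos >= 0)
            printf("found at index %d\n", pos);
        else
            printf("not found\n");
        return 0;
    }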
    I live in a giant bucket.

  2. #2
    Registered User
    Join Date
    Sep 2001
    Posts
    4,912
    What exactly are you trying to improve? Accuracy? Speed and Efficiency? In either case, I don't see how searching backwards would aid you. You still need to see the whole string no matter what.

  3. #3
    Widdle Coding Peon Aerie's Avatar
    Join Date
    Dec 2004
    Posts
    115
    I'm trying to increase the speed at which a non-hit will be detected.

    Basically, the answer to the question I have is mostly statistical, since it deals with the chance a stream of data will diverge/reconverge with the string being searched for.

    So maybe I should go find a statistics board and ask there...
    I live in a giant bucket.

  4. #4
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    >I'm trying to increase the speed at which a non-hit will be detected.
    Just for fun, try something like this:
    Code:
    loop over each position i in text
      if length(text) - i < length(match)
        return not found
      if text[i] == match[0] and text[i + length(match) - 1] == match[length(match) - 1]
        do thorough test
    If the first and last characters of both the text and search string match, what's the probability that it'll be a hit? What if the first, last, and middle characters match? The trick for optimizing for search misses is to minimize how many comparisons you make. This is more difficult for general search algorithms because you can't assume anything about the data. But it's an entertaining exercise, if nothing else.
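
    In C it might come out roughly like this (off the top of my head, so treat it as a sketch rather than tested code, and quick_search is just a throwaway name):
    Code:
    #include <string.h>

    /* sketch of the first/last filter: check the cheap endpoints before
       paying for a full memcmp of the candidate window */
    int quick_search(const char *text, const char *match)
    {
        size_t tlen = strlen(text);
        size_t mlen = strlen(match);
        size_t i;

        if (mlen == 0 || mlen > tlen)
            return -1;

        for (i = 0; i + mlen <= tlen; i++) {
            if (text[i] == match[0] && text[i + mlen - 1] == match[mlen - 1]) {
                /* endpoints agree, now do the thorough test */
                if (memcmp(text + i, match, mlen) == 0)
                    return (int)i;
            }
        }
        return -1;
    }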
    My best code is written with the delete key.

  5. #5
    Widdle Coding Peon Aerie's Avatar
    Join Date
    Dec 2004
    Posts
    115
    Prelude:
    That's pretty much how I was operating: assuming that as long as you keep the number of operations per comparison low, and spread the comparison points around so they're at least not strictly linear, some performance would be gained...

    The code you just wrote out was actually nearly identical to another idea I had, though I was considering extending it to perform a rotating search from the 'middle out' of the target section of the array. I think I might enjoy writing search functions once I get a better grip on C overall.
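
    The middle-out check of a single candidate position would go something like this (a rough, untested sketch with made-up names; the caller would have to make sure pos + stfsiz stays inside the target):
    Code:
    /* compare a candidate position starting at the middle of the search
       string and alternating outward, on the theory that a mismatch
       turns up sooner that way */
    int middle_out_match(const char target[], int pos, const char strtofind[], int stfsiz)
    {
        int mid = stfsiz / 2;
        int offset;

        for (offset = 0; offset <= mid; offset++) {
            /* check mid+offset, then mid-offset */
            if (mid + offset < stfsiz &&
                target[pos + mid + offset] != strtofind[mid + offset])
                return 0;
            if (mid - offset >= 0 &&
                target[pos + mid - offset] != strtofind[mid - offset])
                return 0;
        }
        return 1;    /* all characters matched */
    }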
    I live in a giant bucket.

  6. #6
    The C-er
    Join Date
    Mar 2004
    Posts
    192
    Hi Aerie,

    I think I see what you're saying. Is it to do with spatial locality of the data set?

  7. #7
    verbose cat
    Join Date
    Jun 2003
    Posts
    209
    The problem with not being able to make assumptions about your data (such as alphabetical order, most large words are near the end, etc.) is that whether you start your search at the beginning or start from some place else, you still have to go through the same number of comparisons for any given search. Given a data set:

    C F A D E B G

    If you are looking for 'A' and your algorithm starts in the middle, you still have to do at least two comparisons. Looking for 'G' would still require you to compare every item in the list.

  8. #8
    Registered User
    Join Date
    Dec 2004
    Posts
    5
    May I suggest a more general linear-time algorithm such as KMP? Tricks such as starting in the middle or starting at the end rely on assumptions about the data that, as mentioned previously, aren't great to make.

    KMP stands for Knuth-Morris-Pratt.
    Try Googling. :-)

    I believe string searching takes linear time at a minimum (i.e., there's no magic algorithm that will find the match without looking through the whole text at least once in the worst case).
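
    If it helps, the rough shape of KMP is something like this (written from memory, so double-check it against a proper reference before relying on it):
    Code:
    #include <stdlib.h>
    #include <string.h>

    /* KMP, roughly: build a table of how far the pattern can 'fall back'
       after a mismatch, so text characters are never re-examined */
    int kmp_search(const char *text, const char *pat)
    {
        int n = (int)strlen(text);
        int m = (int)strlen(pat);
        int *fail, i, k;
        int result = -1;

        if (m == 0)
            return 0;
        fail = malloc(m * sizeof *fail);
        if (fail == NULL)
            return -1;

        /* failure table: fail[i] = length of the longest proper prefix of
           pat[0..i] that is also a suffix of it */
        fail[0] = 0;
        for (i = 1, k = 0; i < m; i++) {
            while (k > 0 && pat[i] != pat[k])
                k = fail[k - 1];
            if (pat[i] == pat[k])
                k++;
            fail[i] = k;
        }

        /* scan the text, reusing the table on mismatches */
        for (i = 0, k = 0; i < n; i++) {
            while (k > 0 && text[i] != pat[k])
                k = fail[k - 1];
            if (text[i] == pat[k])
                k++;
            if (k == m) {
                result = i - m + 1;
                break;
            }
        }
        free(fail);
        return result;
    }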

  9. #9
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    "boyer moore" is another fast string searching algorithm (just for interest)

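    Horspool's simplified variant of Boyer-Moore fits in a few lines: compare each window from its last character backward (much like the original idea in this thread), and on a mismatch skip ahead by a precomputed amount. Roughly like this (untested sketch):
    Code:
    #include <limits.h>
    #include <string.h>

    /* Horspool's simplification of Boyer-Moore: on a mismatch, skip ahead
       using a table keyed on the text character under the end of the window */
    int horspool_search(const char *text, const char *pat)
    {
        size_t n = strlen(text);
        size_t m = strlen(pat);
        size_t skip[UCHAR_MAX + 1];
        size_t i, k;

        if (m == 0 || m > n)
            return -1;

        /* by default a bad character lets us skip the whole pattern length */
        for (i = 0; i <= UCHAR_MAX; i++)
            skip[i] = m;
        /* characters that occur inside the pattern allow a shorter skip */
        for (i = 0; i < m - 1; i++)
            skip[(unsigned char)pat[i]] = m - 1 - i;

        i = 0;
        while (i + m <= n) {
            k = m - 1;
            while (text[i + k] == pat[k]) {
                if (k == 0)
                    return (int)i;
                k--;
            }
            i += skip[(unsigned char)text[i + m - 1]];
        }
        return -1;
    }
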
  10. #10
    Widdle Coding Peon Aerie's Avatar
    Join Date
    Dec 2004
    Posts
    115
    Of course you cannot implement a search algo that avoids checking the full length of the search string, but there are surely ways, without compromising that stipulation, to compare all the pertinent elements in an order that detects non-hits sooner.

    Almost all search scenarios have one reliable property of their target data sets: the 'non-hits' will vastly outnumber the 'hits'.

    In other words, most of the time it doesn't really matter how long it takes to assure yourself that a potential match actually is a match. That time penalty is offset by how efficiently you eliminate all the non-hits that don't share much similarity with the search string, assuming the target doesn't contain an inordinately high number of those 'almost-hits'.
    I live in a giant bucket.

  11. #11
    Widdle Coding Peon Aerie's Avatar
    Join Date
    Dec 2004
    Posts
    115
    PS: I'll also take a look at the algos mentioned; maybe they'll give me some ideas (or show me how poor my current ones are).
    I live in a giant bucket.

  12. #12
    Registered User
    Join Date
    Dec 2004
    Posts
    5
    I just mentioned KMP because...

    (1) KMP is proven correct mathematically! :-) (I'm a computer science major.)

    (2) Yes, it is true that you can try to turn up non-matches sooner, but how will you prove that this works in the general case?

    I believe it took years and years to come up with a good string matching algorithm. A new algorithm would probably be welcomed, but the real difficulty is proving that it works.
