Thread: Word Searching with an Index

  1. #1
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868

    Word Searching with an Index

    This is a self indexing word searcher program, designed to show the basics of the logic needed. Lots of useful bits of code for a working program, are not included (like good memory usage), guarded user input, etc.

    This is not well tested code, but I wanted to post it before the other word searching thread was too far gone off the first page of the forum.

    Comments always welcome.

    The effect of the indexing, (a basic one level indexing), is that the search is restricted from the entire word list, to only searching through one letter's words. Search for a word beginning with 'a', and you only have to search from "a" to "azure".

    You can extend the indexing levels to as many as you want. The greater the cost of each comparison, the more valuable indexing becomes to the search. There is a more advanced technique for working with HD's, that go beyond the scope of this program.

    The 29k+ word file appropriate for this program, is here, but any wordlist with short words only (less than 9 char's), is fine. This works only with lowercase words, presently.

    swoopshare - Download wordList.txt

    Code:
    /* 
    This is a word searching program, but it's text based, instead 
    of binary. It's purpose is educational, rather than utilitarian.
    
    It shows how to automatically build an index array, and how that 
    index array can then be used to cut down the search space, by a 
    huge amount, using very little time.
    
    It is designed to search through a large wordlist.txt file, with each
    word on it's own row, e.g.:
    
    a
    aardvark
    aback
    abacus
    etc.
    
    The list must be alphabetized and, for this example program, the
    words must have less than 9 letters, and lowercase letters only. 
    
    The total wordList.txt file must have less than 30,000 words.
    
    The words array wastes a lot of memory, but saving memory isn't what
    this program is trying to show. I did not want to distract from it's
    purpose of showing an indexed search.
    
    This version simply looks up one word given by the user. Your version
    might search for every word in a text file you name, count the 
    number of searches made, and time the entire run.
    
    I have uploaded the wordlist I used, to Swoopshare at:
    
    http://www.swoopshare.com/file/8ef2c3aab1cee0a2f0199885dade310f/wordList.txt.html
    
    but any sorted wordlist with lowercase words less than 9 letters, will be fine.
    
    --Adak
    
    
    Note: a newline is two chars: CR and LF, with ascii values of 13 
    and 10, in text files.
    
    Created on the Pelles C (free) Windows C compiler.
    */
    #include <stdio.h>
    #include <string.h>
    
    #define MAX 11
    
    char words[29300][MAX];
    
    int binarySearch(const char *, int, int);
    void getInput(char *);
    
    int main(void) {
       int i, idx, n, lo, hi;
       FILE *fp;
       int index[27]={0};
       char goal[MAX];
       char testchar;
       
    
       if((fp=fopen("wordList.txt", "r")) == NULL) {
          perror("Error: unable to open the wordlist.txt file");
          return 0;
       }
       printf("\nLoading words\n"); 
    
       //fill the index array
       testchar='a'-1;
       n = 0; i = idx = 0;
       while((fgets(words[n], MAX, fp)) != NULL) {
          if((words[n][0]) > testchar) {  
             testchar = words[n][0];
             index[testchar-'a'] = n;  //char - 'a' makes a=0, b=1, c=2, etc.
          }
          ++n;  //line counter
       }
       index[26] = n-1;
       fclose(fp);
       printf("\nThe Index Array Table:\nLetter\tStarts At Line #\n"); //show the index array
       printf("======================\n");
       for(i=0;i<26;i++) {
          printf(" %c: %10d\n", (i+'a'), index[i]);
       }
        
       //get user input
       getInput(goal);
    
       /* This is the big benefit of using an index. Only one index level is used here, but you can make as 
          many levels as you need, using the second, third, and etc., letter of the goal word.
       */
       //set the lo and hi indices
       i = goal[0]-'a';  //<-- 
       printf("\n lowest word in the search: %shighest word in the search: %s\n", words[index[i]], words[index[i+1]-1]); 
       //getchar();
       
       lo = index[i]; hi = index[i+1];
       idx=binarySearch(goal, lo, hi);
       if(idx > -1) {
          printf(" The word: %swas found: %sat index#: %d \n\n",\
          goal, words[idx], idx);
       }else  
          printf("The word: %swas not found\n\n", goal);
    
       return 0;
    }
    int binarySearch(const char goal[], int lo, int hi) {
       int mid;
    
       while (lo <= hi) {
          mid = (lo + hi) / 2;
          /* watch how the binary search zerooes in on the goal word  */
          printf("goal: %s mid: %d  words[mid]: %s", goal, mid, words[mid]); 
          printf("stringcmp returns: %d\n", strcmp(goal, words[mid]));
          getchar();
          if (strcmp(goal, words[mid]) < 0)     //goal < words[mid]: -1
             hi = mid-1;
          else if(strcmp(goal, words[mid]) > 0) //goal > words[mid]:  1 
             lo = mid+1;
          else
             return mid;
       }
       return -1;
    }
    /* needs guard code to stop empty input from crashing it, etc.*/
    void getInput(char goal[]) {
       /* MAX-3 will be a CR, MAX-2 will be LF, and MAX-1 will be the end of 
          string char '\0', on the longest words that fit properly into the array 
       */
       printf("\nEnter a word to be searched for, up to %d letters: ", MAX-3);
       fgets(goal, MAX, stdin);  //we need to keep the newline
       printf("String Length is %d\n", strlen(goal));
    
       /* kept the newlines, because the words in the 
          list, all have newlines included at the end of them
       */
    
    }

  2. #2
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    This is a new version (0.01), of the Index word searching program, above. It has several enhancements, among them it's ability to find and search for all the words, in an entire text, instead of from user input of only one word.

    Other changes and new files for it to work with and on, are noted just below.

    Code:
    /* 
    This is a word searching program, but it's text based, instead 
    of binary. It's purpose is educational, rather than utilitarian.
    
    It shows how to automatically build an index array, and how that 
    index array can then be used to cut down the search space.
    
    The indices are passed on to the binary search, to narrow the range 
    of words that the binary search must handle.
    
    It is designed to work with a large wordlist.txt file, with each
    word on it's own row, e.g.:
    
    a
    aardvark
    aback
    abacus
    etc.
    
    The list must be alphabetized and, for this example program, the
    words must have less than 16 letters, and have lowercase letters only. 
    
    The total wordList.txt file must have less than 39,200 words.
    
    The words array wastes a lot of memory, but saving memory isn't what
    this program is trying to show. I did not want to distract from it's
    purpose of showing an indexed search.
    
    This version finds every word in a text file, and searches for it, 
    giving just a summary of the words found and not found, and 
    displaying the elapsed time of the program's run. 
    
    
    Note: This is version 0.1 
    
    The changes include:
    
    1) All newlines have been removed from the words
    
    2) Input is from a file, rather than from the user. The file is the Gettysburg
       Address, right now. It is available all over the net, but you can also get it
       from Swoopshare here:
    
    http://www.swoopshare.com/file/430710f78c16fee9096e92dbefd80c6f/Gettysburg.txt.html
    
    3) The run is timed by the program. 
    
    4) Educational message output is commented out from the active code.
    
    5) A larger wordlist.txt file is used, and longer words can be searched for. The wordlist file
       has been uploaded to Swoopshare at:
    
    http://www.swoopshare.com/file/c1d80e4e64ad601a30b3e2fdbcf09ed0/wordList.txt.html
    
    6) The first letter of the word being searched for, is changed from uppercase to lowercase, as needed.
    
    7) The display is slightly more organized.
    
    --Adak
    
    Created on the Pelles C (free) Windows C compiler.
    */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    #include <time.h>
    
    #define MAX 17 
    
    //longest word is 15 chars
    
    char words[39200][MAX];
    
    int binarySearch(const char *, int, int);
    int getInput(char *, int);
     
    
    int main(void) {
       int c, i, idx, n, lo, hi, len, inWord=0, found, notFound;
       FILE *fp, *fpget;
       int index[27]={0};
       char goal[MAX];
       char testchar;
       clock_t start, stop;
    
       if((fp=fopen("wordlist.txt", "r")) == NULL) {
          perror("Error: unable to open the wordlist.txt file");
          return 0;
       }
       printf("\nLoading words\n"); 
    
       start = clock();
       //fill the index array
       testchar='a'-1;
       n = 0; i = 0;
       while((fgets(words[n], MAX, fp)) != NULL) {
          len = strlen(words[n]);
          if(words[n][len-1]=='\n') //remove the newline
             words[n][len-1]='\0';
    
          if((words[n][0]) > testchar) {  
             testchar = words[n][0];
             index[testchar-'a'] = n;  //char - 'a' makes a=0, b=1, c=2, etc.
          }
          ++n;  //line counter
       }
       index[26] = n-1;
       fclose(fp);
       printf("\nThe Index Array Table:\nLetter\tStarts At Line #\n"); //show the index array
       printf("======================\n");
       for(i=0;i<26;i++) {
          printf(" %c: %10d\n", (i+'a'), index[i]);
       }
       putchar('\n'); 
       //get user input
       if((fpget=fopen("Gettysburg.txt", "r")) == NULL) { 
           perror("Error: Unable to open the Gettysburg.txt file\n");
          return 0;
       }
       //get a word from the text file
       i=found=notFound=0;
       while((c=fgetc(fpget)) != EOF) {
          if(isalpha(c)) {
             inWord=1;
             goal[i]=c;
             //printf("%d: %c ", i, goal[i]); getchar();
             ++i;
          }
          else {
             goal[i]='\0';
             //putchar('\n');
             if(inWord) {
                inWord=0;
             
               //printf("  goal: %s \n", goal); getchar();
               /* This is the benefit of using an index. Only one index level is used here, but you can make as 
                  many levels as you need, using the second, third, and etc., letter of the goal word.
               */
    
          //set the first letter to lowercase
          goal[0]=tolower(goal[0]);      
          //set the lo and hi indices
          i = goal[0]-'a';  //<-- 
          //printf("\n lowest word in the search: %s\nhighest word in the search: %s\n\n", words[index[i]], words[index[i+1]-1]); 
          lo = index[i]; hi = index[i+1]-1;
          idx=binarySearch(goal, lo, hi);
          if(idx > -1) {
             //printf(" The word: %s was found: %s at index#: %d \n\n", goal, words[idx], idx);
             ++found;
          }else  {
             printf("The word: %s was not found\n", goal);
             ++notFound;
          }
          goal[0]='\0';
          i=0;       
          //getchar();
             }
          }
       }
       fclose(fpget);
       stop=clock();
       printf("\n\n %d words were found  and %d words were not found in the wordlist\n\n", found, notFound);
       printf("Elaspsed time for this run: %.3lf seconds.\n", (double)(stop-start)/CLOCKS_PER_SEC);
       return 0;
    }
    
    int binarySearch(const char goal[], int lo, int hi) {
       int mid;
    
       while (lo <= hi) {
          mid = (lo + hi) / 2;
          /* watch how the binary search zerooes in on the goal word  */
      
          //printf("goal: %s mid: %d  words[mid]: %s  ", goal, mid, words[mid]); 
          //printf("stringcmp returns: %d\n", strcmp(goal, words[mid]));
          //getchar();
          if (strcmp(goal, words[mid]) < 0)     //goal < words[mid]: -1
             hi = mid-1;
          else if(strcmp(goal, words[mid]) > 0) //goal > words[mid]:  1 
             lo = mid+1;
          else
             return mid;
       }
       return -1;
    }
       /* needs guard code to stop empty input from crashing it, etc.
          In this version, this function is never called.
       */
    
    int getInput(char goal[], int i) {
       int len;
       
       if(i) {
          printf("\nEnter a word to be searched for, up to %d letters: ", MAX-2);
          fgets(goal, MAX, stdin);  
          len = strlen(goal);
          if(goal[len-1]=='\n')
             goal[len-1]='\0';
          printf("String Length is %d\n", strlen(goal));
       }
       return 0;
    }

  3. #3
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Using a slightly expanded wordList file, the Word Searcher result of running it on The Art of War by Sun Tzu:
    Attached Images Attached Images Word Searching with an Index-textof_theartofwar-png 

  4. #4
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    You could replace:
    Code:
          len = strlen(words[n]);
          if(words[n][len-1]=='\n') //remove the newline
             words[n][len-1]='\0';
    With:
    Code:
    words[n][ strspn( words[n], "\n" ) ] = '\0';

    Quzah.
    Last edited by quzah; 10-21-2011 at 06:07 PM.
    Hope is the first step on the road to disappointment.

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Thanks Quzah, and that's a new function to me.

    I tried your code, but it didn't work.
    Code:
    words[n][strspn ( words[n], "\n" ) ] = '\0';  //cut and paste of what I tried
    Do I need some extra parenthesis in this expression?
    Last edited by Adak; 10-21-2011 at 08:19 PM.

  6. #6
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    strspn(3): search string for set of char - Linux man page

    I think you want strcspn. That will give you the length of all the contiguous non-newline chars, which should correspond to the index of the first newline. Personally, I prefer the strchr version:
    Code:
    char *p = strchr(words[n], '\n');
    if (p)
        *p = '\0';

  7. #7
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by anduril462 View Post
    That's the one I was thinking of. I don't use either one of those that often, and all I could think of was strspn.


    Quzah.
    Hope is the first step on the road to disappointment.

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    @Thanks Anduril, that's another new function to learn about.

    I've extended the word search up from one word, to entire files. Today, I downloaded "Kidnapped" by RL Stevenson, and "A Tale of Two Cities", by Dickens, from the Guttenberg project, and ran them through the word searcher program.

    First result was from using the Index logic. The second result was from using only regular binary search. This was the result:
    Word Searching with an Index-compareindex2binary-png

    Indexing in this algorithm, makes no sense, when there is almost no cost for making the comparisons to the array words. Yes, indexing still makes fewer comparisons, but since the cost is practically nothing, why would you use indexing?

    I could add another layer of two of indexing to the program, but it would offer no improvement, as long as the comparisons with the words in the array, can be done at almost no cost. The benefit of indexing showed clearly with small drive buffers and it's resulting higher cost comparisons (slower memory access, slower cpu's with smaller cache's, etc.), has sharply diminished.

    I thought about putting the words all into a second array, and THEN conducting all the word searches, since that would eliminate the interlude of getting each word, which this program uses. I decided against it because

    1) I like the utility of naming the text file, and letting it run, despite any memory problems caused by a large book, and

    2) The run time for an entire novel is less than 1/10th of a second on a good PC.

    That's fast enough for me.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Searching word(string) in matrix.
    By Aleremilcapo in forum C Programming
    Replies: 7
    Last Post: 09-22-2010, 09:41 PM
  2. Searching a specific string/word/etc in a text file?
    By zacharyrs in forum C Programming
    Replies: 5
    Last Post: 11-29-2009, 07:54 PM
  3. searching a text file for a word
    By satory in forum C Programming
    Replies: 5
    Last Post: 02-22-2005, 01:04 PM
  4. Help with searching word lists
    By denizengt in forum C Programming
    Replies: 3
    Last Post: 10-02-2003, 02:03 PM
  5. Replies: 2
    Last Post: 01-02-2002, 02:05 PM