Thread: I split a sentence into a word per line, now I want to compare word to a word in file

  1. #16
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    On indexed searching:

    This is a way to speed up the search for a matching word. You may not want or need it now, but it's good to know it's possible.

    I call this an indexed search and it is faster than the normal binary search. It can be used as an "extra" with the binary search, to make it smarter and faster, or it can be used "full strength" with no binary search portion. The version I'll describe here, is the "extra" one.

    Let's say you have 100,000 words, and you're able to put them all into one array. (This works even better with a file, but we'll use an array for the example.)

    Normally a binary search would begin with midpoint (mid) = (0 + 100,000) / 2. That leaves us to start at index 50,000, and we may need as many as 16 more loops to find out if we have the word we're searching for or not. That's 16 possible string comparisons, which are much slower than a simple integer comparison.

    Index searching to the rescue!

    You'll have a small array of structs with two struct members: "low" and "high", both integers. Call the struct "go". The small array is made up by your program, when it first starts. As the words are loading into the word array, it counts how many words there are, and records the word number, every time the first letter of the word list, changes.

    Now your little go array, has 'a' words starting at 0 index, and stopping at index 4558 (whatever).

    When you go to search for "apple", your binary search won't start at 'k', because the index shows that go[0].low == 0, and that go[0].high == 4558. Now your binary search will be limited to indices 0 to 4558. Your search will begin then, at 2279.

    With just a small number of comparing with one letter, you've removed over 95,000 words from the search space. With luck, and tweaking, you can set it up so you are fitting what you need into fast cache memory, and not causing page or cache memory flushing.

    This is most impressive when the cost (in time), of comparisons is higher, like a slow disk or network access. It can also be optimized more, if you need it.

    Note: you can quickly find the right index for go, by using the ASCII values of the letter. Say the first letter of the word is 'A', consider that when 'A' is lower cased, then:

    go[firstLetter - 'a'] == index 0 (which is the index to that first letter, in the go array)
    go['z' - 'a'] == index 26 (the index for "Zanzibar").
    etc.

  2. #17
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    That code I posted before was wrong, so here is the code that can split large arrays:

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
     
    int main ()
    {
      char *b = malloc(sizeof(char)*1000);
      char * pch;
      FILE * fp;
      fp = fopen("worklist.txt", "r");
      /* insert this */
      if(!fp) {  //if fp was given NULL by fopen() meaning the file was not opened
          perror("Error: file worklist.txt was not found or opened");
          return 0;
      }
      printf("Input a short sentence: ");
      gets(b);
      pch = strtok (b," ,.-");
      while (pch != NULL)
      {
        printf ("%s\n",pch);
        pch = strtok (NULL, " ,.-");
      }
      getch();
      return 0;
    }
    In my txt file I have 'dog n' and nothing else, but how to use strcmp to open the file and print 'dog n' to the console screen when all I type in is 'dog' I still don't know. But you two have given me a lot of good advice I can use when I thoroughly understand C, I'm still reading the book Commontater gave me here a while back.
    Thanks so far though.

  3. #18
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    This code works if I have only 'dog n' in the txt file.

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
     
    int main ()
    {
      char *b = malloc(sizeof(char)*1000);
      char *filedata = malloc(300);
      char * pch;
      FILE * fp;
      fp = fopen("worklist.txt", "r");
      /* insert this */
      if(!fp) {  //if fp was given NULL by fopen() meaning the file was not opened
          perror("Error: file worklist.txt was not found or opened");
          return 0;
      }
      printf("Input a short sentence: ");
      gets(b);
      pch = strtok (b," ,.-");
      while (pch != NULL)
      {
        
      if(strcmp(pch, "dog") == 0){
        // Set pointer to beginning of file:
        fseek( fp, 0L, SEEK_SET );
        // Read data back from file:
        fscanf(fp, "%[^\n]", filedata); // or just use fgets
        printf("%s\n",filedata);
        }
        pch = strtok(NULL, " ,.-");
      }
      getch();
      return 0;
    }
    When I have more lines above 'dog n' in the txt file those are printed instead.

  4. #19
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    You're rewinding the file pointer, instead of just printing from where it is:
    Code:
      if(strcmp(pch, "dog") == 0){
    
        // Set pointer to beginning of file:  <--DON'T DO THIS!
        fseek( fp, 0L, SEEK_SET );
        // Read data back from file:
        fscanf(fp, "%[^\n]", filedata); // or just use fgets
        printf("%s\n",filedata);
        }
        pch = strtok(NULL, " ,.-");
      }
    Read in your target word (the one you're searching for), into a char array, instead of using a literal like "dog". And don't use fseek() to reset the file pointer back to the beginning of the file, when you find the target word.

    My simple idea was to read in each word in the search space (making the search space as small as you need), comparing it with strcmp(). When strcmp() returns 0, you have found it.

    You won't need strtok anymore, either. Just change your fscanf() to:

    Code:
    fscanf(fp, "%s %c ", word, &wrdtype);
    where word is a char array, and type is a char. Note the spaces in the format string, very carefully. I would leave the hyphen out of the word list, since it serves no useful purpose, and is a harder to type key on the top row of the keyboard. Just a quick and easy space is fine.

    For any lines that you have already entered a hyphen into, just replace it using the replace function in your editor.

    dog n

    instead of
    dog-n

    or
    dog - n

    dog n is simplest text representation and fastest to enter and search.

  5. #20
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    Here is the code I have so far:

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
     
    int main ()
    {
      char *b = malloc(sizeof(char)*1000);
      char *filedata = malloc(300);
      char * pch;
      FILE * fp;
      fp = fopen("worklist.txt", "r");
      /* insert this */
      if(!fp) {  //if fp was given NULL by fopen() meaning the file was not opened
          perror("Error: file worklist.txt was not found or opened");
          return 0;
      }
      printf("Input a short sentence: ");
      gets(b);
      pch = strtok (b," ,.-");
      while (pch != NULL)
      {
      if(strcmp(pch, "dog") == 0){
      // Set pointer to beginning of file:
      fseek( fp, 0L, SEEK_SET );
      // Read data back from file:
      // fscanf(fp, "%[^\n]", filedata);
      while(fgets(filedata, 300, fp))
        {
        if(memcmp(pch, filedata, strlen(pch)) == 0) break;
        else strcpy(filedata, "Not a match!");
        }
        printf("%s\n",filedata);
        }
        pch = strtok (NULL, " ,.-");
      }
      getch();
      return 0;
    }
    If the txt file does not list 'dog n' at the top of the list I still see 'dog n' in the console window, so the code works as intended.

    I don't have anything else to do really, but I would appreciate it if you posted the entire code your talking about, otherwise I'm happy with what I have. Thank you both for all the help you gave me.

  6. #21
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    Quote Originally Posted by Adak View Post
    Read in your target word (the one you're searching for), into a char array, instead of using a literal like "dog". And don't use fseek() to reset the file pointer back to the beginning of the file, when you find the target word.

    My simple idea was to read in each word in the search space (making the search space as small as you need), comparing it with strcmp(). When strcmp() returns 0, you have found it.

    You won't need strtok anymore, either. Just change your fscanf() to:

    Code:
    fscanf(fp, "%s %c ", word, &wrdtype);
    I made this code but it's wrong.

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
      
    int main ()
    {
      char *b = malloc(sizeof(char)*1000);
      char *filedata[] = {"ability"};
      char * pch;
      FILE * fp;
      fp = fopen("worklist.txt", "r");
      /* insert this */
      if(!fp) {  //if fp was given NULL by fopen() meaning the file was not opened
          perror("Error: file worklist.txt was not found or opened");
          return 0;
      }
      printf("Input a short sentence: ");
      gets(b);
      pch = strtok (b," ,.-");
      while (pch != NULL)
      {
         
      if(strcmp(pch, filedata) == 0){
        // Read data back from file:
        fscanf(fp, "%s %c ", filedata, &pch);
        printf("%s\n",filedata);
        }
        pch = strtok(NULL, " ,.-");
      }
      getch();
      return 0;
    }
    I put the word into a char array and tried to get the while loop going but couldn't.

  7. #22
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    This is something I put together for you. It's not exactly what I was talking about, which is rather more complicated and difficult to learn from, perhaps.

    This is my sample wordlist.txt file contents, with the alphabetized words from a song in Sound of Music:

    and
    bless
    bloom
    blossom
    bright
    clean
    edelweiss
    every
    forever
    greet
    grow
    happy
    homeland
    look
    may
    me
    meet
    morning
    my
    of
    small
    snow
    to
    white
    you

    And this is what it's building - an index array, to shorten the search for a large number of words (much more than the small file shown here).

    We want this:
    Code:
    The Index Array Table:
    Letter	Starts At
    ======================
     a:       0
     b:       1
     c:       5
     d:       0
     e:       6
     f:       8
     g:       9
     h:       11
     i:       0
     j:       0
     k:       0
     l:       13
     m:       14
     n:       0
     o:       19
     p:       0
     q:       0
     r:       0
     s:       20
     t:       22
     u:       0
     v:       0
     w:       23
     x:       0
     y:       24
     z:       0
    The first word is,
    and

    located on line #0 in wordlist.txt

    bless
    which is to say, the start of the B's, is at line # 1.

    you
    is at line # 24.

    The a's should generally start with line 0, but any other letter with line #0, is a letter with no words which begin with that letter.

    If you add the above step to your search, and use a binary search, you will have a faster word searcher!

    I haven't finished the code for it, but this is the start of it, including the indexing. It's not polished up, and certainly not tested yet, so be aware -- it's just a prototype program at present.

    Code:
    /* just getting started. An example rather than a working utility */
    
    #include <stdio.h>
    #include <string.h>
    
    #define MAX 30
    
    char words[50][MAX];
    
    //int binary(char *, int, int);
    //void getInput(char *);
    
    
    int main(void) {
       int i, idx, n;
       FILE *fp;
       int index[27]={0};
       //char goal[MAX],
       char testchar;
       
    
       if((fp=fopen("wordlist.txt", "r")) == NULL) {
          perror("Error: unable to open the wordlist.txt file");
          return 0;
       }
       //build the index
       testchar='a'-1;
       n = i = 0;
       printf("%d\n", n); 
       while((fgets(words[n], MAX, fp)) != NULL) {
          if((words[n][0]) > testchar) {  
             testchar = words[n][0];
             index[testchar-'a'] = n;  //char - 'a' makes a=0, b=1, c=2, etc.
          }
          ++n;  //line counter
       }
       
       printf("\nThe Index Array Table:\nLetter\tStarts At\n");
       printf("======================\n");
       for(i=0;i<26;i++) {
          printf(" %c: %8d\n", (i+'a'), index[i]); //nice output now
       }
       
       putchar('\n');
       return 0;
    }
    /*
    void getInput(char goal[MAX]) {
       fgets(goal, MAX, stdin);
       if(goal[strlen(goal)-1]=='\n') //overwrite the newline
          goal[strlen(goal)-1]='\0';
    }
    */
    Last edited by Adak; 10-19-2011 at 07:59 PM.

  8. #23
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    That looks really good, Adak. Good luck with that. I'm not really making anything at this point so there's no rush.
    Looks good.

  9. #24
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    Adak,

    Can you fix this code I will post so it uses a array like in your 'Wickedly fast' thread.

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
      
    int main ()
    {
      char *b = malloc(sizeof(char)*1000);
      char *filedata = malloc(300);
      char * pch;
      FILE * fp;
      fp = fopen("worklist.txt", "r");
      /* insert this */
      if(!fp) {  //if fp was given NULL by fopen() meaning the file was not opened
          perror("Error: file worklist.txt was not found or opened");
          return 0;
      }
      printf("Input a short sentence: ");
      gets(b);
      pch = strtok (b," ,.-");
      while (pch != NULL)
      {
      if(strcmp(pch, "dog") == 0){
      // Set pointer to beginning of file:
      fseek( fp, 0L, SEEK_SET );
      // Read data back from file:
      // fscanf(fp, "%[^\n]", filedata);
      while(fgets(filedata, 300, fp))
        {
        if(memcmp(pch, filedata, strlen(pch)) == 0) break;
        else strcpy(filedata, "Not a match!");
        }
        printf("%s\n",filedata);
        }
        pch = strtok (NULL, " ,.-");
      }
      getch();
      return 0;
    }
    That way the console can list more than one word, I hope.

  10. #25
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Thanks. Been awhile since I worked with a word searcher, so it's good practice.

    That while loop in the middle of it, is how you can work through a word list, in sequential order. Just add your strcmp() code to it, and maybe a bit more, and you're good to go for a sequential search.

  11. #26
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    I now have a complete program for modifying words.
    Here is the code:

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <conio.h> // ***
    
    /* Removes the need to add a getch() before every return or exit statement in the program */
    void MyExit(void) { system("pause"); } // ***
    
    int main ()
    {
      /* Declare variables */
      char *b = malloc(sizeof(char)*1000);
      char *filedata = malloc(300);
      char *filedata_2 = malloc(300);
      char * pch;
      FILE * fp;
      FILE * sp;
    
      atexit(MyExit); // ***
      /* Fill the fp variable with a text file and read the text file */
      fp = fopen("worklist.txt", "r");
      sp = fopen("worklist_2.txt", "r");
    
      /* if the file was not opened */
      if(!fp) 
      {  
          perror("Error: file worklist.txt was not found or opened");
          return 0;
      }
    
        if(!sp) 
      {  
          perror("Error: file worklist_2.txt was not found or opened");
          return 0;
      }
      /* Input a sentence and fill the b variable */
      printf("Input a short sentence: ");
      gets(b);
    
      /* Take the b variable and split it into a word per line to fill the pch variable */
      pch = strtok (b," ,.-");
      /* read the pch variable until it is empty */
      while (pch != NULL)
      {
    	  /* Position the files to the start. 
    	  Otherwise the fgets will start reading from the last position when processing the previous word. */
    	  rewind (fp);
    	  /* reads the worklist file and writes the word to the filedata_2 variable. */
    	  if(fgets(filedata_2, 300, sp))
    	  {
      /* Uses the pch data in strlen, to compared the words in the pch and filedata_2 variables, to see if they are the same.  */
                    if(memcmp(pch, filedata_2, strlen(pch)) == 0)
      /* Position the files to the start. 
    	  Otherwise the fgets will start reading from the last position when processing the previous word. */
    			    rewind (sp);
    				{
      /* reads the worklist file and writes the word to the filedata variable. */
    					   while ((fgets(filedata, 300, fp)))
                                    {
      /* Uses the pch data in strlen, to compared the words in the pch and filedata variables, to see if they are the same.  */
                                      if(memcmp(pch, filedata, strlen(pch)) == 0) break;
                                      else strcpy(filedata, "Not a match!");
                                     }
      /* If the data in pch and filedata are the same then print the data in filedata */
                           printf("%s\n",filedata);
    				  }
    	  }
       /* Prevents the result from repeating endlessly */
    			   pch = strtok (NULL, " ,.-");
      }
      return 0;
    }
    You need two text files. One text file has the word with no 'n' or 'adj' etc next to them, just the word itself. One text file has the word and the 'n' or 'adj' beside it.

    Here is my text file contents;
    Code:
    worklist_2.txt
    The
    dog
    doggy
    
    worklist.txt
    The adj
    dog n
    doggy n
    Here is the results of me running the program and entering the words 'The dog':

    Code:
    Input a short sentence: The dog
    The adj
    
    dog n
    
    Press any key to continue . . .
    So now the basic program is finished and now I wonder what to do with it now. I posted this here because you helped me before and now I have the finished version i wanted to show it to you.
    I especially wanted to show it to Adak, so here you go.

  12. #27
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Congrat Jeremy! **

    First off, let me say that I explored the whole Index option in another thread recently. On the strongest PC hardware today, it didn't make any improvement in the search through thousands of words. A regular binary search was just as good, even though it made 30% more comparisons. The cost of making a comparison, is simply too small to make an Indexed search, worthwhile.

    This is the thread: read the last few entries especially.
    Word Searching with an Index

    When an entire novel (A Tale of Two Cities by Dickens), can have each word searched for (and either found or definitely not found), in less than 1/10th of a second, using regular binary search, it makes no sense to try and improve that search.

    I'm going to re-read just what you want to do here, and I'll edit this in a few minutes.

    The rewind() code suggests you're searching for each word in a sequential manner. If you have a huge amount of words, that will be a real waste of computer time.

    Are the words in the word list sorted? If not, we should discuss getting that done, first. Then you can use a much more efficient binary search (very easy to do, code for it is all in my Index thread. At the last code post), and really get your program up to speed.

    Your code is a real accomplishment, but with a few small changes, it will search 100 X faster, or more.

    ** "the" is a what? Depending on usage, it's usually an article
    Last edited by Adak; 10-22-2011 at 08:35 PM.

  13. #28
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    The program I just posted isn't using rewind(sp) correctly. I will try and fix it and post the code after.

  14. #29
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    Well I tried to fix the code but I couldn't and the person who I asked about it decided to just give it to me, telling me to learn what each line does. So I will. When I started to make the program I didn't think it would be hard, and I had it finished but the rewind part got me and had me beat.

    I decided that since Adak likes word programs I would post it here too.
    So here you go. I hang my head in shame and will try not to post in C forums until I know more and the code I was given.

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    
    void MyExit(void) { system("pause"); }
    
    int main ()
    {
      char *b = malloc(sizeof(char)*1000);
      char *filedata_2 = malloc(300);
      char * pch;
      FILE * sp;
    
      atexit(MyExit);
    
      sp = fopen("worklist_2.txt", "r");
    
      if(!sp)
        {
        perror("Error: file worklist_2.txt was not found or opened");
        return 0;
        }
    
      printf("Input a short sentence: ");
      gets(b);
    
      pch = strtok (b," ,.-");
      while(pch != NULL)
        {
        while(fgets(filedata_2, 300, sp))
          {
          if(_memicmp(pch, filedata_2, strlen(pch)) == 0
                && (filedata_2[strlen(pch)] == ' '
                || filedata_2[strlen(pch)] == '\n'))
            {
            break;
            }
          else
            {
            strcpy(filedata_2, pch);
            strcat(filedata_2, " No match!\n");
            }
          }
        printf("%s\n",filedata_2);
        rewind(sp);
        pch = strtok (NULL, " ,.-");
      }
      return 0;
    }
    The contents of the text file:

    The adj
    dog n
    doggy n

    Just run the program and if your entered word the word 'The' for example and it is in the list the modified word will appear on the console screen.

    Sorry to have jumped the horse before I could control it, but it's all done and I can go back to reading the book CommonTater gave me.
    Last edited by jeremy duncan; 10-23-2011 at 07:20 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. reading text-and-numbers file word by word
    By bored_guy in forum C Programming
    Replies: 22
    Last Post: 10-26-2009, 10:59 PM
  2. reading file word by word
    By 98holb in forum C Programming
    Replies: 2
    Last Post: 01-25-2006, 05:49 PM
  3. Reading in a file word by word
    By Bumblebee11 in forum C Programming
    Replies: 4
    Last Post: 06-10-2003, 09:39 PM
  4. open file, search of word, replace word with another
    By Unregistered in forum C++ Programming
    Replies: 0
    Last Post: 06-05-2002, 01:16 PM
  5. Help reading text file word by word
    By Unregistered in forum C++ Programming
    Replies: 6
    Last Post: 05-25-2002, 05:13 PM