Thread: Searching for keywords in an external English text file

  1. #1
    Registered User
    Join Date
    Oct 2012
    Location
    Svitavy, Czech Republic
    Posts
    37

    Searching for keywords in an external English text file

    Hello,
    can you please give me some ideas on how to write a program that finds keywords in file.txt (the file is in English)? I don't know which approach to choose or what would do it in the shortest time. What should I do with signs (punctuation) in the text and with spaces?

    Thank you.

  2. #2
    SAMARAS std10093
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Well, a simple solution would be to do the following:
    • Open the file with fopen
    • Read the data of the file into a char buffer with fread. (You cannot know the size of the file in standard C without reading it, so an easy fix is to set the size of your buffer to a reasonable number. Too big and you waste memory; too small and the file does not fit, and reading more than the buffer holds overflows it.)
    • Then everything is in the buffer. Every cell of the buffer has a character. This can be a letter, a space, a sign, whatever...
    • Traverse the buffer to collect the data you like
    • Do not forget to close the file with fclose
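
    A minimal sketch of the steps above, assuming the whole file should end up in memory. Instead of guessing one fixed size, it starts with a small buffer and doubles it with realloc whenever fread fills it, which is a swapped-in variation on the fixed-size buffer described above. The name read_all is hypothetical, not a standard function.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads the rest of the stream into a heap buffer that grows as needed.
   Returns a NUL-terminated buffer (the caller frees it) or NULL on failure.
   read_all is a hypothetical helper name, not a standard function. */
char *read_all(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Fill the free space, keeping one byte for the terminator. */
        size_t n = fread(buf + len, 1, cap - len - 1, fp);
        len += n;
        if (len < cap - 1)
            break;                          /* short read: end of file or error */
        char *tmp = realloc(buf, cap * 2);  /* buffer is full: double it */
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

    A caller would open the file with fopen, pass the FILE pointer to read_all, traverse the returned buffer, then call free on the buffer and fclose on the file.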

  3. #3
    Registered User
    Join Date
    May 2012
    Posts
    505
    Quote: Originally posted by maestorm
    Hello,
    can you please give me some ideas on how to write a program that finds keywords in file.txt (the file is in English)? I don't know which approach to choose or what would do it in the shortest time. What should I do with signs (punctuation) in the text and with spaces?

    Thank you.
    Open the file with fopen().
    Declare a big character buffer, maybe 8192 characters long.
    Call fgets() to read the lines.
    For each line, go through your list of keywords, calling strstr().
    If strstr returns non-null, you have a hit.
    Close the file.

  4. #4
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    Suppose we define the symbols
    K: keyword to be searched
    L: line buffer large enough to hold the longest line
    W: word buffer large enough to hold the longest word
    D: delimiters

    Then here is an algorithm to print each line in the input that contains the keyword

    Code:
    while there are more input lines available
    {
        read an input line into L
        begin tokenization of L on D
        while there are more tokens in L
        {
            read the next token into W
            if W == K
                print L
        }
        end tokenization of L
    }
    To answer the question about "signs and spaces" : tokenization involves choosing separating characters, referred to as "delimiters". For example, let's tokenize the string S
    Code:
    one,two,three four!! five six;seven!? eight nine@ten
    on the delimiters D = [,! ;?@], where the [...] notation denotes a set of characters. Tokenizing S on D would return the tokens

    one
    two
    three
    ...
    ten
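
    The tokenization above can be sketched in C with the standard strtok function. This is only an illustration: split_tokens is a made-up helper name, and it assumes the string is held in a writable array, because strtok modifies it.

```c
#include <string.h>

/* Splits s in place on any character in delims and stores pointers to the
   tokens in out[]. Returns the number of tokens stored (at most max).
   split_tokens is an illustrative name, not a library function. */
size_t split_tokens(char *s, const char *delims, char *out[], size_t max)
{
    size_t n = 0;
    /* strtok overwrites the delimiter after each token with '\0',
       so s must be writable (not a string literal). */
    for (char *tok = strtok(s, delims); tok != NULL && n < max;
         tok = strtok(NULL, delims))
        out[n++] = tok;
    return n;
}
```

    Called on S with the delimiter string ",! ;?@", it yields the ten tokens one to ten. Runs of adjacent delimiters such as "!! " do not produce empty tokens.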
    Last edited by c99tutorial; 12-15-2012 at 02:14 PM.

  5. #5
    Registered User
    Join Date
    Oct 2012
    Location
    Svitavy, Czech Republic
    Posts
    37

    Still can't find the right way to search for keywords in file.txt

    Please help me, I still can't get it right. I made this code:

    Code:
    #include <stdlib.h>
    #include <ctype.h>
    
    int main(void) {
    FILE *fr;
    char s[30];
      
    fr = fopen("file.txt", "r");
        if (!fr) {
            fputs("Cannot open the input file.\n", stderr);
            return 1;
        }
        while (fgets(s, sizeof(s), fr) != NULL) {
        
        int p;
        char array[p];
        for (p=0; p<=30;p++) {
                while (isalpha(s[p]) && isalpha(s[p++])) {
                    s[p] = array[p];
                }
            printf("%c", s[p]);
                while (!(s[p] >= 'A' && s[p] <= 'Z')) {
                    s[p] = s[p+1];
                }
            printf("%c", s[p]);
        }
    
            fputs(s, stdout);
        }
        fclose(fr);
        return 0;
    }
    I am thinking I could do it somehow like this:
    Code:
    do {
      if (isalpha(s[p]) {
        s[p] = pole[p];
        s[p++];
      }
      else s[p++];
    } while (s[p] != EOF);
    I need to finish this today, so please help me, someone. I have problems with buffer overflow and I don't know whether I am thinking about this the right way. I need to see how correct, working code does it, so that I can understand it.

    Thank you.
    Greetings, lost Tom.

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Tom, how many keywords are you searching for in the text?

    How big is the text you're looking through?

    What do you want done with the keywords that are found in the text?

  7. #7
    Registered User
    Join Date
    Oct 2012
    Location
    Svitavy, Czech Republic
    Posts
    37
    I am searching a 6.4 MB text, but it should also work on a 10 MB text, and it should display 10 keywords, that is all. I still don't know how to do it correctly; I am sad and helpless.

    Thank you Adak, if you are going to help me.

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    That's what I'm here for. What do you want done with the keywords, when they're found, though?

    Just count them? Write the lines of text they're found in to a file, or maybe print them to the screen? What are we doing here?

    The search part is quite easy and quick.

  9. #9
    Registered User
    Join Date
    Oct 2012
    Location
    Svitavy, Czech Republic
    Posts
    37
    The search part is my biggest problem. I just want to print the keywords on the screen, thank you.

    Please can you show me the right code?

  10. #10
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Sure, give me a few minutes to rough something up.

  11. #11
    Registered User
    Join Date
    Oct 2012
    Location
    Svitavy, Czech Republic
    Posts
    37
    Thank you. I am going to lunch and I'll be back soon. I am so happy that such great people as you are here.

    happy Tom

  12. #12
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    I'm not clear on the approach being used in the code. To look for keywords, why are you using isalpha and comparisons with 'A' and 'Z'? You can do it this way, but you might be getting lost in details. For illustration I made a direct translation of my above algorithm.

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>
    
    #define MAXLINE 100000
    const char DELIM[] = ".,! ;?@\n";
    char line[MAXLINE] = "";
    char line_s[MAXLINE] = "";
    
    int main()
    {
        char keyword[] = "hello";     // search for this word
        int lineno = 0;                // track line number
    
        // In the following comments, these abbreviations apply
        // L: line, line_s
        // D: DELIM
        // W: word
        // K: keyword
        
        // while there are more input lines available,
        // read an input line into L
        while (fgets(line, MAXLINE, stdin) != NULL) {
            lineno++;
            // begin tokenization of L on D
            strcpy(line_s, line);
            char *str = line_s;
            // while there are more tokens in L
            while (true) {
                char *word;
                // read the next token into W
                if ((word = strtok(str, DELIM)) == NULL)
                    break; // no more tokens
                str = NULL;
                // if W == K, print L
                if (strcmp(word, keyword) == 0)
                    printf("%d: %s", lineno, line);
            }
            // end tokenization of L
        }
        return 0;
    }
    Replace stdin with the name of your file pointer, and the behaviour should be correct on your file, whether it is 6 MB, 100 MB, gigabytes or whatever. The only limitation: input lines are assumed to be shorter than MAXLINE characters. fgets splits a longer line into pieces, so a keyword that falls across a split may not be found.

  13. #13
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Pretty similar to C99's version above. I would consider this to be the basic text searcher.

    Code:
    #include <stdio.h>
    #include <string.h>
    
    #define SIZE 5
    
    int main(void) {
       int i;
       char *pchr=NULL;
       const char words[SIZE][30]={{"every"},{"white"},{"greet"},{"me"},{"small"},};
       char filename[50];
       char line[100];
       FILE *fp;
       printf("Enter a filename in this directory: ");
       fflush(stdout);
       
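       if(scanf("%49s", filename)!=1) {  //width 49 keeps the input inside filename[50]
          printf("Error reading filename!\n");
          return 1;
       }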
       if((fp=fopen(filename, "r"))==NULL) {
          printf("Error opening file!\n");
          return 1;
       }   
    
       while(fgets(line, sizeof(line), fp)!=NULL) {
          for(i=0;i<SIZE;i++) {
             if((pchr=strstr(line, words[i]))!=NULL) {  //case-sensitive substring search: "me" also matches "meet"
                printf("words[i]: %s in: %s\n",words[i],line);
             }
    
          }
       }
       fclose(fp);
       printf("\n");
       return 0;
    }
    The words were taken from this text file: "edelweiss.txt".
    Edelweiss, edelweiss, every morning you greet me.
    Small and white, clean and bright, you look happy to meet me.
    Blossom of snow may you bloom and grow, bloom and grow forever.
    Edelweiss, edelweiss, bless my homeland forever.
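
    One caveat with strstr: it matches substrings, so in the text above "me" is also found inside "meet" and "homeland". If only whole words should count, a sketch like the following could wrap the strstr call. The helper contains_word is hypothetical and not part of the program above, and it assumes a word boundary is any character that isalpha rejects.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if word occurs in line as a whole word, that is, not directly
   preceded or followed by a letter; returns 0 otherwise.
   contains_word is a hypothetical helper, not part of the program above. */
int contains_word(const char *line, const char *word)
{
    size_t len = strlen(word);
    if (len == 0)
        return 0;
    /* Try every occurrence that strstr finds, not only the first one. */
    for (const char *p = strstr(line, word); p != NULL;
         p = strstr(p + 1, word)) {
        int left_ok  = (p == line) || !isalpha((unsigned char)p[-1]);
        int right_ok = !isalpha((unsigned char)p[len]);  /* '\0' also ends a word */
        if (left_ok && right_ok)
            return 1;
    }
    return 0;
}
```

    The search stays case sensitive, so "small" still does not match "Small" at the start of the second line.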

  14. #14
    Registered User
    Join Date
    Oct 2012
    Location
    Svitavy, Czech Republic
    Posts
    37
    But I don't need to find specific keywords. I need to find the words that are repeated most often in an English text (the text contains accents). I need to compare the word counts and print the 10 most frequent words on the screen (I mean words, not conjunctions).

    Could you please put something together? I am helpless. Thank you for your code, I am looking at it.

  15. #15
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    There are word lists that will give this info to you. This is one with the most common 1,000 words in English. Just remove the conjunctions you don't want and you're done, yes?

    1000 most common English words

    The people who make these lists read millions (sometimes billions) of books, magazines, newspapers, internet posts, etc. They're giving you a LOT of information here. I very much doubt that you will want to duplicate their extensive work.

    I built up a lot of my own word lists from books and text on the internet, but when I saw the word lists that were already freely available on-line, I had to say, they did a lot more work than I had.
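
    For anyone who does need to count the words in their own file, here is a minimal sketch of one possible approach, not a definitive implementation. It assumes a word is any run of letters as judged by isalpha, lower-cases every word, and skips a tiny made-up stop list. The names count_words, top_words, NBUCKETS, MAXWORD and STOP are all illustrative. In the default "C" locale, accented letters are not treated as letters and therefore split words.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536   /* hash table size: an arbitrary choice */
#define MAXWORD  64      /* longer words are truncated to 63 letters */

struct entry {
    char word[MAXWORD];
    long count;
    struct entry *next;  /* next entry in the same bucket */
};

static struct entry *table[NBUCKETS];
static size_t ndistinct; /* number of distinct words stored */

/* Tiny illustrative stop list; a real one would be much longer. */
static const char *const STOP[] = { "and", "or", "but", "the", "a", "of", "to" };

static int is_stop_word(const char *w)
{
    for (size_t i = 0; i < sizeof STOP / sizeof STOP[0]; i++)
        if (strcmp(w, STOP[i]) == 0)
            return 1;
    return 0;
}

/* Adds one occurrence of word to the table unless it is a stop word. */
static void add_word(const char *word)
{
    if (is_stop_word(word))
        return;
    unsigned long h = 5381;
    for (const char *p = word; *p != '\0'; p++)
        h = h * 33 + (unsigned char)*p;
    struct entry **bucket = &table[h % NBUCKETS];
    for (struct entry *e = *bucket; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->count++;
            return;
        }
    }
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        exit(EXIT_FAILURE);
    strcpy(e->word, word);   /* safe: callers pass at most MAXWORD-1 letters */
    e->count = 1;
    e->next = *bucket;
    *bucket = e;
    ndistinct++;
}

/* Counts every run of letters in fp, lower-cased, as one word. */
void count_words(FILE *fp)
{
    char word[MAXWORD];
    size_t len = 0;
    int c;
    do {
        c = fgetc(fp);
        if (c != EOF && isalpha(c)) {
            if (len < MAXWORD - 1)
                word[len++] = (char)tolower(c);
        } else if (len > 0) {       /* a non-letter or EOF ends the word */
            word[len] = '\0';
            add_word(word);
            len = 0;
        }
    } while (c != EOF);
}

/* qsort comparator: higher counts first, ties in alphabetical order. */
static int by_count_desc(const void *a, const void *b)
{
    const struct entry *x = *(struct entry *const *)a;
    const struct entry *y = *(struct entry *const *)b;
    if (x->count != y->count)
        return x->count < y->count ? 1 : -1;
    return strcmp(x->word, y->word);
}

/* Stores up to max of the most frequent entries in top[]; returns how many. */
size_t top_words(const struct entry *top[], size_t max)
{
    if (ndistinct == 0)
        return 0;
    struct entry **all = malloc(ndistinct * sizeof *all);
    if (all == NULL)
        return 0;
    size_t n = 0;
    for (size_t i = 0; i < NBUCKETS; i++)
        for (struct entry *e = table[i]; e != NULL; e = e->next)
            all[n++] = e;
    qsort(all, n, sizeof *all, by_count_desc);
    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++)
        top[i] = all[i];
    free(all);
    return n;
}
```

    A caller would open the file with fopen, pass it to count_words, call top_words with an array of 10 pointers, and print each entry's word and count. The hash table keeps each lookup cheap, which matters on a 10 MB text with many distinct words.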
