I'm not clear on the approach being used in the code. To look for keywords, why are you using isalpha and comparisons with 'A' and 'Z'? You can do it that way, but you might be getting lost in the details. For illustration, I made a direct translation of my algorithm above.
Code:
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#define MAXLINE 100000
const char DELIM[] = ".,! ;?@\n";
char line[MAXLINE] = "";
char line_s[MAXLINE] = "";
int main()
{
    char keyword[] = "hello"; // search for this word
    int lineno = 0;           // track line number

    // In the following comments, these abbreviations apply
    //   L: line, line_s
    //   D: DELIM
    //   W: word
    //   K: keyword

    // while there are more input lines available,
    //   read an input line into L
    while (fgets(line, MAXLINE, stdin) != NULL) {
        lineno++;
        // begin tokenization of L on D
        strcpy(line_s, line);
        char *str = line_s;
        // while there are more tokens in L
        while (true) {
            char *word;
            // read the next token into W
            if ((word = strtok(str, DELIM)) == NULL)
                break; // no more tokens
            str = NULL;
            // if W == K, print L
            if (strcmp(word, keyword) == 0)
                printf("%d: %s", lineno, line);
        }
        // end tokenization of L
    }
    return 0;
}
Replace stdin with the name of your file pointer, and the behaviour should be correct whether your file is 6 MB, 100 MB, gigabytes, or whatever. The only limitation: input lines are assumed to be at most MAXLINE characters long. Keywords that appear on a line longer than this may not be found.