Thread: Word occurrence count in C

  1. #1
    Registered User
    Join Date
    Jun 2018
    Posts
    5

    Word occurrence count in C

    I'm trying to write a program that takes the path of one or more .txt files (or of a directory containing such files), analyzes them, and creates a .out file listing every word, one per line, each followed by the number of times it occurs in the file(s). Matching is case-insensitive, and a "word" consists only of alphanumeric characters. Can someone help me, please?
    Thank you

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    36,554
    What have you done so far?

    Can you for example read a text file:
    - and just print what you read.
    - and print "found word" for each valid word in the file.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Jun 2018
    Posts
    5
    Quote Originally Posted by Salem View Post
    What have you done so far?

    Can you for example read a text file:
    - and just print what you read.
    - and print "found word" for each valid word in the file.
    I can read a text file and print what I read, but I don't know how to load multiple files or a directory of files.
    What I was thinking of doing is to print the text file reformatted so that there is one word per line, then put the first word in the .out file and count, with a counter and strcmp, how many times that word repeats in the reformatted file. But I don't know how to reformat the file or how to do this for multiple words (I was thinking of an array of strings, but it doesn't seem to work at all). I also need to convert all the words in the first file to uppercase or lowercase so that strcmp works case-insensitively.
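
    The reformatting step described above (one lowercase word per line, alphanumeric characters only) could be sketched roughly as follows. The helper name split_words is hypothetical and not from the thread, and the sketch assumes plain single-byte text.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper (not from the thread): copies every alphanumeric
 * word from `in` to `out`, lowercased, one word per line.
 * Returns the number of words written. */
size_t split_words(FILE *in, FILE *out)
{
    size_t words = 0;
    int in_word = 0;    /* 1 while we are inside a word */
    int ch;

    assert(in != NULL && out != NULL);

    while ((ch = fgetc(in)) != EOF) {
        if (isalnum(ch)) {          /* fgetc yields unsigned char values */
            fputc(tolower(ch), out);
            in_word = 1;
        } else if (in_word) {       /* first non-alphanumeric ends the word */
            fputc('\n', out);
            in_word = 0;
            words++;
        }
    }
    if (in_word) {                  /* input ended in the middle of a word */
        fputc('\n', out);
        words++;
    }
    return words;
}
```

    Calling split_words(stdin, stdout) would turn standard input into a word-per-line listing, which is then easy to compare with strcmp.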

  4. #4
    Registered User john.c's Avatar
    Join Date
    Dec 2017
    Posts
    344
    How you read the directory depends on which OS you are using. On Linux (and perhaps macOS), this program reads the current directory and prints, in uppercase, the first word of every regular file whose name ends in ".txt".
    Code:
    #define _BSD_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <dirent.h>
    
    void all_toupper(char *s) {
        for ( ; *s; s++) *s = toupper((unsigned char)*s);  // cast avoids UB for negative char values
    }
    
    int main() {
        char word[100];
        struct dirent *e;
        DIR *dir = opendir(".");
        if (dir == NULL) {
            perror("opendir");
            return 1;
        }
    
        while ((e = readdir(dir)) != NULL) {
    
            struct stat st;
            if (stat(e->d_name, &st) == -1) {
                printf("Can't stat %s\n", e->d_name);
                continue;
            }
    
            if (S_ISREG(st.st_mode)) {  //if (e->d_type == DT_REG) {
                char *p = strrchr(e->d_name, '.');
                if (p && strcmp(p, ".txt") == 0) {
                    printf("%s\n", e->d_name);
                    FILE *f = fopen(e->d_name, "r");
                    if (f == NULL) {
                        perror(e->d_name);
                        continue;
                    }
                    if (fscanf(f, "%99s", word) == 1) {  // skip files with no word
                        all_toupper(word);
                        printf("    %s\n", word);
                    }
                    fclose(f);
                }
            }
        }
    
        closedir(dir);
        return 0;
    }
    As for keeping track of the words, a balanced binary tree (such as a C++ std::map if you can switch to C++) would be perfect.
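
    For anyone staying in plain C, a tree-based counter could look roughly like the sketch below. Note that this is a plain, unbalanced binary search tree rather than the balanced tree suggested above, so sorted input degrades it to a list. The names node, tree_add and tree_count are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type: one node per distinct word. */
typedef struct node {
    char *word;
    int count;
    struct node *left, *right;
} node;

/* Adds `word` to the tree or increments its count if already present.
 * Returns the node for the word, or NULL if memory runs out. */
node *tree_add(node **root, const char *word)
{
    assert(root != NULL && word != NULL);

    while (*root != NULL) {
        int cmp = strcmp(word, (*root)->word);
        if (cmp == 0) {
            (*root)->count++;
            return *root;
        }
        root = cmp < 0 ? &(*root)->left : &(*root)->right;
    }

    node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->word = malloc(strlen(word) + 1);
    if (n->word == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->word, word);
    n->count = 1;
    n->left = n->right = NULL;
    *root = n;          /* hook the new leaf into the tree */
    return n;
}

/* Returns how many times `word` was added, or 0 if it never was. */
int tree_count(const node *root, const char *word)
{
    while (root != NULL) {
        int cmp = strcmp(word, root->word);
        if (cmp == 0)
            return root->count;
        root = cmp < 0 ? root->left : root->right;
    }
    return 0;
}
```

    An in-order walk of such a tree visits the words in strcmp order, which would give a sorted .out file without a separate sort step.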
    Simplicity is the ultimate sophistication.

  5. #5
    Registered User
    Join Date
    Jun 2018
    Posts
    5
    I tried to solve this problem with the code below, but execution never enters the second while loop and I don't understand why. Any tips?

    Code:
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    //READ LINE BY LINE
    
    int main(void)
    {
        FILE * fp;
        FILE *fp2;
        FILE * fs;
        char * line = NULL;
        char * line2 = NULL;
        size_t len = 0;
        size_t len2 = 0;
        size_t read;
        size_t read2;
        int contatore=0;
        int inserimento;
        
    
        fp = fopen("swordx.out", "r"); //input
        fp2 = fopen("dio.txt", "r+");// output - in
        
        if (fp == NULL)
            exit(EXIT_FAILURE);
           read = getline(&line, &len, fp);
           fgets(line, 0, fp); // first row in the new file so i can compare
           fputs(line,fp2);
           
            
        while((    read = getline(&line, &len, fp))!= -1){
           
               inserimento=1; // like a boolean 
           
        
           
    
        while((    read2 = getline(&line2, &len2, fp2)) != -1){
            if (strcmp(line2,line) == 0){
          printf("The strings are equal.\n");
           printf("%s",line2);
             inserimento=0;
           
         }
       else{
          printf("The strings are not equal.\n");
           printf("%s",line2);
           
         
       }
    }// second while
      if(inserimento==1){ // if it has no duplicate in the output file we can write on it
          fgets(line, 0, fp);
          fputs(line,fp2);}
         
          
          
        }// first while
    
        fclose(fp);
        fclose(fp2);
        
        
        if (line)
            free(line);
            
            if (line2)
            free(line2);
          
        exit(EXIT_SUCCESS);
        
    
    }

  6. #6
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    36,554
    Well first of all, try to indent your code better so you can see what is going on.
    Code:
    #define _GNU_SOURCE
    #include <stdio.h> 
    #include <stdlib.h> 
    #include <string.h>
      //READ LINE BY LINE
    
    int main(void) {
        FILE * fp;
        FILE * fp2;
        FILE * fs;
        char * line = NULL;
        char * line2 = NULL;
        size_t len = 0;
        size_t len2 = 0;
        size_t read;
        size_t read2;
        int contatore = 0;
        int inserimento;
    
        fp = fopen("swordx.out", "r"); //input
        fp2 = fopen("dio.txt", "r+"); // output - in
    
        if (fp == NULL)
          exit(EXIT_FAILURE);
        read = getline( & line, & len, fp);
        fgets(line, 0, fp); // first row in the new file so i can compare
        fputs(line, fp2);
    
        while ((read = getline( & line, & len, fp)) != -1) {
    
          inserimento = 1; // like a boolean 
    
          while ((read2 = getline( & line2, & len2, fp2)) != -1) {
            if (strcmp(line2, line) == 0) {
              printf("The strings are equal.\n");
              printf("%s", line2);
              inserimento = 0;
    
            } else {
              printf("The strings are not equal.\n");
              printf("%s", line2);
    
            }
          } // second while
          if (inserimento == 1) { // if it has no duplicate in the output file we can write on it
            fgets(line, 0, fp);
            fputs(line, fp2);
          }
    
        } // first while
    
        fclose(fp);
        fclose(fp2);
    
        if (line)
          free(line);
    
        if (line2)
          free(line2);
    
        exit(EXIT_SUCCESS);
    
    }
    There are two main problems with your code.
    > while ((read2 = getline( & line2, & len2, fp2)) != -1)
    This reads to the end of the file, which means that if you want to read the file again, you need to call rewind() to get back to the beginning of the file.

    > fputs(line, fp2);
    After you do this, you need to call fflush() before you try to read from the file again.
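
    A rough sketch of how those two points might fit together on a single stream opened for update is shown below. The helper name append_if_missing is hypothetical, it uses fgets with a fixed buffer instead of getline, and it assumes every line fits in that buffer and ends in '\n'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the thread). `f` must be open for update
 * ("r+" or "w+") and `line` must end in '\n'.
 * Returns 1 if the line was appended, 0 if it was already present,
 * -1 on an I/O error. Lines longer than the buffer are not handled. */
int append_if_missing(FILE *f, const char *line)
{
    char buf[256];

    assert(f != NULL && line != NULL);

    rewind(f);                      /* start every scan from the beginning */
    while (fgets(buf, sizeof buf, f) != NULL) {
        if (strcmp(buf, line) == 0)
            return 0;               /* duplicate: nothing to write */
    }

    /* Switching from reading to writing: reposition the stream first. */
    if (fseek(f, 0L, SEEK_END) != 0)
        return -1;
    if (fputs(line, f) == EOF)
        return -1;

    /* Flush so that the next read on this stream sees the new line. */
    return fflush(f) == EOF ? -1 : 1;
}
```

    Rescanning the whole output file for every input word is quadratic in the number of distinct words, so keeping the words in memory is likely to scale better for large inputs.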

  7. #7
    Registered User
    Join Date
    Jun 2018
    Posts
    5
    Here I am again. I changed everything and used a linked list. The program now has one big problem: if two or more identical words appear next to each other, each occurrence is written out separately instead of being counted as one word.

    For example: INPUT = "Hi my name is sciroppino hi my name is sciroppino" OUTPUT = "hi 2, my 2, name 2, is 2, sciroppino 2" (CORRECT)
    INPUT = "Hi hi my my name name is is sciroppino sciroppino" OUTPUT = "hi 1, hi 1, my 1, my 1, name 1, name 1 ..." (INCORRECT)

    Here is the code:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>
    
    typedef struct s_words {
        char *str;                  //word
        int count;                  //number of times word occurs
        struct s_words *next;       //pointer to next word
    } words;
    
    words *create_words (char *word)
    {
        //+allocate space for the structure
        printf ("%zu ", strlen (word));     /* strlen returns size_t, so %zu */
        words *newWord = malloc (sizeof (words));
        if (NULL != newWord) {
            //+allocate space for storing the new word in "str"
            //+if str was array of fixed size, storage wud be wasted
            newWord->str = (char *) malloc ((strlen (word)) + 1);
            strcpy (newWord->str, word);    //+copy “word” into newWord->str
            newWord->str[strlen (word)] = '\0';
            printf (" Create: %s ", newWord->str);
            //+initialize count to 1;
            newWord->count = 1;
            //+initialize next;
            newWord->next = NULL;
        }
        return newWord;
    }
    
    words *add_word (words **wordList, char *word)
    {
        if (!*wordList) {       /* handle EMPTY list */
            printf ("NEW LIST\n");
            return *wordList = create_words (word);
        }
    
        words *temp = *wordList;
        //+ search if word exists in the list; if so, make found=1
        while (temp->next != NULL) {    /* iterate while temp->next != NULL */
    
            if (strcmp (temp->str, word) == 0) {    //+use strcmp command
                temp->count = temp->count + 1;      //+increment count;
                return *wordList;
            }
            else
                temp = temp->next;  //+update temp
        }
        words *newWord = create_words (word);
        if (NULL != newWord) {  /* insert at TAIL of list */
            temp->next = newWord; 
            printf (" NEW WORD: %s\n ", newWord->str);
        }
        return newWord;
    }
    
    int main (int argc, char *argv[]) {
    
        words *mywords;             //+head of linked list containing words
        mywords = NULL;
        char *delim = ". ,:;\t\n";
    
        FILE *myFile;
        FILE *myOutput;
    
        char *filename = argv[1];
        char *outputfile = argv[2];
    
        if (argc != 3) {
            fprintf (stderr, "error: insufficient input. usage: %s ifile ofile\n",
                    argv[0]);
            return 1;
        }
    
        myFile = fopen (filename, "r");     //+first parameter is input file
        if (myFile == 0) {
            printf ("file not opened\n");
            return 1;
        } else {
            printf ("file opened \n");
        }
    
        //+start reading file character by character;
        //+when word has been detected; call the add_word function
    
        int ch = 0, word = 1, k = 0;
        char thisword[100];
        while ((ch = fgetc (myFile)) != EOF) {  /* for each char    */
            if (strchr (delim, ch)) {           /* check if delim   */
                if (word == 1) {    /* if so, terminate word, reset */
                    word = 0;
                    thisword[k] = '\0';
    
                    printf ("\nadd_word (mywords, %s)\n", thisword);
                    /* do NOT overwrite list address each time,
                     * you must send ADDRESS of list to add_word
                     * to handle EMPTY list case.
                     */
                    if (add_word (&mywords, thisword))
                        printf (" added: %s\n", mywords->str);
                    else
                        fprintf (stderr, "error: add_word failed.\n");
    
                    k = 0;
                }
            }
            else {  /* if not delim, add char to string, set word 1 */
                word = 1;
                thisword[k++] = tolower (ch);   /* make ch lowercase */
            }
        }
        if (word == 1) {    /* handle non-POSIX line-end */
            thisword[k] = '\0';
            //add thisword into the list
            printf ("\nadd_word (mywords, %s) (last)\n", thisword);
            if (add_word (&mywords, thisword))  /* same comment as above */
                printf (" added: %s\n", mywords->str);
            else
                fprintf (stderr, "error: add_word failed.\n");
        }
    
        words *currword;
        printf ("printing list\n");
    
        //+Traverse list and print each word and its count to outputfile
        //+output file is second parameter being passed
    
        myOutput = fopen (outputfile, "w+");        //+second parameter is output file
        if (myOutput == 0) {
            printf ("output file not opened \n");
            return 1;
        } else {
            printf ("output file opened \n");
        }
    
        currword = mywords;
    
        while (currword != NULL) {  /* just test currword here */
            //add word name then word count to file, then move to next
            fprintf (myOutput, "%s %d \n", currword->str, currword->count);
            printf ("%s ", currword->str);
            currword = currword->next;
        }
    
        putchar ('\n');
        return 0;
    }
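
    The adjacent-duplicate symptom described above is consistent with the search loop in add_word, which runs while temp->next != NULL and therefore never compares the new word against the last node in the list. When a word is repeated immediately, its first occurrence is still that last node, so a second node gets appended. A sketch of a traversal that also checks the tail is shown below, using simplified hypothetical names (entry, list_add) rather than the original ones.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the `words` struct in the post. */
typedef struct entry {
    char *str;
    int count;
    struct entry *next;
} entry;

/* Increments the count of `word` if it is anywhere in the list (tail
 * included), otherwise appends a new node.
 * Returns the node, or NULL if memory runs out. */
entry *list_add(entry **head, const char *word)
{
    entry **link = head;    /* address of the pointer we may need to patch */

    assert(head != NULL && word != NULL);

    for (; *link != NULL; link = &(*link)->next) {
        if (strcmp((*link)->str, word) == 0) {
            (*link)->count++;
            return *link;
        }
    }

    entry *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->str = malloc(strlen(word) + 1);
    if (e->str == NULL) {
        free(e);
        return NULL;
    }
    strcpy(e->str, word);
    e->count = 1;
    e->next = NULL;
    *link = e;              /* sets the head for an empty list, else the tail */
    return e;
}
```

    Walking a pointer-to-pointer also removes the need for a separate empty-list branch, because *link is the head pointer itself when the list is empty.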

  8. #8
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    25,895
    Eh, why did you change the code to use linked lists? It seems to me that an array would have sufficed, if necessary a dynamic array, and if you wanted something more sophisticated, you might have gone with a hash table instead. Linked lists don't seem to have any advantage here.
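
    As a rough illustration of the dynamic-array option, the sketch below grows the array with realloc and does a linear search per word. The type and function names (slot, table, table_add) and the starting capacity of 16 are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types: one slot per distinct word. */
typedef struct {
    char *word;
    int count;
} slot;

typedef struct {
    slot *items;
    size_t len;     /* slots in use */
    size_t cap;     /* slots allocated */
} table;

/* Counts one occurrence of `word`.
 * Returns the word's new count, or -1 if memory runs out. */
int table_add(table *t, const char *word)
{
    assert(t != NULL && word != NULL);

    for (size_t i = 0; i < t->len; i++) {
        if (strcmp(t->items[i].word, word) == 0)
            return ++t->items[i].count;
    }

    if (t->len == t->cap) {         /* full: double the capacity */
        size_t cap = t->cap ? t->cap * 2 : 16;
        slot *p = realloc(t->items, cap * sizeof *p);
        if (p == NULL)
            return -1;
        t->items = p;
        t->cap = cap;
    }

    char *copy = malloc(strlen(word) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, word);
    t->items[t->len].word = copy;
    t->items[t->len].count = 1;
    t->len++;
    return 1;
}
```

    A zero-initialized table (table t = {0};) is a valid empty table here, and the same table can be reused across many input files so that the counts accumulate.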
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  9. #9
    Registered User
    Join Date
    Jun 2018
    Posts
    5
    Quote Originally Posted by laserlight View Post
    Eh, why did you change the code to use linked lists? It seems to me that an array would have sufficed, if necessary a dynamic array, and if you wanted something more sophisticated, you might have gone with a hash table instead. Linked lists don't seem to have any advantage here.
    I had some problems with the counters when working with a lot of files and directories.
