Thread: Word counting program - help.

  1. #1
    Registered User
    Join Date
    Feb 2014
    Posts
    4

    Word counting program - help.

    Hi everyone,
    New poster here, keen to get involved in some serious C programming. I'm currently working on a program to take ASCII input from a file, and count the occurrences of each individual word within that file. The program should ignore punctuation and treat whitespace as the end of a particular string. It should also convert uppercase letters to lowercase letters to avoid duplicates. All of these strings and their counts should be stored in an array, which should then be sorted and printed to a CSV file.

    This is what I have so far. Bear in mind that this is a partial implementation, and the bits that work are by no means perfect:

    Code:
    #include <stdio.h>#include <string.h>
    #include <ctype.h>
    #include <stdlib.h>
    #include <errno.h>
    
    
    #define TRUE 1
    #define FALSE 0
    
    
    /* This program is designed to take an ASCII input file, count the occurrences of words in it
     * and write an output file displaying the data. I intend for it to convert uppercase to 
     * lowercase, so as not to generate duplicate words in the data structure. It should also 
     * ignore whitespace and punctuation.
    */
    
    
    void getWords(void);
    void printFile(void);
    void save(char *input);
    
    
    struct word {
        char *str;
        int wc;
    };
    
    
    struct word *warray = NULL;
    
    
    FILE *infile;
    FILE *outfile;
    
    
    int elements = 0;
    
    
    void getWords(void)
    {
        
        char cw[100]; // Current word storage
        int fullWord = FALSE;
        int i = 0, j = 0, c;
    
    
        while((c = fgetc(infile)) != EOF)
        {
            if(isalpha(c))
            {
                if(isupper(c))
                {
                    cw[i] = tolower(c);
                    ++i;
                }
                else
                {
                    cw[i] = c;
                    ++i;
                }
                fullWord = TRUE;
            }
            else
            {
                if(fullWord)
                {
                    cw[i] = '\0';
                    i = 0;
                    fullWord = FALSE;
                    save(cw);
                }
            }
            
        }
        
        return;
    }
    
    
    void printFile(void)
    {
    
    
        int i, c;
        
        printf("Printing the file to be counted in lowercase...\n");
        for(i = 0; (c = fgetc(infile)) != EOF; i++)
        {
            if(ispunct(c) || isdigit(c))
            {
                ++i;
            }
            else
            {
                putchar(tolower(c));
            
            }
    
    
        }
        printf("\n");
        rewind(infile);
    }
    
    
    void save(char *input)
    {
    
    
        int i = 0;
    
    
        if(!warray)
        {
            warray = malloc(sizeof(struct word));
            printf("Made array.\n");
        }
        else
        {
            printf("New\n");
            for(i = 0; i < elements; i++)
            {
                printf("in for loop.\n");
                if (strcmp(input, warray[i].str) == 0)
                {
                    printf("exists\n");
                    warray[i].wc++;
                }
                else
                {
                    printf("New element\n");
                    warray = realloc(warray, (elements+1)*sizeof(struct word));
                    printf("Elements = %d\n", elements);
                    warray[elements].str = malloc(strlen(input)+1);
                    strcpy(warray[elements].str,input);
                    warray[elements].wc = 1;
                }
            }
        }
        return;
    }
    
    
    
    
    int main (int argc, char *argv[])
    {
        
    
    
        if (argc < 3)
        {
            puts("Please supply the input filename and desired output filename as arguments.");
            return 1;
        }
        
        infile = fopen(argv[1], "r");
        if(infile == NULL)
        {
            printf("File failed to open. Error: %d\n", errno);
            return 1;
        }
        else
        {
            puts("File opened successfully.");
            printFile();
            getWords();
        }
        printf("%d", elements);
        return 0;
    
    
    }
    The problem I'm having is that my "elements" variable doesn't seem to be incrementing, so the save function also doesn't seem to actually be doing anything. I assumed that the reason for this was the following:

    Code:
    for(i = 0; i < elements; i++)        {
                printf("in for loop.\n");
                if (strcmp(input, warray[i].str) == 0)
                {
                    printf("exists\n");
                    warray[i].wc++;
                }
                else
                {
                    printf("New element\n");
                    warray = realloc(warray, (elements+1)*sizeof(struct word));
                    printf("Elements = %d\n", elements);
                    warray[elements].str = malloc(strlen(input)+1);
                    strcpy(warray[elements].str,input);
                    warray[elements].wc = 1;
                }
            }
    The "elements" variable is set globally to 0, and is supposed to be incremented within the save function. I assumed that this was causing the problem with regards to the above for loop, so I changed the initial value of elements to 1, which resulted in a segmentation fault.

    I'm not exactly sure what to do here, can anyone help me out?

    Thanks in advance

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    The first thing I suggest is to turn all those global variables into local variables, then pass the specific variables needed into the respective functions.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Jan 2009
    Location
    Australia
    Posts
    375
    Welcome to the forum.

    Great start. Good modularity, good indentation.

    Quote Originally Posted by arevans View Post
    Code:
    for(i = 0; (c = fgetc(infile)) != EOF; i++)
        {
            if(ispunct(c) || isdigit(c))
            {
                ++i;
            }
            else
            {
                putchar(tolower(c));
            
            }
    }
    That is kind of a mess and doesn't really make sense. You're not actually using i for anything. What I've been told (and I think it's quite good advice) is to try to only use for loops as bounded loops. That means you should only use them when you know exactly how many iterations they will run for before the loop starts executing. You would be better off using a while loop here. You can just test for when a character is not punct or digit and then you can output it. There is no need for you to increment i or anything.

    Code:
     if(isupper(c))
                {
                    cw[i] = tolower(c);
                    ++i;
                }
                else
                {
                    cw[i] = c;
                    ++i;
                }
    tolower( inChar ) will make inChar lowercase if it is an uppercase letter, otherwise it will just leave it lowercase. You can safely make this code into:
    Code:
    cw[i] = tolower[c];
    ++i;
    fullWord is kind of a bad name for a variable. Your use of it doesn't reflect it's name. You're setting it to true each time you get a character. From the name alone I would expect it to be set each time a full word is read. You can either change the name or change the usage.

    I would make the 100 storage size of your cw array into a preprocessor constant just like you have done with true and false. I would also put a check in your reading loop to stop reading and either save the word or output an error if you go past 99 characters.

    Code:
    void save(char *input)
    {
        int i = 0;
    
    
        if(!warray)
        {
            warray = malloc(sizeof(struct word));
            printf("Made array.\n");
        }
        else
        {
            printf("New\n");
            for(i = 0; i < elements; i++)
            {
                printf("in for loop.\n");
                if (strcmp(input, warray[i].str) == 0)
                {
                    printf("exists\n");
                    warray[i].wc++;
                }
                else
                {
                    printf("New element\n");
                    warray = realloc(warray, (elements+1)*sizeof(struct word));
                    printf("Elements = %d\n", elements);
                    warray[elements].str = malloc(strlen(input)+1);
                    strcpy(warray[elements].str,input);
                    warray[elements].wc = 1;
                }
            }
        }
        return;
    }
    You're initialising elements to 0 and so elements+1 will result in you reallocing an array with only 1 element. Also you are never at any point freeing the data that you are mallocing (and reallocing). You need to do this to prevent memory leaks.

    After you have that fixed you should work on eliminating the global variables in your program. In this particular case you don't really need to have globals, so it should be fairly simple to eliminate them.

    For example:
    Code:
    typedef struct
    {
        struct word
        {
            char *wordStr;
            int occurrences;
        } *fileWords;
        int numWords;
    } WordData;
    
    int main( void )
    {
        WordData wd;
     
        save( &wd );
    
        /* free stuff here */
    
        return 0;
    }
    Once you have eliminated the globals, you should have a nice little program on your hands.
    Last edited by DeadPlanet; 02-23-2014 at 07:07 AM.

  4. #4
    Registered User
    Join Date
    Feb 2014
    Posts
    4
    Quote Originally Posted by DeadPlanet View Post
    You're initialising elements to 0 and so elements+1 will result in you reallocing an array with only 1 element. Also you are never at any point freeing the data that you are mallocing (and reallocing). You need to do this to prevent memory leaks.

    After you have that fixed you should work on eliminating the global variables in your program. In this particular case you don't really need to have globals, so it should be fairly simple to eliminate them.

    For example:
    Code:
    typedef struct
    {
        struct word
        {
            char *wordStr;
            int occurrences;
        } *fileWords;
        int numWords;
    } WordData;
    
    int main( void )
    {
        WordData wd;
     
        save( &wd );
    
        /* free stuff here */
    
        return 0;
    }
    Once you have eliminated the globals, you should have a nice little program on your hands.
    Thanks for the welcome, and the kind words, I appreciate it. I've made most of the changes you suggested, but I am confused about one of the suggestions above.

    If I free the memory I've allocated to hold each string (and its count), won't I lose all the elements that I've saved previously? My thought process was that reallocating extra memory for each new element would prevent that.

  5. #5
    Registered User
    Join Date
    Jan 2009
    Location
    Australia
    Posts
    375
    Yes, you lose everything. You need to free the memory after you no longer require it. Exiting the program doesn't automatically release all the memory to the operating system, or at least you shouldn't assume that it does.

  6. #6
    Registered User
    Join Date
    Feb 2014
    Posts
    4
    Noted. I'm still having trouble with this for loop:

    Code:
            for(i = 0; i < elements; i++)
            {
                printf("in for loop.\n");
                if (strcmp(input, warray[i].str) == 0)
                {
                    printf("exists\n");
                    warray[i].wc++;
                }
                else
                {
                    printf("New element\n");
                    warray = realloc(warray, (++elements)*sizeof(struct word));
                    printf("Elements = %d\n", elements);
                    warray[elements].str = malloc(strlen(input)+1);
                    strcpy(warray[elements].str,input);
                    warray[elements].wc = 1;
                    ++elements;
                }
    I've set elements = 1 this time around, and the program at least enters the for loop, instead of failing instantly. It seems as if the strcmp function is causing a segmentation fault. Terribly confused by this.
    Last edited by arevans; 02-23-2014 at 09:15 AM.

  7. #7
    Registered User
    Join Date
    Jan 2009
    Location
    Australia
    Posts
    375
    Yeah I didn't read close enough the first time. That code makes absolutely no sense.

    Look at it conceptually. The two components of the body of the loop do entirely different things.

    Also you're incrementing elements in 2 places which makes your code exceptionally hard to follow and the for loop is no longer a bounded loop like I was saying. In theory that loop could run forever (and to be honest it probably does on most inputs).

  8. #8
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I suggest that you break down your program into smaller functions that do one thing and do it well, e.g.,
    • Write a function that extracts just one word from the input stream and stores it in a word struct.
    • Write a function that transforms the uppercase letters in the given string into lowercase.
    • Modify the word extraction function to call the uppercase to lowercase function.
    • Write another function that calls the word extraction function and stores the result into the word/count map, extracting all the words in the input, with the assumption that each word only appears once in the input.
    • Write a function that searches the word/count map for the given word, returning its count (with the possibility of 0).
    • Modify the function that populates the word/count map to use the word/count map search function to find if the word extracted already exists. If it does, update the count; otherwise, store the new word into the word/count map with a count of 1.
    • Write a function to write the word/count map to file in CSV format.

    At each step, test to make sure that the function works as expected before going on to the next step.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  9. #9
    Registered User
    Join Date
    Feb 2014
    Posts
    4
    I've finally managed to solve my segmentation fault problems, and I'm now dealing with an infinite loop. Slow but steady progress, though.

    Thank you all for your help, it's very much appreciated!

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Word Counting program not understanding
    By thannara123 in forum C Programming
    Replies: 6
    Last Post: 08-23-2013, 07:36 PM
  2. Strings and file word counting Program
    By hotshotennis in forum C Programming
    Replies: 3
    Last Post: 05-23-2012, 01:28 PM
  3. Word Counting Program
    By psv1120 in forum C Programming
    Replies: 11
    Last Post: 02-23-2011, 03:57 PM
  4. word counting program
    By mashour06 in forum C Programming
    Replies: 2
    Last Post: 06-09-2009, 03:58 AM
  5. Word Counting Program Help
    By rdave1 in forum C++ Programming
    Replies: 1
    Last Post: 09-14-2005, 04:30 PM