Thread: Word Counting

  1. #1
    Registered User
    Join Date
    Oct 2019
    Posts
    20

    Word Counting

    I have been tasked with the following prompt for a coding project:

    Write a C program in Linux that will read two input files, identify frequent common words that appear
    in both files if their number of occurrences are higher than or equal to the specified frequency limit for
    both files, and write them into an output file in decreasing sorted order based on their total frequency
    value.
    For each input file, your program will build a separate input linked list in ascending sorted order
    based on the word field, where strcmp() is used for comparison while building these input lists.
    Each node of the lists will contain a unique word that exists in that file and the number of occurrence
    of that word within that file. Hence, each node of the list will be a
    struct having fields to store the word itself (char *) and its count (int). You will build the lists
    as doubly linked lists, where nodes have both next and previous pointers.
    After building these two input lists, your program will find the common words in both lists if their
    count is larger than or equal to the specified frequency limit in both files, calculate their total count
    considering both files, and build a third output linked list that will be used for printing the result into
    an output file in decreasing sorted order based on the total count value. You must implement and
    use the insertion sort algorithm to sort the output list. If multiple words have the same count, then
    tie-breaking will be done based on the word field in ascending order, again using the strcmp()
    function.
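    As an illustration of the input-list requirement above, here is a minimal sketch of inserting a word into an ascending doubly linked list, bumping the count when the word is already present. The names `WordNode` and `list_add_word` are assumptions made for this sketch and do not come from the assignment.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout: the assignment only requires a char * word,
   an int count, and next/previous pointers. */
typedef struct WordNode {
    char *word;               /* heap-allocated copy of the word */
    int count;                /* occurrences of the word in one file */
    struct WordNode *prev;
    struct WordNode *next;
} WordNode;

/* Insert `word` into the ascending list at *head, or increment its count
   if it is already there. Returns 0 on success, -1 if malloc fails. */
int list_add_word(WordNode **head, const char *word)
{
    WordNode *prev = NULL;
    WordNode *cur = *head;

    /* Stop at the first node whose word is not less than `word`. */
    while (cur != NULL && strcmp(cur->word, word) < 0) {
        prev = cur;
        cur = cur->next;
    }
    if (cur != NULL && strcmp(cur->word, word) == 0) {
        cur->count++;
        return 0;
    }

    WordNode *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->word = malloc(strlen(word) + 1);
    if (node->word == NULL) {
        free(node);
        return -1;
    }
    strcpy(node->word, word);
    node->count = 1;

    /* Splice the new node between prev and cur. */
    node->prev = prev;
    node->next = cur;
    if (cur != NULL)
        cur->prev = node;
    if (prev != NULL)
        prev->next = node;
    else
        *head = node;
    return 0;
}
```

    Calling `list_add_word(&head, token)` once per token read from a file would keep the list sorted at all times, so no separate sorting pass is needed for the input lists.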

    I have been able to successfully read two separate .txt files and count the words that appear in each, incrementing the count for words that occur multiple times. However, I cannot seem to sort the counts from each file in decreasing order, so that when I compare the two lists and choose a frequency limit, only words that have appeared at least that many times are displayed.
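    For the sorting problem described above, the assignment asks for insertion sort on the output list, in decreasing order of total count with ascending strcmp() order as the tie-breaker. A minimal sketch of one way to do that on a doubly linked list follows; `OutNode`, `comes_before` and `insertion_sort` are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical output node: a word plus its total count over both files. */
typedef struct OutNode {
    char *word;
    int total;
    struct OutNode *prev;
    struct OutNode *next;
} OutNode;

/* Nonzero if `a` must appear before `b`: higher total first,
   ties broken by ascending strcmp() order of the words. */
static int comes_before(const OutNode *a, const OutNode *b)
{
    if (a->total != b->total)
        return a->total > b->total;
    return strcmp(a->word, b->word) < 0;
}

/* Insertion sort: detach nodes one at a time from the unsorted list and
   insert each at its place in a growing sorted list. Returns the new head. */
OutNode *insertion_sort(OutNode *head)
{
    OutNode *sorted = NULL;

    while (head != NULL) {
        OutNode *node = head;
        head = head->next;

        /* Find the first sorted node that `node` should precede. */
        OutNode *prev = NULL;
        OutNode *cur = sorted;
        while (cur != NULL && !comes_before(node, cur)) {
            prev = cur;
            cur = cur->next;
        }

        /* Link `node` between prev and cur. */
        node->prev = prev;
        node->next = cur;
        if (cur != NULL)
            cur->prev = node;
        if (prev != NULL)
            prev->next = node;
        else
            sorted = node;
    }
    return sorted;
}
```

    The caller would replace its head pointer with the returned one, for example `outHead = insertion_sort(outHead);`.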

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Quote Originally Posted by bob432 View Post
    However, I cannot seem to sort the counts from each file in decreasing order, so that when I compare the two lists and choose a frequency limit, only words that have appeared at least that many times are displayed.
    It would seem to me the first sub-task of this is to just find the common words.
    Then worry about the total count.
    Then worry about sorting by count.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
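    The first two sub-tasks suggested above (find the common words, then work out the total count) can be done in one pass, because both input lists are already in ascending strcmp() order. The following is only a sketch under that assumption; `WordNode`, `Common` and `find_common` are names invented for the example, and the results go into a caller-supplied array rather than a third list to keep it short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input-list node (ascending order by word is assumed). */
typedef struct WordNode {
    char *word;
    int count;
    struct WordNode *prev;
    struct WordNode *next;
} WordNode;

/* One common word and its total count over both files. */
typedef struct {
    const char *word;
    int total;
} Common;

/* Walk both sorted lists in step. A word is reported when it appears in
   both lists and its count is >= limit in each. At most `cap` results are
   stored in `out`; the number stored is returned. */
size_t find_common(const WordNode *a, const WordNode *b, int limit,
                   Common *out, size_t cap)
{
    size_t n = 0;

    while (a != NULL && b != NULL && n < cap) {
        int cmp = strcmp(a->word, b->word);
        if (cmp < 0) {
            a = a->next;              /* a's word cannot be in b */
        } else if (cmp > 0) {
            b = b->next;              /* b's word cannot be in a */
        } else {
            if (a->count >= limit && b->count >= limit) {
                out[n].word = a->word;
                out[n].total = a->count + b->count;
                n++;
            }
            a = a->next;
            b = b->next;
        }
    }
    return n;
}
```

    Each match could just as well be appended to the output linked list at the point where the sketch writes into `out`.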

  3. #3
    Registered User
    Join Date
    Sep 2021
    Posts
    1
    Check out this <<snipped>> site.

  4. #4
    Registered User rstanley's Avatar
    Join Date
    Jun 2014
    Location
    New York, NY
    Posts
    1,110
    Quote Originally Posted by rsneha View Post
    Check out this <<snipped>> site.
    From the original post:
    I have been tasked with the following prompt for a coding project:

    Write a C program in Linux that will read two input files, identify frequent common words that appear
    in both files if their number of occurrences are higher than or equal to the specified frequency limit for
    both files, and write them into an output file in decreasing sorted order based on their total frequency
    value.
    Valid site for a simple word count, but this is a forum for the creation of C programs, for assisting O/Ps with their questions, and for advising them on their code. Your suggestion does not help the O/P.

  5. #5
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078
    Quote Originally Posted by bob432 View Post
    I have been able to successfully read two separate .txt files and count the words that appear in each, incrementing the count for words that occur multiple times. However, I cannot seem to sort the counts from each file in decreasing order, so that when I compare the two lists and choose a frequency limit, only words that have appeared at least that many times are displayed.
    What is a 'word' in this context? Taking your text above as an example: is punctuation considered part of the 'word', or are your files only lists of words?

    A better approach would be to use a balanced binary search tree instead of doubly linked lists... But since this is a requirement...

  6. #6
    Registered User rstanley's Avatar
    Join Date
    Jun 2014
    Location
    New York, NY
    Posts
    1,110
    Quote Originally Posted by flp1969 View Post
    What is a 'word' in this context? Taking your text above as an example: is punctuation considered part of the 'word', or are your files only lists of words?

    A better approach would be to use a balanced binary search tree instead of doubly linked lists... But since this is a requirement...
    "The word, word is a word."

    I would assume that whitespace characters and punctuation are not part of any "word". Based on this, my example above should consist of 6 "words", and the count of the word "word" should be 3.
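    To make that assumption concrete, here is a small sketch that splits text on whitespace and a handful of punctuation characters and counts one target word. The function name `count_word` and the exact delimiter set are assumptions for illustration; treating the apostrophe as a separator, for instance, would split "don't" in two.

```c
#include <string.h>

/* Count the tokens in `text` equal to `target`, treating whitespace and
   common punctuation as separators. The comparison is case-sensitive.
   `text` is modified in place by strtok(). If `total` is not NULL, the
   overall number of tokens is stored there. */
int count_word(char *text, const char *target, int *total)
{
    const char *delims = " \t\r\n.,;:!?\"'()";   /* assumed separator set */
    int matches = 0;
    int tokens = 0;

    for (char *tok = strtok(text, delims); tok != NULL;
         tok = strtok(NULL, delims)) {
        tokens++;
        if (strcmp(tok, target) == 0)
            matches++;
    }
    if (total != NULL)
        *total = tokens;
    return matches;
}
```

    Because strtok() writes into its argument, the text has to be in a writable array, for example `char text[] = "The word, word is a word.";`.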

  7. #7
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078
    Quote Originally Posted by rstanley View Post
    "The word, word is a word."

    I would assume that whitespace characters and punctuation are not part of any "word". Based on this, my example above should consist of 6 "words", and the count of the word "word" should be 3.
    Yep... just to demonstrate the point: "The word 'word' is the word.", if not properly dealt with, will yield six distinct "words", each counted once: "The", "word", "'word'", "is", "the" and "word.".

  8. #8
    Registered User
    Join Date
    Oct 2019
    Posts
    20
    Sorry for not replying to my own topic; I thought I would get email updates when someone replied.

    I have made a lot of progress on this program. And yes, each word, whether it be "The", "the.", "One.", "One", "one", etc., is considered a separate word. Punctuation and capitalization keep two words that are otherwise spelled the same from incrementing one another's count, which is how it is supposed to be. I am able to read two input files, listing the total count of each different word next to it. However, I need to be able to take an input value, say 10, and only display words that occur at least 10 times in the files. I also need to sort the output list in ascending order, but I will worry about that once I can use the input value to display only words occurring a specific number of times.

    For better context, I am taking 4 arguments when running the program:
    Code:
        char *inputFileName1 = argv[1];
        char *inputFileName2 = argv[2];
        char *outputFileName = argv[3];
        int wordcount = atoi(argv[4]);    // argv[4] is a string, so convert it (atoi needs <stdlib.h>)
    I need the wordcount argument to be the minimum count a word must reach (greater than or equal to), but for now I just want to understand how to display only the words whose count meets the chosen wordcount value.
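    Since argv[4] arrives as a string, it has to be converted before it can be compared with a count. A minimal sketch using strtol(), which, unlike atoi(), lets the caller detect bad input; the helper name `parse_limit` is hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a non-negative frequency limit from `text`.
   Returns 0 and stores the value in *out on success, or -1 if `text`
   is not a whole base-10 number that fits in an int. */
int parse_limit(const char *text, int *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);

    if (end == text || *end != '\0')
        return -1;                    /* no digits, or trailing junk */
    if (errno == ERANGE || value < 0 || value > INT_MAX)
        return -1;                    /* out of range for this use */

    *out = (int)value;
    return 0;
}
```

    In main(), something like `if (argc != 5 || parse_limit(argv[4], &wordcount) != 0)` could print a usage message and return early.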

    I believe it would fit into this section of code:
    Code:
        outputFile = fopen (outputFileName, "w+");        //first parameter is the output file name
        if (outputFile == NULL)
        {
            printf ("Failed to open output file.. \n");
            return 1;
        } else
        {
            printf ("Successfully opened output file. \n");
        }

        currentWord = wordptr;

        while (currentWord != NULL)
        {  //just test currentWord here
            //add word name then word count to file, then move to next
            fprintf (outputFile, "%s,%d \n", currentWord->str, currentWord->count);
            printf ("%s",currentWord->str);
            currentWord = currentWord->next;
        }

        fclose (outputFile);        //flush and close the output file before exiting
        putchar ('\n');
        return 0;
    Last edited by bob432; 09-07-2021 at 08:23 PM. Reason: Code added
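    One way to show only the words that meet the limit is to test the count inside the output loop before printing. The sketch below reuses the `str`, `count` and `next` field names from the posted code; the struct name `Word` and the function `write_frequent` are assumptions, since the original declarations were not posted.

```c
#include <stdio.h>

/* Assumed node layout, based on the fields used in the posted loop. */
typedef struct Word {
    char *str;
    int count;
    struct Word *next;
} Word;

/* Write "word,count" lines for every node whose count is >= limit.
   Returns the number of lines written. */
int write_frequent(FILE *out, const Word *list, int limit)
{
    int written = 0;

    for (const Word *w = list; w != NULL; w = w->next) {
        if (w->count >= limit) {      /* skip words below the limit */
            fprintf(out, "%s,%d\n", w->str, w->count);
            written++;
        }
    }
    return written;
}
```

    With the variables from the post, the call would look like `write_frequent(outputFile, wordptr, wordcount);`.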

  9. #9
    Registered User rstanley's Avatar
    Join Date
    Jun 2014
    Location
    New York, NY
    Posts
    1,110
    Quote Originally Posted by bob432 View Post
    Sorry for not replying to my own topic; I thought I would get email updates when someone replied.

    I have made a lot of progress on this program. And yes, each word, whether it be "The", "the.", "One.", "One", "one", etc., is considered a separate word. Punctuation and capitalization keep two words that are otherwise spelled the same from incrementing one another's count, which is how it is supposed to be.
    Why do you think it is, "how it is supposed to be"? Where in the instructions does it indicate that punctuation, such as a comma, or a period is part of a "word"??? or where does it indicate that initial capitalization differentiates "One" from "one"? Perhaps you should ask the instructor to clarify what constitutes a "word".

  10. #10
    Registered User
    Join Date
    Oct 2019
    Posts
    20
    Quote Originally Posted by rstanley View Post
    Why do you think it is, "how it is supposed to be"? Where in the instructions does it indicate that punctuation, such as a comma, or a period is part of a "word"??? or where does it indicate that initial capitalization differentiates "One" from "one"? Perhaps you should ask the instructor to clarify what constitutes a "word".
    Apologies, I didn't include the entire instruction document, as it's 5 pages of information and it seemed crazy to include all of it. It was stated in class that any difference between words means they are counted separately; this includes any capitalization and punctuation that may be in each word. So "one" and "One" would be counted separately, as would "word.", "word" and "Word", as each one has punctuation or capitalization differences. A word is any run of letters or symbols not separated by a space, so even "zxcvb" would be considered a word even though it is not in a dictionary.

  11. #11
    Registered User rstanley's Avatar
    Join Date
    Jun 2014
    Location
    New York, NY
    Posts
    1,110
    Quote Originally Posted by bob432 View Post
    Apologies, I didn't include the entire instruction document, as it's 5 pages of information and it seemed crazy to include all of it. It was stated in class that any difference between words means they are counted separately; this includes any capitalization and punctuation that may be in each word. So "one" and "One" would be counted separately, as would "word.", "word" and "Word", as each one has punctuation or capitalization differences. A word is any run of letters or symbols not separated by a space, so even "zxcvb" would be considered a word even though it is not in a dictionary.
    "5 pages of information" for one assignment??? Seriously? One page, at most, should have been sufficient! Your instructor needs to take a course on assignment creation! I also disagree with the inclusion of punctuation in a "word", and possibly capitalization as well. Good luck!
