Thread: Help with finding word frequency in a text file.

  1. #1
    Registered User
    Join Date
    Apr 2010
    Posts
    6

    Question Help with finding word frequency in a text file.

    I'm pretty new to this forum, although I have run into cprogramming from time to time when Googling. That may make me a lurker. Anyways:
    I need help with a homework assignment, which I haven't been able to make much headway with. No, I am not asking someone to do the whole assignment for me. I'm looking for someone to give me some code that points me in the right direction.

    My Work:
    I have several theories, each of which seems more unlikely than the last. I first created a program to read the entire text file, character by character into terminal. I then made the program count the number of spaces in the text file. It was my hope that I could modify this program to create a list of all the words that appeared in the text file, since it could recognize spaces. Here is my code:
    Code:
    #include <stdio.h>
    
    int spaces = 0;
    
    int main ( int argc, char *argv[] )
    {
        if ( argc != 2 ) 
        {
            printf( "usage: %s filename", argv[0] );
        }
        else 
        {
            FILE *file = fopen( argv[1], "r" );
    
            if ( file == 0 )
            {
                printf( "Could not open file\n" );
            }
            else 
            {
                int x;
    			int i;
    
                while  ( ( x = fgetc( file ) ) != EOF )
                {
                    printf( "%c", x );
    				if (x ==' ')
    			{	
    			spaces++;
    			}
                }
    		
            }
            fclose( file );
    		printf ( "\n\n The number of spaces contained within the file is: %d", spaces ); 
        }
    }
    I don't know if this approach will work or not.
    I also attempted to modify this example code to halt at each space, and concantate the characters found so far into another variable/string "z".

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main() {
      FILE *file;
      char *c; /* make sure it is large enough to hold all the data! */
      char b[200];
      char *z;
      char *d;
      int n;
    
      file = fopen("zzz.txt", "r");
    
      if(file==NULL) {
        printf("Error: can't open file.\n");
        return 1;
      }
      else {
        printf("File opened successfully.\n");
       
    //    n = fread(c, 1, 10, file); /* passing a char array, 
    //                                  reading 10 characters */
    //    c[n] = '\0';               /* a char array is only a 
    //                                  string if it has the
    //                                  null character at the end */
    //    printf("%s\n", c);         /* print out the string      */
    //    printf("Characters read: %d\n\n", n);
    //
    //    fclose(file);          /* to read the file from the beginning, */
    //                           /* we need to close and reopen the file */
    //    file = fopen("numbers.txt", "r");
    	
    	while(1) {     /* keep looping... */
          c = fgetc(file);
          if(c!=" ") {
    	  z = strcat(z,c);
    		printf("%z\n", c);  
    		 }
    	  else {
            continue;     /* ...break when EOF is reached */
          }
    	  if(c==EOF);
    	  {
    	  break; 
          }
    	}
    //    n = fread(d, 1, 10, file);
                /* passing a char pointer this time - 10 is irrelevant */
        
    	
    //	printf("%s\n", d);
    //    printf("Characters read: %d\n\n", n);
    
        fclose(file);
        return 0;
      }
    }

    It was my hope that stopping at each space and appending the letters found so far into a variable/string/char would create a list of words for me - not worrying about whether it was unique yet or not. Then I could write some more code to check how many times the contents of each line appeared uniquely, thus generating my list of words and the number of times they appear.

    Somebody's going to tell me that I'm trying to reinvent maps or structs or that I can't typecast to save my life. In truth, I don't fully understand these functions too well. I happily understand every function in the second chunk of code up there, and I'll cheerfully learn about whichever functions can help me.

    That said, if you give me code, and you do 1, 3, or 4 differently, please explain a little bit. Because I'm familiar with things like fopen, fclose, only.
    1. Open a text file.
    2. Parse the file for words. <--- need help
    3. Print to terminal.
    4. Close the file.

    Any help, including criticism at my approach, is appreciated.

    AA.
    Last edited by aeolusaether; 04-04-2010 at 12:13 PM.

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    So reading in one word at a time is okay. (I.e., reading in character-by-character until you see a space is fine I suppose, but why not use %s with scanf?)

    You need to store the words, and since you don't know how many words there are in the file, arrays are Right Out. You can make it a list, or some sort of hash table, or whatever you want really. But you will need to store them in a reasonable way. (Then later you can sort/count/sort again.)

    If you were in C++, all those "containers" (such as maps and lists) are built-in so you don't have to write them yourself. If you stick with C, then you need to come up with a data structure and the containers.

  3. #3
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Sometimes counting words by counting just spaces will fail.
    Can you see why?
    (hint, hint)

    Taking the char's one by one is fine, but stop at a period, question or exclamation mark, or at a space.

    To save space, you may want to write out all the words to a file, one word per line. When you've done that, just sort the words (easy smeazy to do), and then they'll be duck soup to count, since they'll be right next to each other in the sorted list.

    I disagree with Tabstops answer, completely. This isn't an assignment to make sophisticated containers in C. This is an assignment to use common sense - and no sophisticated containers are necessary, in any language.

    And no, I wouldn't touch %s with scanf() for anything. What will happen the first time a word has a number in it, etc. ?
    Last edited by Adak; 04-03-2010 at 04:57 PM.

  4. #4
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    How would you count the words in this sentence then, if you aren't storing the words some place, Adak?


    Quzah.
    Hope is the first step on the road to disappointment.

  5. #5
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by Adak View Post
    Sometimes counting words by counting just spaces will fail.
    Can you see why?
    (hint, hint)

    Taking the char's one by one is fine, but stop at a period, question or exclamation mark, or at a space.

    To save space, you may want to write out all the words to a file, one word per line. When you've done that, just sort the words (easy smeazy to do), and then they'll be duck soup to count, since they'll be right next to each other in the sorted list.

    I disagree with Tabstops answer, completely. This isn't an assignment to make sophisticated containers in C. This is an assignment to use common sense - and no sophisticated containers are necessary, in any language.
    I'm curious as to why you think a "file" isn't "a container to store things in". It, obviously, wasn't my first choice, but it fits my advice to a T. (EDIT: Although the containers of list or map have an advantage of being useful -- you can sort a list. You can't sort a file, except by, well, putting all those words into a container in memory of some sort (here we go again), sorting that container, and then writing that container to a (same or different) file. You can't delete something from a file (except by making a new file). Etc.)

    Quote Originally Posted by Adak View Post
    And no, I wouldn't touch %s with scanf() for anything. What will happen the first time a word has a number in it, etc. ?
    It will read it as though it was a word. What do you think would happen?
    Last edited by tabstop; 04-03-2010 at 06:26 PM.

  6. #6
    Registered User
    Join Date
    Apr 2010
    Posts
    6
    tabstop, do you think you could provide me with a snippet or an example of using scanf to pull text from a text file? I believe I will be going with a list, since I can't use an array. Mapping seems like it might or might not work.

    Adak, do I have to write a series of if statements and else if statements to incorporate all the possible punctuation marks as well as spaces? Or is there a way to do something like if(c==" " OR "!" OR "?") ... You probably can't use booleans in the middle of a function. I'm probably going to end up writing like five different if statements, then.

    I'm looking these things up myself of course, but I was wondering if anyone could show me how to grab word from a line (after stopping at a space) and stick it into a variable or a list which can be printf'ed and sorted later...
    Last edited by aeolusaether; 04-03-2010 at 06:56 PM.

  7. #7
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by aeolusaether View Post
    tabstop, do you think you could provide me with a snippet or an example of using scanf to pull text from a text file?
    Code:
    fscanf(file_with_things_in_it, "%80s", character_array_of_size_at_least_81_characters);
    80 was chosen somewhat arbitrarily.

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by aeolusaether View Post
    tabstop, do you think you could provide me with a snippet or an example of using scanf to pull text from a text file? I believe I will be going with a list, since I can't use an array. Mapping seems like it might or might not work.

    Adak, do I have to write a series of if statements and else if statements to incorporate all the possible punctuation marks as well as spaces? Or is there a way to do something like if(c==" " OR "!" OR "?") ... You probably can't use booleans in the middle of a function. I'm probably going to end up writing like five different if statements, then.

    I'm looking these things up myself of course, but I was wondering if anyone could show me how to grab word from a line (after stopping at a space) and stick it into a variable or a list which can be printf'ed and sorted later...
    Nothing against booleans, but if you can count to 1 you don't need them, here.

    Code:
        i = num_char = 0;    
        while((c=getc(in))!=EOF) {        //in is the in File pointer          
           if(c == '\n')                            //increment line number
              ++num_lines;                    
           if(isalpha(c) || c == '\'' )  {          //isalpha and apostrophe
              ++num_char;                           //char's are checked, counted
              buff[i++] = c;                        //and put into buff[]
              inWord = 1;                          //we're inside a word 
           }
    buff[] is a previously declared char array, 35 char's long. Holds just one word.

    If you include ctype.h, you will need two if statements. One will have an else part to it.


    @Quzah:
    Did you miss this part of my post?
    To save space, you may want to write out all the words to a file, one word per line.
    @Tabstop:
    Your soap box about C++ containers was unwarranted, and misleading. No, a file is not a C++ container.

    I would not recommend using scanf() to read in and store words from, an unknown text file.
    Last edited by Adak; 04-04-2010 at 01:26 AM.

  9. #9
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by aeolusaether
    Count the number of times each word in a text file appears.
    One of the first things you need to be clear about is what is a word. As Adak has indicated, you might want to treat punctuation as not part of a word... but then would you treat a punctuation mark as a word by itself? What about numbers? Or perhaps you say that a word is simply a sequence of non-whitespace characters.

    Quote Originally Posted by Adak
    I disagree with Tabstops answer, completely. This isn't an assignment to make sophisticated containers in C. This is an assignment to use common sense - and no sophisticated containers are necessary, in any language.
    It so happens that using std::map to solve this in C++ does not involve making a "sophisticated container", and yet is easier and more space efficient than what you suggested, with comparable time complexity.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  10. #10
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    @OP: You may want to look at the is_____ functions in the <ctype.h>/<cctype> header. isalpha() is nice, as is ispunct().

    @Adak: I'll ask it again: once you've written all the words to a separate file, how do you intend to sort that file without reading it all into memory? If you do read it into memory, won't you need some sort of way to store all that data? If you have a way to store that data, why not read it into that object to start with? Is writing a linked list in C really that hard?

  11. #11
    Registered User
    Join Date
    Apr 2010
    Posts
    6
    @tabstop: Going off of what you said about punctuation, here's this:

    Code:
    void remove_punct_and_make_lower_case(char *buf)
    {
        char *src = buf, *dst = buf;
    
        while (*src)
        {
           if (ispunct((unsigned char)*src))
           {
              /* Skip this character */
              src++;
           }
           else if (isupper((unsigned char)*src))
           {
              /* Make it lowercase */
              *dst++ = tolower((unsigned char)*src);
              src++;
           }
           else if (src == dst)
           {
              /* Increment both pointers without copying */
              src++;
              dst++;
           }
           else
           {
              /* Copy character */
              *dst++ = *src++;
           }
        }
    
        *dst = 0;
    }
    @laserlight. Aha, you've given me the key to this assignment....

    Original Question, C++:

    Code:
    int main(int ac, char** av)
    {
    if(ac != 2)
    {
    cout << "Usage: " << av[0] << " filename" << endl;
    return 1;
    }
    cout << "Reading file " << av[1] << endl;
    ifstream f(av[1]);
    
    // read and count words
    istream_iterator<string> i(f);
    multiset<string> s(i, istream_iterator<string>());
    
    // sort by count
    multimap<size_t, string> wordstats;
    for(multiset<string>::const_iterator i = s.begin(); i != s.end(); i = s.upper_bound(*i))
    wordstats.insert( make_pair( s.count(*i), *i ));
    
    // output in decreasing order
    for( multimap<size_t, string>::const_reverse_iterator i = wordstats.rbegin(); i != wordstats.rend(); ++i)
    cout << " word " << i->second << " found " << i->first << " times " << endl;
    }


    Resolved.
    Thank you to tabstop, Adak, laserlight, and quzah.

  12. #12
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by tabstop View Post
    @OP: You may want to look at the is_____ functions in the <ctype.h>/<cctype> header. isalpha() is nice, as is ispunct().

    @Adak: I'll ask it again: once you've written all the words to a separate file, how do you intend to sort that file without reading it all into memory? If you do read it into memory, won't you need some sort of way to store all that data? If you have a way to store that data, why not read it into that object to start with? Is writing a linked list in C really that hard?
    I'll sort it with the system. A lot of people don't know that Windows2000 and up, has special resources for sorting, that user's programs don't have.

    I don't need anything to store, except a file. The file will be the list of names at first, and then later, the file will be the list of names, with the number of times the names are repeated and their frequency in percent, in front of their name.

    The list of names will be sorted twice - first alphabetically, so the repeated names can be easily counted, and then a second time, so the most repeated names will be brought to the top of the list, as the assignment requires.

    You're probably thinking "This is going to take a very long time." But that's wrong - very wrong.

    I just made up a list of 250,000 names (from baby name lists), and had ptime (a pc-tools timer utility program), time the sort. Since the names list was increased in size by copying about 20,000 names, there are LOTS of repetitions, and LOTS of blocks of 50 names or so, already in sorted order (but the whole block is out of sorted order).

    I used this command line on WindowsXP (32 bit):
    ptime sort names1.txt >newnames.txt


    This was the result:
    ptime 1.0 for Win32, Freeware - PC-Tools.Net: Tools and utilities for Windows, Unix/Linux, DOS
    Copyright(C) 2002, Jem Berkes <[email protected]>

    === sort names1.txt ===

    Execution time: 1.576 s

    This is on an E6700 cpu at 2.66 GHz, not overclocked.

    So all the sorting will take just over 3 seconds. You can't touch that with a linked list, and
    note the total lack of any fancy C++ containers.

    It couldn't be simpler, faster, and use less memory.

    I've loaded the names file "names1.txt" to this free hosting site:
    http://www.filedropper.com/names1

    in case you want to give it a try.
    Last edited by Adak; 04-04-2010 at 03:54 PM.

  13. #13
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Ha ha! Oh, that's sheer brilliance there! Why not go all out:
    Code:
    sort names1.txt | uniq -c | sort -g -r > output.txt
    while we're at it?

  14. #14
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    It won't do everything he wants, so some programming will be needed, but nothing you can program will be easier and faster.

  15. #15
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    I doubt your example will do everything he wants either. Your "list of names" doesn't consider punctuation, does it? As a matter of fact, it more than likely consideres "dog" at the end of a sentence a different word than "dog" in the middle of a sentence.

    Edit: However, he hasn't really given us very good guidelines. Is 'dog' different than 'Dog'?


    Quzah.
    Hope is the first step on the road to disappointment.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. I'm not THAT good am I?
    By indigo0086 in forum A Brief History of Cprogramming.com
    Replies: 2
    Last Post: 10-19-2006, 10:08 AM
  2. randomly picking a word out of a text file
    By lilrayray in forum C Programming
    Replies: 11
    Last Post: 08-01-2006, 06:26 PM
  3. C++ std routines
    By siavoshkc in forum C++ Programming
    Replies: 33
    Last Post: 07-28-2006, 12:13 AM
  4. Possible circular definition with singleton objects
    By techrolla in forum C++ Programming
    Replies: 3
    Last Post: 12-26-2004, 10:46 AM
  5. spell check in C using a dictionary file
    By goron350 in forum C Programming
    Replies: 10
    Last Post: 11-25-2004, 06:44 PM

Tags for this Thread