Thread: Counting Number of Words in a Text File Using C

  1. #1
    Registered User
    Join Date
    Sep 2005
    Posts
    26

    Counting Number of Words in a Text File Using C

    Hello,

    I'm trying to develop a program that will enable me to count the number of words in a text file. As a plus, I'd like to be able to count how many different words there are too. I have a decent start on the program, but am quite unsure of where to move from here. I know that I need to malloc space for the array, but am not sure how to. Also, I believe that strlen may come into play.

    Any help would be greatly appreciated.

    Thanks,
    James

    Code:
    #include <stdio.h> 
    #include <stdlib.h> 
    #define MAXWORDS 4000 //less than 4000 total words in the 
    //text file 
    char *word[MAXWORDS]; 
    int wordcount[MAXWORDS]; 
    #define MAXWLEN 30 //no words larger than 30 characters 
    char buff[MAXWLEN]; 
    int nwords, totalwords; 
    main() { 
    int i; 
    while(get_word(buff)) { 
    
    /**** The part where I am stuck on ****/ 
    
    
    } 
    for(i = 0; i < nwords; i++) 
    totalwords += wordcount[i]; //if I keep getting 
    //words, the loop will 
    //continue 
    
    printf("there were %d different words out of %d totalwords\n", 
    nwords, totalwords); 
    } 
    
    //-----ignore the section below, it defines what a word is to the 
    //-----program 
    #include <ctype.h> 
    /* Leave this routine EXACTLY as it stands */ 
    int get_word(char *s) { /* s is where to make the string */ 
    int c; 
    do { /* skip non alpha chars */ 
    c = getchar(); 
    if(c == EOF) 
    return(0); /* end of file marker means no word */ 
    } while(!isalpha(c) && !isdigit(c)); 
    do { /* string is consecutive alpha num */ 
    if(isupper(c)) 
    c = tolower(c); 
    *s++ = c; 
    c = getchar(); 
    } while(isalpha(c) || isdigit(c)); 
    *s = 0; /* null terminate the string */ 
    return(1); /* indicate that I have a word 
    */ 
    }

  2. #2
    CS Author and Instructor
    Join Date
    Sep 2002
    Posts
    511
    You need to open the file first- then write to it.

    A word is delimited by spaces.

    You would also need a counter.

    Mr. C.
    Mr. C: Author and Instructor

  3. #3
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    > You need to open the file first- then write to it.
    Why in the world would he need to write to a file when all he has to do is count the number of words in the file?

  4. #4
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Because this is Mister C we're talking about here, and everyone who knows anything knows not to listen to him.


    Quzah.
    Hope is the first step on the road to disappointment.

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Step 1 is to post code which is readable. It may be readable in your editor, but check the board and you will see an unindented mess.
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #define MAXWORDS 4000           //less than 4000 total words in the
    //text file
    char *word[MAXWORDS];
    int wordcount[MAXWORDS];
    #define MAXWLEN 30              //no words larger than 30 characters
    char buff[MAXWLEN];
    int nwords, totalwords;
    main()
    {
        int i;
        while (get_word(buff)) {
    
        /**** The part where I am stuck on ****/
    
    
        }
        for (i = 0; i < nwords; i++)
            totalwords += wordcount[i];
    
        printf("there were %d different words out of %d totalwords\n",
               nwords, totalwords);
    }
    
    //-----ignore the section below, it defines what a word is to the
    //-----program
    #include <ctype.h>
    /* Leave this routine EXACTLY as it stands */
    int get_word(char *s)
    {                               /* s is where to make the string */
        int c;
        do {                        /* skip non alpha chars */
            c = getchar();
            if (c == EOF)
                return (0);         /* end of file marker means no word */
        }
        while (!isalpha(c) && !isdigit(c));
    
        do {                        /* string is consecutive alpha num */
            if (isupper(c))
                c = tolower(c);
            *s++ = c;
            c = getchar();
        }
        while (isalpha(c) || isdigit(c));
    
        *s = 0;                     /* null terminate the string */
        return (1);                 /* indicate that I have a word */
    }
    Only use spaces for indenting; set your editor to use "spaces for tabs".

    > /**** The part where I am stuck on ****/
    Let's see - you need to have something which
    - searches the array of words you have so far to see if the word exists already
    - if it doesn't, adds it and sets the count to 1
    - if it does, bumps the count
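    The three steps above can be sketched roughly as follows. This is only an illustration: record_word, table_words, table_counts, table_used and TABLE_MAX are made-up names that are not in the original program, and a linear search is simply the easiest thing that works, not the fastest.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_MAX 4000                /* hypothetical limit, mirrors MAXWORDS */

static char *table_words[TABLE_MAX];  /* one malloc'd copy per distinct word */
static int   table_counts[TABLE_MAX]; /* occurrences of table_words[i] */
static int   table_used = 0;          /* number of distinct words stored */

/* Record one occurrence of w. Returns its index, or -1 if out of room. */
int record_word(const char *w)
{
    int i;

    /* search the words seen so far */
    for (i = 0; i < table_used; i++) {
        if (strcmp(table_words[i], w) == 0) {
            table_counts[i]++;        /* already there: bump the count */
            return i;
        }
    }

    /* not found: add it with a count of 1 */
    if (table_used == TABLE_MAX)
        return -1;                    /* table is full */
    table_words[table_used] = malloc(strlen(w) + 1);  /* +1 for the '\0' */
    if (table_words[table_used] == NULL)
        return -1;                    /* out of memory */
    strcpy(table_words[table_used], w);
    table_counts[table_used] = 1;
    return table_used++;
}
```

    Calling record_word(buff) inside the while (get_word(buff)) loop would be one way to fill in the part marked as stuck.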

    > I know that I need to malloc space for the array, but am not sure how to. Also, I believe that strlen may come into play.
    Yes and Yes - so give it a go.
    Plenty of prior examples on the board for the inquisitive.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User
    Join Date
    Sep 2005
    Posts
    26
    Hi everyone,

    Thank you for your replies. First off, sorry for the messy formatting of my program. Secondly, I do not need to open the file within the C program. I will be redirecting the file into the program's standard input from Cygwin (basically a Unix emulator):
    gcc thisprogram.c
    ./a < "text file.txt"

    So all I really need help with is malloc'ing the space and the array setup. I'll take a look around the boards, but in the meantime, any help is, as always, greatly appreciated.

    Thanks,
    James

  7. #7
    Registered User
    Join Date
    Sep 2005
    Posts
    26
    Does anyone have any further input? I've browsed the forums, but did not find any posts that use malloc or strlen the way I plan on using them. There were some posts trying to do what I am (count words), but in a different way that did not help me much.

    Thanks,
    James

  8. #8
    Registered User
    Join Date
    Aug 2005
    Posts
    1,267
    Here is one way to do it, depending on how you define a "word". If all you want to use is stdin, then just replace FILE* fp = fopen("\\tmp\\constitution.txt","r"); with FILE* fp = stdin;

    Code:
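    #include <stdio.h>
    #include <stdlib.h>   /* free */
    #include <string.h>   /* strcmp; strdup is a POSIX function */
    
    #define SZARRAY(x) (sizeof(x)/sizeof((x)[0]))
    
    /* returns 1 if word is already in word_array, 0 otherwise */
    int find_array(const char*word, const char**word_array,int nItems)
    {
    	for(int i = 0; i < nItems; i++)
    	{
    		if(strcmp(word_array[i],word) == 0)
    			return 1;
    	}
    	return 0;
    }
    int main(int argc, char* argv[])
    {
    	char* word_array[1000] = {0}; // room for 1000 strings
    	char word[1024];
    	int word_counter = 0;
    	FILE* fp = fopen("\\tmp\\constitution.txt","r");
    	if(fp == 0)
    	{
    		printf("file cannot be opened\n");
    		return 1;
    	}
    	// check for room before reading, and cap %s so word[] cannot overflow
    	while( word_counter < (int)SZARRAY(word_array) && fscanf(fp,"%1023s",word) == 1)
    	{
    		if( !find_array(word,(const char**)word_array,word_counter))
    		{
    			char* copy = strdup(word); // malloc'd copy of the word
    			if(copy == 0)
    				break; // out of memory
    			word_array[word_counter] = copy;
    			++word_counter;
    		}
    	}
    
    	fclose(fp);
    	printf("unique words = %d\n",word_counter);
    	for(int i = 0; i < word_counter; i++)
    		free(word_array[i]); // release the copies made by strdup
    	return 0;
    }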
    Last edited by Ancient Dragon; 09-26-2005 at 08:35 AM.

  9. #9
    Registered User
    Join Date
    Sep 2005
    Posts
    26
    Thank you everyone for your responses.

    However, I'd like to keep as much of my original code as possible. I need to malloc space because I will be using this with text files of varying sizes (I'm not sure how to do so). Secondly, I need to incorporate strlen, which I am also unsure of. Lastly, the loop to count the words is kind of over my head as well. I volunteer at the Boys and Girls Club and the IT Director wants me to keep it in the form of the code I provided, which he understands (I know, he's a little unqualified but he does his job well).

    Thanks,
    James
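
    For input of varying size, one common approach is to let the arrays themselves grow with realloc instead of fixing MAXWORDS up front. A minimal sketch, assuming a hypothetical struct word_table and helper table_reserve (neither is in the original program):

```c
#include <stdlib.h>

/* Hypothetical growable replacement for the fixed word[] / wordcount[] pair. */
struct word_table {
    char **words;     /* words[i] is a malloc'd string */
    int   *counts;    /* counts[i] is how often words[i] was seen */
    int    used;      /* slots in use */
    int    capacity;  /* slots allocated */
};

/* Make sure there is room for one more entry.
   Returns 1 on success, 0 if memory could not be obtained. */
int table_reserve(struct word_table *t)
{
    int new_cap;
    char **w;
    int *c;

    if (t->used < t->capacity)
        return 1;                                  /* still room */

    new_cap = t->capacity ? t->capacity * 2 : 16;  /* start small, then double */

    w = realloc(t->words, new_cap * sizeof *w);
    if (w == NULL)
        return 0;                                  /* old block is still valid */
    t->words = w;

    c = realloc(t->counts, new_cap * sizeof *c);
    if (c == NULL)
        return 0;
    t->counts = c;

    t->capacity = new_cap;
    return 1;
}
```

    The table has to start out zeroed, for example struct word_table t = {0};, so that the first realloc behaves like malloc. Doubling the capacity keeps the number of reallocations small even for large files.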

  10. #10
    Registered User
    Join Date
    Aug 2005
    Posts
    1,267
    You could use just parts of the code I posted and plug them into your own program. For example, look up what strdup() does and you will see that it uses malloc to allocate the string and then copies the original string into that memory. It does the same thing as this:
    Code:
    char *string = malloc(strlen("Hello")+1);
    strcpy(string,"Hello");
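
    Wrapped up as a function with a check on malloc's return value, the same idea might look like this. copy_word is a made-up name used only for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd copy of s, or NULL if out of memory.
   The caller is responsible for calling free() on the result. */
char *copy_word(const char *s)
{
    size_t n = strlen(s) + 1;    /* strlen does not count the '\0' */
    char *p = malloc(n);

    if (p != NULL)
        memcpy(p, s, n);         /* copies the characters and the '\0' */
    return p;
}
```

    This is where strlen comes into play: it tells malloc how many bytes the word needs, plus one for the terminator.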

  11. #11
    int x = *((int *) NULL); Cactus_Hugger's Avatar
    Join Date
    Jul 2003
    Location
    Banks of the River Styx
    Posts
    902
    Hmm, writing a word counter is easy, but unique words is a bit harder.

    I'd like to make a side note, however, on the definition of a word. In your function, you're using a combination of isdigit() and isalpha() to determine if a character is part of a word. It's a fine start, but what if I wrote, "I am Bob's friend." Since isalpha tests for [A-Za-z] and isdigit for [0-9], the word "Bob's" will appear as two, due to the apostrophe. Periods, commas, and other marks may have similar effects. You must account for multiple whitespace as well. (Such as the sequence ".\n\t", very common in a text document. (Sentence ending, newline, tab/new paragraph.)) I've known some people to use double spaces between sentences. A custom function in place of isalpha() and isdigit() might be needed here.

    Just something to think about.
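
    A rough sketch of such a custom test, assuming the only extra character wanted is the apostrophe. is_word_char and count_words are made-up names, and a quoted 'word' would keep its quote marks under this simple rule.

```c
#include <ctype.h>

/* Hypothetical predicate: letters, digits and the apostrophe are word characters. */
int is_word_char(int c)
{
    return isalnum(c) || c == '\'';
}

/* Count the words in a string using is_word_char (illustration only).
   Any run of non-word characters, however long, separates two words. */
int count_words(const char *s)
{
    int n = 0;
    int in_word = 0;

    for (; *s != '\0'; s++) {
        if (is_word_char((unsigned char)*s)) {
            if (!in_word) {      /* first character of a new word */
                n++;
                in_word = 1;
            }
        } else {
            in_word = 0;
        }
    }
    return n;
}
```

    With this rule "Bob's" stays one word, and sequences such as ".\n\t" or double spaces are treated as a single separator.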
    long time; /* know C? */
    Unprecedented performance: Nothing ever ran this slow before.
    Any sufficiently advanced bug is indistinguishable from a feature.
    Real Programmers confuse Halloween and Christmas, because dec 25 == oct 31.
    The best way to accelerate an IBM is at 9.8 m/s/s.
    recursion (re - cur' - zhun) n. 1. (see recursion)

  12. #12
    Registered User
    Join Date
    Sep 2005
    Posts
    12
    This function returns the actual size of the file. I hope this will help you.
    Code:
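    #include <stdio.h>
    
    /* ftell() returns long, so use long rather than the non-standard ulong */
    long findlengthof(FILE *fp)
    {
    	long start,end;
    	start = ftell(fp);              /* remember the current position */
    	fseek( fp, 0, SEEK_END );
    	end = ftell(fp);                /* offset of the end = size in bytes */
    	fseek( fp, start, SEEK_SET );   /* go back to where we were */
    	return end;
    }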
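
    A variant that reports failure might be worth having, since ftell and fseek do not work on a pipe or a terminal. This is a sketch under the assumption that the stream is a regular file on a system where ftell counts bytes; stream_length is a made-up name.

```c
#include <stdio.h>

/* Hypothetical variant: size of the stream in bytes, or -1 if it cannot be measured. */
long stream_length(FILE *fp)
{
    long start, end;

    start = ftell(fp);
    if (start == -1L)
        return -1L;                       /* e.g. a pipe or a terminal */
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;
    end = ftell(fp);
    if (fseek(fp, start, SEEK_SET) != 0)  /* go back to where we were */
        return -1L;
    return end;
}
```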

  13. #13
    Registered User
    Join Date
    Sep 2005
    Posts
    26
    Hi everyone,

    I've made some progress, but I'm getting incorrect word counts. Can anyone check out my code and see what I might be doing wrong? It seems that it may be breaking out of a loop somewhere. The wordcounts are there on the output, just too low.

    Thanks.

    PS Thanks for all help thus far

    Code:
    #include <stdio.h> 
    #include <stdlib.h> 
    #define MAXWORDS 4000 
    char *word[MAXWORDS]; 
    int wordcount[MAXWORDS]; 
    #define MAXWLEN 30 
    char buff[MAXWLEN]; 
    int nwords, totalwords; 
    main() { 
    int i; 
    while(get_word(buff)) { 
    
    for(i = 0; i < nwords; i++) 
    if(!strcmp(buff, word[i])) 
    wordcount[i]++; 
    
    word[i] = (char *) malloc( strlen(buff) + 1); 
    strcpy(word[i], buff); 
    wordcount[i] = 1; 
    nwords++; 
    } 
    for(i = 0; i < nwords; i++) 
    totalwords += wordcount[i]; 
    printf("there were %d unique words out of %d totalwords\n", 
    nwords, totalwords); 
    } 
    
    //*************I've deleted the code that tells the compiler what a word is, I don't need help on that, and also to make
    // the things I need help with easier to read

  14. #14
    Registered User
    Join Date
    Aug 2005
    Posts
    1,267
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>   // strcmp, strlen, strcpy
    #define MAXWORDS 4000
    char *word[MAXWORDS];
    int wordcount[MAXWORDS];
    #define MAXWLEN 30
    char buff[MAXWLEN];
    int nwords = 0, totalwords = 0;
    int get_word(char *s);   // the unchanged routine from the first post
    int main(void) {
        int i;
        while(get_word(buff)) {
            // this loop checks if the word in buff is already in the
            // word array
            int found = 0;
            for(i = 0; i < nwords; i++)
            {
                if(!strcmp(buff, word[i]))
                {
                    wordcount[i]++;
                    found = 1;
                    break;
                }
            }
            // if not in the array, then insert it
            if( !found )
            {
                if(nwords == MAXWORDS)
                    break;   // the table is full, stop reading
                word[nwords] = malloc( strlen(buff) + 1);   // +1 for the '\0'
                if(word[nwords] == NULL)
                    break;   // out of memory
                strcpy(word[nwords], buff);
                wordcount[nwords] = 1;
                nwords++;
            }
        }
        for(i = 0; i < nwords; i++)
            totalwords += wordcount[i];
        printf("there were %d unique words out of %d totalwords\n",
            nwords, totalwords);
        return 0;
    }

  15. #15
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > the things I need help with easier to read
    You can start with learning how to indent code.

    Or maybe configure your editor to use spaces for indenting (not tabs), so that formatting is preserved when you post your code.

    At the very least, learn to press "preview" before post to make sure your code is presentable.
