Simple help (STRCMP, word counts etc)

**mariojoshi** · 12-08-2004

Hey everybody, I understand that this obviously isn't a place to get your homework done, and I've read the thread about it too.

I'm currently doing a uni course with some basic C, but a bit of it is going over my head.

My task so far has been to write a program that calculates the number of words in a text file, and I've managed this OK...

Code:

/* Document Analyser
	Author - JJ Singer
	Version 1.0
	Build Date - 02/12/04

	A program that asks the user for the name 
	of a text file to read, reads it and counts 
	the number of words in the file.  It then 
	will display this number to the user */


#include <stdio.h> //includes input/output command library


void main ()

{	// VARIABLES DEFINED

	int counter = 0, fileend; 
	// counter: stores the number of words in the file (starts at zero),
	// fileend: holds the return value of fscanf, which is EOF once the file has ended
	char content[300], filename[256];
	// content: buffer that holds each word read from the file
	// filename: buffer that holds the file name typed in by the user
	
	FILE *filein;
	// give the program a file pointer to the filein stream

	printf("***************************\n");
	printf("*                         *\n");
	printf("*   DOCUMENT ANALYSER     *\n");
	printf("*   Author - JJ Singer    *\n");
	printf("*      Version 1.0        *\n");
	printf("*  Build Date - 02/12/04  *\n");
	printf("*                         *\n");
	printf("***************************\n\n\n");
	//header 



	printf ("Please type in the name of the file you wish to read and press enter\n\n");
	//displays the prompt on screen 
	scanf ("%255s",filename);
	// reads the file name entered; the width stops it overrunning the filename buffer

	filein=fopen(filename,"r");
	//opens the file stream of the required file
	if (filein==NULL)
	{
		printf("Could not open %s\n",filename);
		return;
	}
	//stops here if the file could not be opened
	fileend=fscanf(filein, "%299s",content);
	// reads the first word; fscanf returns EOF once the end of the file is reached

	while(fileend!=EOF) // loop - counts the number of words in the file
						// while the end of the file is not reached
	{
		counter++;
		fileend=fscanf(filein,"%299s",content);
	}

	printf("Analysing document..................\n\n\n");
	printf("This file contains %d words\n\n",counter);
	// displays the word count on screen 

	fclose(filein);
	// closes the file stream


}

The part I am having problems with is the next task, which builds on the current code: I'm required to count the number of UNIQUE words in the text file and give a total of these.

I think the way I should do this is:

Use for loops (possibly 2?) to compare pairs of strings (strcmp function), with a counter that increases by one each time a non-unique word is found. The unique total is then the (general) total minus that counter.

However, I'm not sure where or how to implement this section... as I say, it's going over my head a little bit.

Could anyone shed some light on this matter for me?

PS (I heard I might need to use a "to lower / to upper" function... however this may be some variable instead... not sure)

Cheers for any help.
Josh.

**itsme86** · 12-08-2004

What you can do is add an array of strings to your program. Every time you read a word in from your file, loop through that new array and see if the string has already been stored there. If it hasn't, store the string you just read from the file in the array. At the end, display all the strings in the array.
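As a rough sketch of that idea, assuming a fixed upper limit on the number of distinct words: the names `MAX_UNIQUE`, `lower_word`, `already_seen` and `count_unique` below are made up for illustration and are not from the original program. Each word is lower-cased first so that "The" and "the" count as the same word.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_UNIQUE 1000 /* assumed upper limit on distinct words */
#define MAX_WORD   300  /* same size as content[300] in the original program */

/* Lower-case a word in place so "The" and "the" compare equal. */
static void lower_word(char *word)
{
    for (; *word != '\0'; word++)
        *word = (char)tolower((unsigned char)*word);
}

/* Return 1 if word is already stored in seen[0..count-1], otherwise 0. */
static int already_seen(char seen[][MAX_WORD], int count, const char *word)
{
    int i;

    for (i = 0; i < count; i++)
        if (strcmp(seen[i], word) == 0)
            return 1;
    return 0;
}

/* Read whitespace-separated words from an open stream.
   Stores the overall word count in *total and returns the unique count.
   New words are no longer recorded once MAX_UNIQUE have been stored. */
int count_unique(FILE *in, int *total)
{
    static char seen[MAX_UNIQUE][MAX_WORD]; /* static: large for a stack array */
    char word[MAX_WORD];
    int unique = 0;

    assert(in != NULL && total != NULL);
    *total = 0;
    while (fscanf(in, "%299s", word) == 1) {
        (*total)++;
        lower_word(word);
        if (unique < MAX_UNIQUE && !already_seen(seen, unique, word)) {
            strcpy(seen[unique], word); /* fits: both buffers are MAX_WORD long */
            unique++;
        }
    }
    return unique;
}
```

In the original program this could be called with the opened `filein` in place of the counting loop. Note that `%s` keeps punctuation attached, so "cat" and "cat," would still count as two different words unless the punctuation is stripped as well.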

**Thantos** · 12-08-2004

void main ()

http://faq.cprogramming.com/cgi-bin/...&id=1043284376

printf("* Build Date - 02/12/04 *\n");

And you are just now asking questions almost 10 months later?

Now to the fun part

Have you been given any idea of how many unique words may be present in the file? If not, then I would keep a linked list of words. Every time you find a unique word you add it to the list. To test for a unique word you compare it against the words already in the list.

**itsme86** · 12-08-2004

Originally Posted by Thantos

printf("* Build Date - 02/12/04 *\n");

And you are just now asking questions almost 10 months later?

I was assuming DD/MM/YY

**Thantos** · 12-08-2004

Hmmm you might be right

**mariojoshi** · 12-08-2004

Just to put this straight... I'm in the UK... where the standard date format is dd/mm/yy!

ritey.. will get on with reading your replies

**Thantos** · 12-08-2004

Dang UKers

**Thantos** · 12-08-2004

OK, received this PM from Esto (New Member with 0 posts). Since it pertains to this post, I'm replying here.

Hey there, I was just reading your reply to the word frequency count. Surely the linked list would become corrupted once the file has been used?

Sorry to come across as if I'm criticising, I'm just intrigued as to how you would create a word frequency counter of a text file in C.

If you can, show us what you mean.

Thanks,

Pete (Rookie but keen Programmer)

Well, first, it's not a word frequency count, as we don't care how many times a word shows up (though it would be super easy to make it do that).

I won't write the entire thing but:

Code:

#include <stdlib.h> /* malloc, free */
#include <string.h> /* strlen, strcpy */

struct Node {
	struct Node *next;
	struct Node *prev;
	char *word; /* the stored word, allocated in addnode() */
	/* unsigned freq; */ /* in case we want to add that ability in */
};

/* return will be NULL on error or the new tail */
/* pass tail as NULL to start a new list */
struct Node * addnode ( struct Node *tail, char *word )
{
	size_t len=0;
	struct Node *temp = NULL;
	temp = malloc (sizeof(struct Node));
	if ( temp == NULL )
		return NULL;

	len = strlen(word);
	temp->word = malloc(len + 1); /* +1 for null char */
	if ( temp->word == NULL )
	{
		free(temp);
		return NULL;
	}
	strcpy(temp->word, word);
	/* strcpy() should be ok here since there isn't a chance to overrun the array */
	if ( tail != NULL )
		tail->next = temp;
	temp->prev = tail;
	temp->next = NULL;
	/* temp->freq = 1; */ /* if we are including that */
	return temp;
}

Warning: I have not compiled nor tested the above code. There may be an error in there

Now when you destroy the list at the end of the program you'll have to free the word before freeing the Node.
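A minimal sketch of the lookup and the cleanup, assuming the `word` member is a `char *` and that the list is walked from its head through `next`. The names `find_word` and `free_list` are hypothetical helpers, not part of the code above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct Node {
    struct Node *next;
    struct Node *prev;
    char *word; /* assumed to be a malloc'd copy of the word */
};

/* Return the node holding word, or NULL if the word is not in the list yet. */
struct Node *find_word(struct Node *head, const char *word)
{
    assert(word != NULL);
    for (; head != NULL; head = head->next)
        if (strcmp(head->word, word) == 0)
            return head;
    return NULL;
}

/* Free every node from head onward and return how many nodes were freed. */
size_t free_list(struct Node *head)
{
    size_t freed = 0;

    while (head != NULL) {
        struct Node *next = head->next; /* save the link before freeing */
        free(head->word);               /* the word first... */
        free(head);                     /* ...then the node that owned it */
        head = next;
        freed++;
    }
    return freed;
}
```

A word read from the file is unique when `find_word` returns NULL, and that is the moment to add it to the tail. The number of nodes in the list at the end is the unique word count.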

**quzah** · 12-08-2004

Make it a hash table for faster lookups. Use an array of linked lists, and use the word length for the hash.

Quzah.

**itsme86** · 12-08-2004

Originally Posted by quzah

Make it a hash table for faster lookups. Use an array of linked lists, and use the word length for the hash.

Quzah.

Allow for word lists in excess of 4GB, add unicode support, and do it without using strings too?

**Thantos** · 12-08-2004

Well of course the linked lists in the hash should be kept in alphabetical (sp?) order, but in the ancient Greek alphabet

**quzah** · 12-08-2004

I was actually being serious. If you're storing a file of unknown size, and storing all unique words, and you need to continually check each new word read with the entire list to see if it's unique or not...

Code:

List *table[SOMESIZE];
List *ptr, *nextptr;
char buf[BUFSIZ];

/* ...read word into buffer... */
/* ...smash case of buffer... */
/* ...if this slot of the table is still empty, the word starts that list... */
for( ptr = table[ strlen( buf ) % SOMESIZE ]; ptr; ptr = nextptr )
{
    nextptr = ptr->next;
    if( strcmp( ptr->word, buf ) == 0 )
    {
        /* ...word is not unique so stop looking... */
        break;
    }
    else
    if( ptr->next == NULL )
    {
        /* ...stick this word onto this list... */
    }
    /* else, we're not at the end of the list so we keep going, which has been covered */
}

Quzah.
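For anyone who wants to see the whole thing in one piece, here is a hedged sketch of a table hashed on word length. `TABLE_SIZE`, `table_add` and `table_free` are names invented for this example, and new words are pushed onto the front of their list rather than the end, which is a simplification of the outline above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 32 /* assumed number of slots; longer words wrap around */

typedef struct List {
    struct List *next;
    char *word;
} List;

/* Add a lower-cased copy of word unless it is already stored.
   Returns 1 if added, 0 if already present, -1 if malloc failed. */
int table_add(List *table[], const char *word)
{
    size_t len, i, slot;
    char *copy;
    List *ptr, *node;

    assert(table != NULL && word != NULL);
    len = strlen(word);
    copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    for (i = 0; i <= len; i++) /* <= so the terminating '\0' is copied too */
        copy[i] = (char)tolower((unsigned char)word[i]);

    slot = len % TABLE_SIZE; /* the hash is just the word length */
    for (ptr = table[slot]; ptr != NULL; ptr = ptr->next) {
        if (strcmp(ptr->word, copy) == 0) {
            free(copy); /* not unique, so the copy is not needed */
            return 0;
        }
    }

    node = malloc(sizeof *node);
    if (node == NULL) {
        free(copy);
        return -1;
    }
    node->word = copy;
    node->next = table[slot]; /* push on the front; order does not matter here */
    table[slot] = node;
    return 1;
}

/* Free every list in the table and reset each slot to NULL. */
void table_free(List *table[])
{
    size_t i;

    for (i = 0; i < TABLE_SIZE; i++) {
        List *ptr = table[i];
        while (ptr != NULL) {
            List *next = ptr->next;
            free(ptr->word);
            free(ptr);
            ptr = next;
        }
        table[i] = NULL;
    }
}
```

The table has to start with every slot set to NULL. Counting the calls to `table_add` that return 1 gives the number of unique words. Word length is a weak hash since most English words are short, but it keeps the example close to the suggestion in this thread.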

**itsme86** · 12-08-2004

Of course a hash table is a good idea. I agree. I just meant that if the experience isn't there to count the frequency of words then something that's even more complicated is unlikely to be in the OP's toolkit. But, maybe the concept of a hash table is only more advanced in my eyes.

**xErath** · 12-08-2004

Hash table? Hmm... it would be better to use a trie. Each node has a maximum of 26 sub-nodes, each one representing a different char. When working with dictionaries it is one of the most efficient data structures, and not as fallible as a hash table.

http://www.csse.monash.edu.au/~lloyd...gDS/Tree/Trie/
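A small sketch of such a trie, under the assumptions that only the letters a to z matter, that every other character is skipped, and that the letters are contiguous in the character set (true for ASCII). The names `TrieNode`, `trie_new`, `trie_insert` and `trie_free` are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define ALPHABET 26 /* one sub-node per letter a..z */

struct TrieNode {
    struct TrieNode *child[ALPHABET];
    int is_word; /* non-zero if a stored word ends at this node */
};

/* Allocate an empty node; returns NULL if malloc fails. */
struct TrieNode *trie_new(void)
{
    int i;
    struct TrieNode *node = malloc(sizeof *node);

    if (node == NULL)
        return NULL;
    for (i = 0; i < ALPHABET; i++)
        node->child[i] = NULL;
    node->is_word = 0;
    return node;
}

/* Insert word, ignoring case and any character that is not a letter.
   Returns 1 if the word is new, 0 if it was already stored (or had no
   letters at all), -1 if malloc failed. */
int trie_insert(struct TrieNode *root, const char *word)
{
    struct TrieNode *node = root;
    int letters = 0;

    assert(root != NULL && word != NULL);
    for (; *word != '\0'; word++) {
        int c = tolower((unsigned char)*word);
        if (c < 'a' || c > 'z')
            continue; /* skip digits, punctuation and so on */
        if (node->child[c - 'a'] == NULL) {
            node->child[c - 'a'] = trie_new();
            if (node->child[c - 'a'] == NULL)
                return -1;
        }
        node = node->child[c - 'a'];
        letters++;
    }
    if (letters == 0 || node->is_word)
        return 0;
    node->is_word = 1;
    return 1;
}

/* Free a node and everything below it. */
void trie_free(struct TrieNode *node)
{
    int i;

    if (node == NULL)
        return;
    for (i = 0; i < ALPHABET; i++)
        trie_free(node->child[i]);
    free(node);
}
```

Counting the inserts that return 1 gives the unique word total. Lookup cost depends on the length of the word rather than on how many words are already stored, at the price of 26 pointers per node.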

**quzah** · 12-08-2004

One of the reasons I suggested a hash table is that if you understand a linked list, how much harder is it really to have an array of them? Not much. All you need is a size for your array, and something to decide what slot each word falls into.

Quzah.
