need help--data mining from file (some quirks)

**jayrobers** · 10-12-2005

hello all-

i'm a student (not too many on here, huh?) and my assignment is to take a text file as input and perform data mining operations on it. more specifically, it needs to find the pair of words with the highest correlation index (CI). the CI is found through the equation: 2 x (# of times both words appear in the same sentence) / (# of times word A appears + # of times word B appears). that's all small business; i just wanted to let you know the details.

now, here's my quirk: i am NOT allowed to use the string.h and ctype.h libraries or functions.

basically, i don't know how to parse the input. the teacher suggested that we keep count of the words in the input, the number of occurrences of each word, and the sentence index of each word. a sentence ends with ?, !, or . and a word-separator can be a blank space, \t, \n, ',' . : ; ? ! or -. each line will have no more than 1000 characters. the number of sentences will not exceed 30000. the number of words in a sentence will not exceed 1000. the number of words will not exceed 1000. the sum of the lengths of all distinct words will not exceed 10000. each word will occur in no more than 20 sentences. a word will not occur more than once in a sentence. the number of pairs that constitute a solution will not exceed 100. (some of that was copy/pasted from my spec sheet)

here's what i would like to do; please help me get started or let me know if i should do it another way.

1. scan in one line at a time- go from the start until input reaches one of the line terminators (read differently from a word separator). place that into a char* or similar.

2. tokenize that line- use the word separators to find each word, get the index and increase word counts, store each word.

3. compare words- basically find if a word is repeated in another sentence.

4. repeat

ex: if the input is "see dog run! dog eats food for energy to run.", the pair of words would be "dog" and "run". (i didn't do the math for the CI). the program outputs the two words and the CI. there are special cases for a tie between many words but that's just an output modification.
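For reference, the math on that example works out as follows: "dog" and "run" each appear in 2 sentences and share both, so CI = 2 x 2 / (2 + 2) = 1.0. Note that pairs such as "eats" and "food" (each appearing once, in the same sentence) also score 2 x 1 / (1 + 1) = 1.0, so this particular input lands in the tie case. A minimal sketch of the formula in C follows; the name `correlation_index` and the choice to return 0 for an empty pair are assumptions made for illustration, not part of the spec.

```c
/* Hypothetical helper: CI = 2 * both / (count_a + count_b).
   both    = number of sentences containing both words
   count_a = number of sentences containing word A
   count_b = number of sentences containing word B */
double correlation_index(int both, int count_a, int count_b)
{
    if (count_a + count_b == 0)
        return 0.0;   /* assumption: an empty pair scores 0 */
    return 2.0 * both / (count_a + count_b);
}
```

Because the spec says a word never repeats within a sentence, "times a word appears" and "sentences containing the word" are the same number here.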

**Prelude** · 10-12-2005

>here's my quirk: i am NOT allowed to use the string.h and ctype.h libraries or functions.
Write your own.
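
As a rough illustration of what "write your own" could look like, here is a sketch of the handful of helpers this assignment seems to need. The names (`my_strlen`, `my_strcmp`, `is_sentence_end`, `is_word_sep`) are invented for this example; the separator sets are taken from the spec quoted above.

```c
#include <stddef.h>

/* Stand-in for strlen: count characters before the terminator. */
size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Stand-in for strcmp: <0, 0 or >0, compared as unsigned chars. */
int my_strcmp(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Sentence terminators from the spec: ? ! . */
int is_sentence_end(int c)
{
    return c == '?' || c == '!' || c == '.';
}

/* Word separators from the spec: space, tab, newline, , . : ; ? ! - */
int is_word_sep(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == ',' ||
           c == ':' || c == ';' || c == '-' || is_sentence_end(c);
}
```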

**itsme86** · 10-12-2005

Another way might be to read the file in a character at a time and store words in an array of strings one sentence at a time. Logic might be something like this:

Code:

While not at the end of the file
  Read in a character
  If character is a sentence terminator
    Terminate the current word (if any), process the sentence array, and reset both indexes
  Else if character is a word terminator
    If character_index > 0
      Add string terminator to sentence[word_index][character_index]
      Increment word_index and reset character_index to 0
  Else
    Add that character to sentence[word_index][character_index]
    Increment character_index
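
The loop above might be sketched in C roughly like this. It is only one possible reading of the pseudocode: the `struct sentence` layout, the names `feed_char` and `count_sentences`, and the array sizes (taken from the stated per-sentence and per-line limits) are all assumptions. The struct is about 1 MB, so it should be declared `static` or global rather than on the stack. Upper and lower case are treated as different letters here, since the spec does not say otherwise.

```c
#include <stdio.h>

#define MAX_WORDS 1000   /* words per sentence, from the spec */
#define MAX_LEN   1001   /* a line holds at most 1000 chars, plus '\0' */

struct sentence {
    char words[MAX_WORDS][MAX_LEN];
    int  nwords;   /* completed words in the current sentence */
    int  len;      /* characters in the word being built */
};

/* Close the word being built, if there is one. */
static void end_word(struct sentence *s)
{
    if (s->len > 0 && s->nwords < MAX_WORDS) {
        s->words[s->nwords][s->len] = '\0';
        s->nwords++;
    }
    s->len = 0;
}

/* Feed one character. Returns 1 when a non-empty sentence has just
   ended; the caller then uses words[0..nwords-1] and sets nwords = 0. */
int feed_char(struct sentence *s, int c)
{
    int end = (c == '?' || c == '!' || c == '.');
    int sep = (c == ' ' || c == '\t' || c == '\n' ||
               c == ',' || c == ':' || c == ';' || c == '-');

    if (end || sep) {
        end_word(s);
        return end && s->nwords > 0;
    }
    if (s->nwords < MAX_WORDS && s->len < MAX_LEN - 1)
        s->words[s->nwords][s->len++] = (char)c;
    return 0;
}

/* Example driver: counts the sentences in a stream. */
int count_sentences(FILE *fp, struct sentence *s)
{
    int c, n = 0;

    s->nwords = 0;
    s->len = 0;
    while ((c = fgetc(fp)) != EOF) {
        if (feed_char(s, c)) {
            n++;             /* real code would process s->words here */
            s->nwords = 0;   /* reset for the next sentence */
        }
    }
    return n;
}
```

Keeping the per-character logic in `feed_char` separate from the file loop makes it easy to try on a string before pointing it at a file.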

**jayrobers** · 10-12-2005

prelude- you can bet i tried

but i'm guessing the point of this assignment wasn't to see who could write the best functions so after a while i hunkered down and got to starting it out the right way

itsme86- thanks for that pseudocode, that's actually a great way to get me in the right direction.

a few more questions, basing off what itsme wrote:

1. would i need to compare every word in the first sentence to the next sentence, and so on? or, in other words, how would i go about checking to see how many times a pair of words appear together in a sentence?

2. other than a malloc() with the given size limits, is there another way to create the storage arrays? memory is not really an issue but i was just wondering about efficiency and time.

thanks for all the help so far

**itsme86** · 10-12-2005

Well, you have the "will not exceed" expectations so you could just create your array using those sizes.
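
One way those fixed sizes could be laid out, following the teacher's suggestion of a count plus sentence indexes per word, is sketched below. This reads "the number of words will not exceed 1000" as a limit on distinct words, which is an assumption; the struct, the `pool` buffer and `record_word` are names made up for this example.

```c
#define MAX_DISTINCT      1000   /* assumed: limit on distinct words */
#define MAX_POOL          11000  /* 10000 chars of text + 1000 terminators */
#define MAX_SENT_PER_WORD 20     /* "no more than 20 sentences" */

struct word_entry {
    const char *text;                  /* points into pool */
    int count;                         /* sentences containing this word */
    int sentences[MAX_SENT_PER_WORD];  /* sentence indexes, in input order */
};

static char pool[MAX_POOL];
static int pool_used;
static struct word_entry table[MAX_DISTINCT];
static int table_used;

/* Returns 1 if the two strings are identical (no string.h needed). */
static int same(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;
}

/* Record that word w occurs in the given sentence.
   Returns the word's table index, or -1 if a limit is exceeded. */
int record_word(const char *w, int sentence)
{
    int i, k, len = 0;

    for (i = 0; i < table_used; i++)
        if (same(table[i].text, w))
            break;

    if (i == table_used) {             /* new word: copy it into the pool */
        while (w[len] != '\0')
            len++;
        if (table_used == MAX_DISTINCT || pool_used + len + 1 > MAX_POOL)
            return -1;
        table[i].text = &pool[pool_used];
        for (k = 0; k <= len; k++)     /* <= copies the '\0' too */
            pool[pool_used++] = w[k];
        table[i].count = 0;
        table_used++;
    }
    if (table[i].count == MAX_SENT_PER_WORD)
        return -1;
    table[i].sentences[table[i].count++] = sentence;
    return i;
}
```

The linear search is fine for at most 1000 distinct words; a hash table would be the next step if that ever became the bottleneck.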

I guess you can keep track of pairs of words and how often they appear or you can read the entire file in and then do your calculations. You can expand the 2d array into a 3d one. Something like: document[sentences][words][characters].
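
Worth noting before committing to the 3d layout: sized from the stated maxima, a `document[30000][1000][1000]` char array would be around 30 GB, so it would need much smaller working limits. An alternative, assuming each word keeps an ascending list of the sentence indexes it appears in, is to count how many indexes two lists share. The sketch below does that with a merge-style walk; `count_shared` is a made-up name, and it relies on the spec's guarantee that a word appears at most once per sentence.

```c
/* Count the sentence indexes present in both lists.
   Both lists must be in ascending order with no duplicates. */
int count_shared(const int *a, int na, const int *b, int nb)
{
    int i = 0, j = 0, shared = 0;

    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {          /* same sentence holds both words */
            shared++;
            i++;
            j++;
        }
    }
    return shared;
}
```

With at most 20 sentences per word, each comparison takes at most about 40 steps, so checking every pair of up to 1000 distinct words stays cheap. The CI for a pair is then 2 x shared / (na + nb).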

There's no one way to do it. Pick a way you like and run with it. Recognizing inefficiencies in algorithms and finding better ways to do something is all part of the learning process.

**quzah** · 10-12-2005

Originally Posted by jayrobers

2 x (# of times both words appear in the same sentence ) / (# of times word A appears + # of times word B appears)

Hm... ok...

Originally Posted by jayrobers

basically, i don't know how to parse the input. the teacher suggested that we keep count of the words in the input, the number of occurrences of each word, and the sentence index of each word. a sentence ends with ?, !, or . and a word-separator can be a blank space, \t, \n, ',' . : ; ? ! or -. each line will have no more than 1000 characters. the number of sentences will not exceed 30000. the number of words in a sentence will not exceed 1000. the number of words will not exceed 1000. the sum of the lengths of all distinct words will not exceed 10000. each word will occur in no more than 20 sentences. a word will not occur more than once in a sentence. the number of pairs that constitute a solution will not exceed 100. (some of that was copy/pasted from my spec sheet)

So in other words, just make your function:

Code:

#include <stdio.h>  /* FILE, fclose */

float theanswer( FILE *fp )
{
    fclose( fp );
    return ( 2.0f * ( 1.0f / ( 1.0f + 1.0f ) ) );
}

Or to simplify:

Code:

int theanswer( FILE *fp )
{
    fclose( fp );
    return 1;
}

And save yourself the trouble of doing anything at all.

Quzah.
