hello all-
i'm a student (not too many on here huh?) and my assignment is to take a text file as input and perform data mining operations on it. more specifically, it needs to find the pair of words with the highest correlation index (CI). the CI is found through the equation: CI = 2 * (# of times both words appear in the same sentence) / (# of times word A appears + # of times word B appears). that part is simple enough; i just wanted to give you the details.
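for reference, here is how that formula might look as a C function. this is only a sketch: the function and parameter names are made up, not from the spec, and returning 0 when neither word appears at all is an assumption.

```c
/* correlation index for one pair of words.
 *   both    = number of sentences that contain both words
 *   count_a = number of times word A appears
 *   count_b = number of times word B appears
 * names are hypothetical, not taken from the assignment. */
double correlation_index(int both, int count_a, int count_b)
{
    int total = count_a + count_b;

    if (total == 0) {
        return 0.0;  /* assumption: treat an undefined CI as 0 */
    }
    /* 2.0 forces floating-point division instead of integer division */
    return (2.0 * both) / total;
}
```

the one thing to watch is the 2.0: with plain int arithmetic, 2 * 1 / 3 would come out as 0.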
now, here's the catch: i am NOT allowed to use anything from string.h or ctype.h.
basically, i don't know how to parse the input. the teacher suggested that we keep count of the words in the input, the number of occurrences of each word, and the sentence index of each word. a sentence ends with ?, !, or . and a word separator can be a blank space, \t, \n, or any of , . : ; ? ! -. each line will have no more than 1000 characters. the number of sentences will not exceed 30000. the number of words in a sentence will not exceed 1000. the number of words will not exceed 1000. the sum of the lengths of all distinct words will not exceed 10000. each word will occur in no more than 20 sentences. a word will not occur more than once in a sentence. the number of pairs that constitute a solution will not exceed 100. (some of that was copy/pasted from my spec sheet)
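since ctype.h is off limits, the separator and sentence-end rules could be written as two small hand-rolled functions, with the limits kept as constants for sizing arrays. this is a sketch only: the macro and function names are made up, and only the characters and numbers come from the spec.

```c
/* limits taken from the spec sheet; the macro names are hypothetical */
#define MAX_LINE_LEN       1000   /* characters per line */
#define MAX_SENTENCES      30000
#define MAX_WORDS          1000
#define MAX_WORD_CHARS     10000  /* total length of all distinct words */
#define MAX_SENT_PER_WORD  20     /* sentences any one word can occur in */

/* returns 1 if c ends a sentence (? ! .), otherwise 0 */
int is_sentence_end(char c)
{
    return c == '?' || c == '!' || c == '.';
}

/* returns 1 if c separates words, otherwise 0.
 * note that every sentence terminator is also a word separator. */
int is_separator(char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == ',' || c == '.'  || c == ':'  || c == ';' ||
           c == '?' || c == '!'  || c == '-';
}
```

anything that is not a separator counts as part of a word here, so no isalpha() replacement is needed.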
here's what i would like to do; please help me get started or let me know if i should do it another way.
1. scan in one line at a time- go from the start until input reaches one of the line terminators (read differently from a word separator). place that into a char* or similar.
2. tokenize that line- implement the word separators to find each word, get the index and increase word counts, store each word.
3. compare words- basically find if a word is repeated in another sentence.
4. repeat
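a rough sketch of steps 1 and 2 without string.h or ctype.h. note it swaps the line buffer for reading one character at a time, which is a different approach from the plan above, not the assigned one. all names (word_entry, feed_char, and so on) are made up, and comparing words case-sensitively is an assumption.

```c
#include <stdio.h>

#define MAX_WORDS          1000
#define MAX_WORD_CHARS     10000
#define MAX_SENT_PER_WORD  20
#define MAX_TOKEN_LEN      1000   /* a word cannot be longer than a line */

struct word_entry {
    const char *text;                  /* points into pool[] */
    int count;                         /* number of occurrences so far */
    int sentences[MAX_SENT_PER_WORD];  /* sentence index of each occurrence */
};

static char pool[MAX_WORD_CHARS + MAX_WORDS];  /* word text, one '\0' per word */
static int pool_used = 0;
static struct word_entry words[MAX_WORDS];
static int word_count = 0;

static char tok[MAX_TOKEN_LEN + 1];  /* word currently being read */
static int tok_len = 0;
static int sentence = 0;             /* index of the current sentence */

static int is_sentence_end(int c) { return c == '?' || c == '!' || c == '.'; }

static int is_separator(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == ',' || c == ':' ||
           c == ';' || c == '-' || is_sentence_end(c);
}

/* stand-in for strcmp(a, b) == 0 */
static int same_word(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) { a++; b++; }
    return *a == *b;
}

/* returns the table index of t, adding it if new; -1 if out of room */
static int find_or_add(const char *t, int len)
{
    int i;
    for (i = 0; i < word_count; i++)
        if (same_word(words[i].text, t)) return i;
    if (word_count >= MAX_WORDS || pool_used + len + 1 > (int)sizeof pool)
        return -1;
    words[word_count].text = &pool[pool_used];
    for (i = 0; i <= len; i++) pool[pool_used++] = t[i];  /* copies the '\0' too */
    words[word_count].count = 0;
    return word_count++;
}

/* feed one input character: builds words, counts them, tracks sentences */
void feed_char(int c)
{
    if (!is_separator(c)) {
        if (tok_len < MAX_TOKEN_LEN) tok[tok_len++] = (char)c;
        return;
    }
    if (tok_len > 0) {               /* a separator closes the current word */
        int w;
        tok[tok_len] = '\0';
        w = find_or_add(tok, tok_len);
        if (w >= 0 && words[w].count < MAX_SENT_PER_WORD)
            words[w].sentences[words[w].count++] = sentence;
        tok_len = 0;
    }
    if (is_sentence_end(c)) sentence++;
}

/* convenience wrappers: feed a whole string or a whole file */
void feed_text(const char *s) { while (*s != '\0') feed_char(*s++); }

void feed_file(FILE *in)
{
    int c;
    while ((c = getc(in)) != EOF) feed_char(c);
    feed_char('\n');                 /* flush a last word with no separator after it */
}
```

calling feed_file(stdin) from main would fill words[], after which step 3 is just a loop over pairs of entries. a run like "..." bumps the sentence index more than once, which leaves gaps in the numbering but should not affect which words share a sentence.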
example: if the input is "see dog run! dog eats food for energy to run.", the pair of words would be "dog" and "run" (i didn't do the math for the CI). the program outputs the two words and the CI. there are special cases for a tie between several pairs, but that's just an output modification.
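working that example through with the formula as stated: "dog" and "run" each appear twice and share both sentences, so CI = 2 * 2 / (2 + 2) = 1.0. but any two words that appear only in the second sentence (say "eats" and "food") also get 2 * 1 / (1 + 1) = 1.0, so this input actually lands in the tie case. the sketch below does the pair comparison on a hand-entered table for that example. the struct and function names are made up, and it assumes each word's sentence indices are stored in ascending order.

```c
#define MAX_SENT_PER_WORD 20

struct word_entry {
    const char *text;
    int count;                         /* number of occurrences */
    int sentences[MAX_SENT_PER_WORD];  /* sentence indices, ascending */
};

/* number of sentences that contain both words (merge-style walk) */
int shared_sentences(const struct word_entry *a, const struct word_entry *b)
{
    int i = 0, j = 0, both = 0;
    while (i < a->count && j < b->count) {
        if (a->sentences[i] == b->sentences[j]) { both++; i++; j++; }
        else if (a->sentences[i] < b->sentences[j]) i++;
        else j++;
    }
    return both;
}

/* CI of one pair; assumes both words occur at least once */
double pair_ci(const struct word_entry *a, const struct word_entry *b)
{
    return 2.0 * shared_sentences(a, b) / (a->count + b->count);
}

/* returns the highest CI over all pairs and stores in *ties how many
 * pairs reach it; returns -1.0 if there are fewer than two words */
double best_ci(const struct word_entry *w, int n, int *ties)
{
    double best = -1.0;
    int i, j;
    *ties = 0;
    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            double ci = pair_ci(&w[i], &w[j]);
            if (ci > best) { best = ci; *ties = 1; }
            else if (ci == best) { (*ties)++; }
        }
    }
    return best;
}

/* "see dog run! dog eats food for energy to run." entered by hand */
static const struct word_entry example[] = {
    { "see",    1, {0}    },
    { "dog",    2, {0, 1} },
    { "run",    2, {0, 1} },
    { "eats",   1, {1}    },
    { "food",   1, {1}    },
    { "for",    1, {1}    },
    { "energy", 1, {1}    },
    { "to",     1, {1}    },
};
```

best_ci(example, 8, &ties) should give 1.0 with ties set to 11: dog/run plus the 10 pairs among the five words that appear only in the second sentence. with at most 1000 words the double loop is about 500,000 pairs, and each comparison walks at most 20 + 20 sentence indices.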