Thread: need help--data mining from file (some quirks)

  1. #1
    Registered User
    Join Date
    Oct 2005
    Posts
    2

    need help--data mining from file (some quirks)

    hello all-

    i'm a student (not too many on here huh?) and my assignment is to take a text file as input and perform data mining operations on it. more specifically, it needs to find the pair of words with the highest correlation index. the CI is found through the equation: 2 x (# of times both words appear in the same sentence) / (# of times word A appears + # of times word B appears). that's all small business; i just wanted to let you know the details.
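
    For reference, the formula above can be sketched as a small C function. The name correlation_index and its parameters are illustrative only and do not come from the spec sheet; the zero check is an added assumption so the function never divides by zero.

```c
/* Correlation index as described in the post:
   2 * (sentences containing both words) / (count of A + count of B).
   Function and parameter names are hypothetical. */
double correlation_index(int both, int count_a, int count_b)
{
    int total = count_a + count_b;

    if (total == 0)          /* neither word occurs: avoid dividing by zero */
        return 0.0;
    return (2.0 * both) / total;
}
```

    Two words that always appear together score 1.0, and two words that never share a sentence score 0.0.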

    now, here's my quirk: i am NOT allowed to use the string.h and ctype.h libraries or functions.

    basically, i don't know how to parse the input. the teacher suggested that we keep count of the words in the input, the number of occurrences of each word, and the sentence index of each word. a sentence ends with ?, !, or . and a word-separator can be a blank space, \t, \n, ',' . : ; ? ! or -. each line will have no more than 1000 characters. the number of sentences will not exceed 30000. the number of words in a sentence will not exceed 1000. the number of words will not exceed 1000. the sum of the lengths of all distinct words will not exceed 10000. each word will occur in no more than 20 sentences. a word will not occur more than once in a sentence. the number of pairs that constitute a solution will not exceed 100. (some of that was copy/pasted from my spec sheet)

    here's what i would like to do; please help me get started or let me know if i should do it another way.

    1. scan in one line at a time- go from the start until input reaches one of the line terminators (read differently from a word separator). place that into a char* or similar.

    2. tokenize that line- implement the word separators to find each word, get the index and increase word counts, store each word.

    3. compare words- basically find if a word is repeated in another sentence.

    4. repeat

    ex: if the input is "see dog run! dog eats food for energy to run.", the pair of words would be "dog" and "run". (i didn't do the math for the CI). the program outputs the two words and the CI. there are special cases for a tie between many words but that's just an output modification.

  2. #2
    Code Goddess Prelude
    Join Date
    Sep 2001
    Posts
    9,897
    >here's my quirk: i am NOT allowed to use the string.h and ctype.h libraries or functions.
    Write your own.
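
    A minimal sketch of what "your own" might look like, assuming plain ASCII input. The my_ names are hypothetical stand-ins for the usual string.h and ctype.h routines, not anything required by the assignment.

```c
#include <stddef.h>

/* Length of a NUL-terminated string (stand-in for strlen). */
size_t my_strlen(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}

/* Returns <0, 0 or >0, like strcmp. */
int my_strcmp(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Copies src, including its terminator, into dst.
   The caller must make sure dst is large enough. */
char *my_strcpy(char *dst, const char *src)
{
    char *p = dst;

    while ((*p++ = *src++) != '\0')
        ;
    return dst;
}

/* ASCII-only stand-in for tolower. */
int my_tolower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    return c;
}
```

    Each one is only a few lines, so the restriction costs less than it first appears.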
    My best code is written with the delete key.

  3. #3
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Another way might be to read the file in a character at a time and store words in an array of strings one sentence at a time. Logic might be something like this:
    Code:
    While not at the end of the file
      Read in a character
      If character is a sentence terminator
        Terminate the current word, if one is in progress
        Process the sentence array and reset word_index and character_index
      else
        if character is a word terminator
          Add string terminator ('\0') to sentence[word_index][character_index]
          Increment word_index and reset character_index to 0
        else
          Add that character to sentence[word_index][character_index]
          Increment character_index
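
    The logic above could be sketched in C roughly like this. The array sizes come from the limits in the first post; read_sentence, is_sentence_end and is_word_separator are hypothetical names, and skipping empty words between back-to-back separators is an added assumption.

```c
#include <stdio.h>

#define MAX_WORDS    1000   /* spec: at most 1000 words per sentence */
#define MAX_WORD_LEN 1000   /* spec: a line has at most 1000 characters */

static int is_sentence_end(int c)
{
    return c == '.' || c == '?' || c == '!';
}

static int is_word_separator(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == ',' ||
           c == ':' || c == ';' || c == '-' || is_sentence_end(c);
}

/* Reads one sentence from fp into words[][] and returns the number of
   words stored, or -1 once the file is exhausted and no words remain. */
int read_sentence(FILE *fp, char words[][MAX_WORD_LEN + 1])
{
    int word_index = 0, character_index = 0, c;

    while ((c = fgetc(fp)) != EOF) {
        if (is_word_separator(c)) {
            if (character_index > 0) {          /* close the word in progress */
                words[word_index][character_index] = '\0';
                word_index++;
                character_index = 0;
            }
            if (is_sentence_end(c) && word_index > 0)
                return word_index;
        } else if (word_index < MAX_WORDS && character_index < MAX_WORD_LEN) {
            words[word_index][character_index++] = (char)c;
        }
    }
    if (character_index > 0) {                  /* last word had no terminator */
        words[word_index][character_index] = '\0';
        word_index++;
    }
    return word_index > 0 ? word_index : -1;
}
```

    Call it in a loop until it returns -1, keeping a sentence counter alongside so each word can be recorded with its sentence index. The words array is about 1 MB, so declare it static or global instead of on the stack.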
    If you understand what you're doing, you're not learning anything.

  4. #4
    Registered User
    Join Date
    Oct 2005
    Posts
    2
    prelude- you can bet i tried but i'm guessing the point of this assignment wasn't to see who could write the best functions so after a while i hunkered down and got to starting it out the right way

    itsme86- thanks for that pseudocode, that's actually a great way to get me in the right direction.

    a few more questions, basing off what itsme wrote:

    1. would i need to compare every word in the first sentence to the next sentence, and so on? or, in other words, how would i go about checking to see how many times a pair of words appear together in a sentence?

    2. other than a malloc() with the given size limits, is there another way to create the storage arrays? memory is not really an issue but i was just wondering about efficiency and time.

    thanks for all the help so far

  5. #5
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Well, you have the "will not exceed" expectations so you could just create your array using those sizes.

    I guess you can keep track of pairs of words and how often they appear or you can read the entire file in and then do your calculations. You can expand the 2d array into a 3d one. Something like: document[sentences][words][characters].
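
    One possible way to count how often a pair shares a sentence, assuming each distinct word keeps the list of sentence indices it appeared in (at most 20, per the spec). The struct layout and the names word_entry and shared_sentences are illustrative only. Because sentences are read in order, each list is already sorted, so two lists can be compared with a simple merge.

```c
#define MAX_OCCURRENCES 20   /* spec: a word occurs in at most 20 sentences */

/* Hypothetical record for one distinct word. */
struct word_entry {
    const char *text;                  /* the word itself */
    int sentences[MAX_OCCURRENCES];    /* sentence indices, ascending */
    int count;                         /* number of sentences it appears in */
};

/* Counts the sentences that contain both words by walking the two
   sorted index lists in step. */
int shared_sentences(const struct word_entry *a, const struct word_entry *b)
{
    int i = 0, j = 0, both = 0;

    while (i < a->count && j < b->count) {
        if (a->sentences[i] < b->sentences[j]) {
            i++;
        } else if (a->sentences[i] > b->sentences[j]) {
            j++;
        } else {                       /* same sentence index in both lists */
            both++;
            i++;
            j++;
        }
    }
    return both;
}
```

    For the example in the first post, "dog" and "run" both hold {0, 1}, so shared_sentences gives 2; plug that and the two counts into the CI formula. Note that any two words that each occur once, in the same sentence, also reach a CI of 1, so ties are likely.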

    There's no one way to do it. Pick a way you like and run with it. Recognizing inefficiencies in algorithms and finding better ways to do something is all part of the learning process.
    If you understand what you're doing, you're not learning anything.

  6. #6
    ATH0 quzah
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by jayrobers
    2 x (# of times both words appear in the same sentence ) / (# of times word A appears + # of times word B appears)
    Hm... ok...

    Quote Originally Posted by jayrobers
    basically, i don't know how to parse the input. the teacher suggested that we keep count of the words in the input, the number of occurrences of each word, and the sentence index of each word. a sentence ends with ?, !, or . and a word-separator can be a blank space, \t, \n, ',' . : ; ? ! or -. each line will have no more than 1000 characters. the number of sentences will not exceed 30000. the number of words in a sentence will not exceed 1000. the number of words will not exceed 1000. the sum of the lengths of all distinct words will not exceed 10000. each word will occur in no more than 20 sentences. a word will not occur more than once in a sentence. the number of pairs that constitute a solution will not exceed 100. (some of that was copy/pasted from my spec sheet)
    So in other words, just make your function:
    Code:
    float theanswer( FILE *fp )
    {
        fclose( fp );
        return ( 2.0f * ( 1.0f / ( 1.0f + 1.0f ) ) );
    }
    Or to simplify:
    Code:
    int theanswer( FILE *fp )
    {
        fclose( fp );
        return 1;
    }
    And save yourself the trouble of doing anything at all.

    Quzah.
    Hope is the first step on the road to disappointment.
