Thread: Identify and count word sequences within a text file

  1. #1
    Registered User
    Join Date
    Jul 2011
    Posts
    22

    Identify and count word sequences within a text file

    Hi there,

    I'm not totally sure this is the best place to post this, but I'm after a bit of assistance or a pointer in the right direction.

    I'm looking to analyse text by identifying sequences of words that appear multiple times within a text file.

    Does anybody know of a class or library that I can use to do this?

    Many thanks in advance

    Freem

  2. #2
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,472
    It depends what you mean by sequences. If there is an easily identified fixed string to search for, then standard library objects like std::string itself will probably suffice, as member functions like find, find_first_of, replace, etc. can be used.

    For example, if you just want to find how many times 'the' occurs, you could use getline() on your file stream and search each line, held in a temporary string, using find().

    If the searching is likely to be more jumbled up and complex, you might consider a regular expression library, or even a fuzzy matching algorithm.
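
As a rough illustration of the regular expression route, here is a sketch using std::regex from the C++11 standard library (one option among several regex libraries). The function name count_word is made up, and the sketch assumes the word contains only letters or digits, with no regex metacharacters.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: count whole-word, case-insensitive matches of `word`.
// Assumes `word` holds only letters/digits (no regex metacharacters).
std::size_t count_word(const std::string& text, const std::string& word) {
    // \b is a word boundary, so "the" does not match inside "other"
    const std::regex pattern("\\b" + word + "\\b",
                             std::regex::ECMAScript | std::regex::icase);
    auto first = std::sregex_iterator(text.begin(), text.end(), pattern);
    auto last = std::sregex_iterator();  // default-constructed end iterator
    return static_cast<std::size_t>(std::distance(first, last));
}
```

Unlike a plain find(), the word boundaries stop partial matches, and the icase flag treats 'The' and 'the' as the same word.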
    Last edited by rogster001; 02-04-2012 at 08:16 AM.
    Thought for the day:
    "Are you sure your sanity chip is fully screwed in sir?" (Kryten)
    FLTK: "The most fun you can have with your clothes on."

    Stroustrup:
    "If I had thought of it and had some marketing sense every computer and just about any gadget would have had a little 'C++ Inside' sticker on it'"

  3. #3
    Registered User
    Join Date
    Jul 2011
    Posts
    22
    Thanks for the reply.

    It's a bit more complex than just searching for a given string within text. Before searching the text, I wouldn't actually know what the search string would be. I'm hoping to create something that will learn what the common word sequences are within a file.

  4. #4
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,472
    Well, a simplistic (and expensive) approach is just brute force. You decide how many words form a sequence, say 3. Then probably buffer the file twice: step to word one in buffer1, get a three-word sequence from that start point and parse all the way through buffer2, and every time a match for the buffer1 sequence is found, save the result for it. Then step to word two in buffer1, get three words from that start point, check them against buffer2, save the results, then step to word three in buffer1, and so on. It is expensive because every start point means another full pass over the text. Punctuation will have to be dealt with, either included or discarded. If you have enough time you could do the whole process starting with a single word, then two-word sequences, then three, four, five, etc.
    But like I say, this is very simplistic.
    Last edited by rogster001; 02-04-2012 at 12:41 PM.
