Thread: Textual Analysis

  1. #1
    Registered User
    Join Date
    Feb 2014
    Posts
    2

    Textual Analysis

    I am looking to write a program that, given a particular word, scans a plain text document and lists the words that appear within x words of the given word, along with a count of how many times each one appears.

    Do I need to use regex to do the pattern matching here? Is there a data structure that is particularly suited to a task like this? I don't want to reinvent the wheel. It seems like there should be libraries that already do this sort of thing, but my searches have turned up nothing.

    Any advice anyone could offer would be appreciated.

  2. #2
    tabstop (and the Hat of Guessing)
    Join Date
    Nov 2007
    Posts
    14,336
    Regex doesn't sound necessary. Maybe just store words in a circular buffer (so that you have the previous x words) and when the word you just read in matches, add everything in the buffer to the list (and set a counter so that the next x words also get added).

    As for storing words + counts, a hash table seems reasonable.
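    A minimal sketch of the circular buffer plus hash table idea, under some assumptions of mine that are not in the thread: words arrive one at a time, already lower-cased and stripped of punctuation; the names feed, bump and lookup and the sizes MAXWORD, MAXWIN and NBUCKETS are made up for illustration; and a word that lies within x words of two separate matches is counted once for each match.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* hypothetical hash table size */
#define MAXWORD  64     /* hypothetical limit; longer words are truncated */
#define MAXWIN   32     /* hypothetical largest supported x */

struct entry {
    char word[MAXWORD];
    int count;
    struct entry *next;             /* chain of words sharing a bucket */
};

static struct entry *table[NBUCKETS];
static char ring[MAXWIN][MAXWORD];  /* the previous x words */
static int filled, pos, pending;

/* Simple multiplicative string hash over at most MAXWORD - 1 characters. */
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    int i;
    for (i = 0; s[i] != '\0' && i < MAXWORD - 1; i++)
        h = h * 33 + (unsigned char)s[i];
    return h % NBUCKETS;
}

/* Add one to the count for word, creating its entry on first sight. */
static void bump(const char *word)
{
    unsigned h = hash(word);
    struct entry *e;
    for (e = table[h]; e != NULL; e = e->next) {
        if (strncmp(e->word, word, MAXWORD - 1) == 0) {
            e->count++;
            return;
        }
    }
    e = calloc(1, sizeof *e);       /* zeroed, so word stays terminated */
    if (e == NULL)
        return;                     /* out of memory: drop this word */
    strncpy(e->word, word, MAXWORD - 1);
    e->count = 1;
    e->next = table[h];
    table[h] = e;
}

/* Return the count recorded for word, or 0 if it was never counted. */
static int lookup(const char *word)
{
    struct entry *e;
    for (e = table[hash(word)]; e != NULL; e = e->next)
        if (strncmp(e->word, word, MAXWORD - 1) == 0)
            return e->count;
    return 0;
}

/* Process the next word of the text; x is the window on each side. */
static void feed(const char *word, const char *target, int x)
{
    int i;
    if (x > MAXWIN)
        x = MAXWIN;
    if (pending > 0) {              /* within x words after a match */
        bump(word);
        pending--;
    }
    if (strcmp(word, target) == 0) {
        for (i = 0; i < filled; i++)
            bump(ring[i]);          /* the (up to) x words before it */
        pending = x;
    }
    if (x > 0) {                    /* remember word, overwriting the oldest */
        strncpy(ring[pos], word, MAXWORD - 1);
        ring[pos][MAXWORD - 1] = '\0';
        pos = (pos + 1) % x;
        if (filled < x)
            filled++;
    }
}
```

    Calling feed() once per word and then walking the table gives the word/count list. The state is kept in static variables only to keep the sketch short, so it handles one search per run.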

  3. #3
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    As for libraries, try searching for some text-analysis keywords and buzz-words to find some interesting results. Here is one C library I found (haven't tried it myself):

    The `Bow' Toolkit

    Of course, text analysis is a big field, and there are plenty of non-C implementations that you could study and adapt if you're building your own C library or project.

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    ... words that appear within x words of the given word along with a count of how many times it appears.
    This sounds like it could be done easily with a struct having two members:

    a char array for the word itself, and
    an int to hold the count for that word.

    That struct would be defined above your main function, but the array of structs would be defined inside main().
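    Roughly what that struct could look like, as a sketch only: the names wordcount and add_word and the 64-character limit are my own assumptions, and the lookup here is a plain linear scan, which is fine for a first version.

```c
#include <string.h>

#define MAXWORD 64              /* assumed limit; longer words are truncated */

struct wordcount {
    char word[MAXWORD];         /* the word itself */
    int count;                  /* how many times it has been seen */
};

/*
 * Increment the count for word in list[0 .. *n - 1], appending a new
 * entry if the word is not there yet. Returns the index of the entry,
 * or -1 if the list already holds max entries.
 */
int add_word(struct wordcount *list, int *n, int max, const char *word)
{
    int i;
    for (i = 0; i < *n; i++) {
        if (strncmp(list[i].word, word, MAXWORD - 1) == 0) {
            list[i].count++;
            return i;
        }
    }
    if (*n >= max)
        return -1;
    strncpy(list[*n].word, word, MAXWORD - 1);
    list[*n].word[MAXWORD - 1] = '\0';
    list[*n].count = 1;
    return (*n)++;
}
```

    Inside main() you would declare the array (for example struct wordcount list[1000]; with int n = 0;) and call add_word(list, &n, 1000, word) for each word in range.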

    I'm not clear what you mean by "within x words of the given word". Do you mean sequentially within x words of the given word, or within x words of the given word in alphabetical (possibly lexicographic) order?

    If you could show an example or clarify, it would help. This may be simpler than you think.

  5. #5
    Registered User
    Join Date
    Feb 2014
    Posts
    2
    Quote Originally Posted by Adak

    I'm not clear what you mean by "within x words of the given word". Do you mean sequentially within x words of the given word, or within x words of the given word in alphabetical (possibly lexicographic) order?

    If you could show an example or clarify, it would help. This may be simpler than you think.
    What I would like is to pass three arguments on the command line: a filename, a word, and a number, e.g. tolstoy.txt, hat, 5

    The idea is that the program would look 5 words both ways from any (case-insensitive) occurrence of "hat" in tolstoy.txt. The output would be a file that lists all of the words that were part of a positive match. Next to each word you would see the number of times it was part of a positive match. Something like:

    green 12
    big 14
    ugly 2
    my 8
    etc.

    For now I am not concerned with omitting common words like "the", but eventually that would be nice.
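    For the command-line side, here is a sketch of two helpers that would cover the number argument and the case-insensitive match. The names parse_window and same_word are made up, and I am assuming the number must be a non-negative integer that fits in an int.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse the window argument (e.g. "5"). Stores the value in *out and
 * returns 0 on success; returns -1 if arg is empty, has trailing junk,
 * is negative, or does not fit in an int.
 */
int parse_window(const char *arg, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(arg, &end, 10);
    if (end == arg || *end != '\0' || errno == ERANGE || v < 0 || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}

/* Return 1 if a and b are the same word ignoring case, otherwise 0. */
int same_word(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;            /* equal only if both strings ended */
}
```

    In main() you would check that argc is 4, open argv[1] with fopen(), keep argv[2] as the target word, and pass argv[3] to parse_window().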

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Ok. Next question: how far backward do you want to look? Memory is (probably) going to be inadequate to save every word in the book, but a file could be used for this purpose, allowing a full look backward.

    Read and save one line of text at a time into a char array. Get each word in that line of text, and compare it to the target word "hat". From the position of "hat" in the char array, count backward the desired number of words.

    Each word found will be looked up with a binary search of your program's list of words (a large array of structs that will be malloc'd, due to its size, and kept in sorted order so the binary search works). Print the word and the words[i].count integer that is part of its struct (i being the index of each of the words you find within your target's desired range).
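    A sketch of the binary search over a sorted array of structs, with my own assumptions: the names wordcount, find and count_word are hypothetical, the array is kept sorted by inserting each new word at its proper position, and the caller supplies the array (malloc'd or otherwise) together with its capacity.

```c
#include <string.h>

#define MAXWORD 64              /* assumed limit; longer words are truncated */

struct wordcount {
    char word[MAXWORD];
    int count;
};

/*
 * Binary search in list[0 .. n - 1], which must be sorted by word.
 * Returns the index if found, otherwise -(insertion point) - 1.
 */
static int find(const struct wordcount *list, int n, const char *word)
{
    int lo = 0, hi = n - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strncmp(word, list[mid].word, MAXWORD - 1);
        if (c == 0)
            return mid;
        if (c < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -lo - 1;
}

/*
 * Count one occurrence of word, inserting it in sorted position if it
 * is new. Returns the index of its entry, or -1 if the list is full.
 */
int count_word(struct wordcount *list, int *n, int max, const char *word)
{
    int i = find(list, *n, word);

    if (i >= 0) {
        list[i].count++;
        return i;
    }
    if (*n >= max)
        return -1;
    i = -i - 1;                 /* where the new word belongs */
    memmove(&list[i + 1], &list[i], (size_t)(*n - i) * sizeof list[0]);
    strncpy(list[i].word, word, MAXWORD - 1);
    list[i].word[MAXWORD - 1] = '\0';
    list[i].count = 1;
    (*n)++;
    return i;
}
```

    Each lookup takes about log2(n) comparisons. Inserting a new word shifts the entries after it, which is acceptable here because the number of distinct words near the target is small compared with the whole text.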

    First thing - create a small text file, and use fgets() to read in each line of text, and separate each word in that char array that holds the line of text.
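    A sketch of that first step, assuming a "word" is simply a run of letters (so "don't" splits into "don" and "t") and that everything is lower-cased on the way in. The names split_line and count_words and the buffer sizes are my own choices.

```c
#include <ctype.h>
#include <stdio.h>

#define MAXLINE  1024           /* assumed longest line */
#define MAXWORDS 256            /* assumed most words on one line */

/*
 * Split line in place into lower-cased words. Each words[i] points
 * into line, and separators are overwritten with '\0'. Returns the
 * number of words stored (at most max).
 */
int split_line(char *line, char *words[], int max)
{
    int n = 0;
    char *p = line;

    while (*p != '\0' && n < max) {
        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;                            /* skip separators */
        if (*p == '\0')
            break;
        words[n++] = p;                     /* start of a word */
        while (isalpha((unsigned char)*p)) {
            *p = (char)tolower((unsigned char)*p);
            p++;
        }
        if (*p != '\0')
            *p++ = '\0';                    /* terminate the word */
    }
    return n;
}

/* Read fp one line at a time with fgets() and return the word total. */
long count_words(FILE *fp)
{
    char line[MAXLINE];
    char *words[MAXWORDS];
    long total = 0;

    /* A line longer than MAXLINE - 1 characters arrives in pieces,
       so a word that straddles the cut would be counted as two. */
    while (fgets(line, sizeof line, fp) != NULL)
        total += split_line(line, words, MAXWORDS);
    return total;
}
```

    Once this prints the right words for your small test file, the loop body in count_words() is the place to compare each word against the target.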

    It would be very good to have the char array hold at least the maximum number of words you will ever want to count backward. (A bit more is OK.)

    This isn't a difficult program, but it's not a trivial one, either.

    The program will run much faster than you might expect.
