Thread: String Parsing

  1. #1
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43

    String Parsing

    Just to introduce myself again, my name is Shaun and I'm a 21-year-old engineering student with about... six weeks of C++ experience.


    My main concern isn't my ability to get the job done. I have enough Matlab experience and logical thinking to accomplish this task. My worry is that with all the amazing libraries available for C++, there's probably a MUCH more efficient way of doing this.

    I have a text file. The data is arranged in the following format.



    <spaces> "Column 1 Data" <spaces> "Column 2 Data" <spaces> "Column 3 Data" <new line>



    The number of spaces is not consistent. I need to extract the data from column three and do n-gram analysis on it. If anyone is interested as to what exactly n-grams are and how they're used, I'll be happy to explain. However, for the sake of brevity, I'll just provide an example; this should be sufficient.


    For the string "Shaun" I would need to produce

    S
    h
    a
    u
    n
    Sh
    ha
    au
    un
    Sha
    hau
    aun
    Shau
    haun
    Shaun


    I should point out that I did NOT stop there because I had reached the length of the word. No matter what the size of the string, I will only break it up into a maximum string length of 5.


    So, using a column based approach, I was able to accomplish this using a combination of Matlab and Excel. However, I'd like to do it in Visual Studio C++ 7.1.

    My idea is to first use regular expressions to look for a space followed by any number of optional spaces. I'd replace every match with a comma, thus giving me a file delimited by commas and not a varying number of spaces.

    Next, I can use the ifstream.get() function to break up the columns, discarding the first and second columns and writing the characters in the third column to an object str of the class string, while looking for a \n to stop on.

    Once I have str, I can break it up using... some function. This is the part I really need your help on.

    Once I have broken it up, I'll store the pieces somewhere (I can do this part later, it's more complicated and is my task for next week) and then loop through again, discarding columns 1 and 2 from the next line and so on.


    That's where I stand, I'm installing Boost right now and I'm reading up on the regular expression capabilities.

    Thanks!

  2. #2
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    The operator>> stops at spaces by default, so it is rather easy to read in values separated by space. If you know there are three values to a line and none have embedded spaces then something like this would read in a line at a time (assume fin is an input stream):
    Code:
    while (fin >> val1 >> val2 >> val3)
    {
        // process the three values here, or ignore the first two and process the third
    }
    That's much easier than using get().

  3. #3
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Quote Originally Posted by Daved View Post
    The operator>> stops at spaces by default, so it is rather easy to read in values separated by space. If you know there are three values to a line and none have embedded spaces then something like this would read in a line at a time (assume fin is an input stream):
    Code:
    while (fin >> val1 >> val2 >> val3)
    {
        // process the three values here, or ignore the first two and process the third
    }
    That's much easier than using get().


    You're absolutely correct, and I actually knew that rule.

    I shoulda caught that one. Oh well, thanks, that probably saved me quite a bit of time already.

  4. #4
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    For your other problem, I'm not aware of a tool that does it directly, but you could probably do it fairly simply with a few loops.

  5. #5
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Quote Originally Posted by Daved View Post
    For your other problem, I'm not aware of a tool that does it directly, but you could probably do it fairly simply with a few loops.
    Ok. I should be able to do that, I'm just wary of using loops. I have a habit of using loops for almost everything, when there's often a simpler, quicker method available.

    Thanks for your time Daved

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    This really is a job for loops. Specifically, you should be able to do it using two nested loops.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #7
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Quote Originally Posted by CornedBee View Post
    This really is a job for loops. Specifically, you should be able to do it using two nested loops.
    Sorry, I can't tell if you're talking about my total post or what's left of my task after Daved helped me.

    Do you mean two loops total, or two loops nested within a third loop which throws away the first two columns?

  8. #8
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    He's referring to the part that is left after the reading in. If you include the read loop it would be three, but I would put the other two in a separate function and call that from the read loop.

  9. #9
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    That's what I was thinking.

    Thanks CornedBee and Daved.

  10. #10
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Almost forgot to do the follow up.

    Thanks to everyone who helped me with this. Here's the code I eventually settled on.


    Code:
    #include <string>
    #include <iostream>
    using namespace std;
    int GRAM_MAX = 5;
    
    int main()
    {
        char yn;
        int pos = 1, strlen;
        string str;
        do
        {
            cout << "\nEnter string: ";
            getline(cin, str);
            strlen = str.length();
            for (int i = 1; i <= GRAM_MAX; i++)
                for (int j = 0; j < (strlen - (i - 1)); j++)
                    cout << str.substr(j, i) << endl;
            cout << "\nAgain? (Y/N): ";
            cin >> yn;
            cin.ignore(1000, '\n');
        } while (yn == 'y' || yn == 'Y');
        return 0;
    }

    This is just the test program, as the actual program simply reads in the strings from a large text file.


    Thanks again everyone!
    Last edited by Shaun32887; 07-02-2008 at 01:58 PM.

  11. #11
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,412
    You might want to be careful there: strlen is the name of a function from <cstring>, which some library implementations might include in <string>. Combined with the using directive, this could cause a name collision with your strlen variable.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  12. #12
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Quote Originally Posted by laserlight View Post
    You might want to be careful there: strlen is the name of a function from <cstring>, which some library implementations might include in <string>. Combined with the using directive, this could cause a name collision with your strlen variable.
    Yeah, that sounds like it might be a problem.

    Thanks laserlight, I'll go fix that.



    Fix:
    Code:
    #include <string>
    #include <iostream>
    using namespace std;
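    const int GRAM_MAX = 5; // longest n-gram to produce
    
    int main()
    {
        char yn = 'n'; // stays 'n' if reading the answer fails, so the loop ends
        string str;
        do
        {
            cout << "\nEnter string: ";
            getline(cin, str);
            // renamed from strlen to avoid clashing with the C library function
            int numChar = static_cast<int>(str.length());
            // i is the n-gram length, j is the starting position
            for (int i = 1; i <= GRAM_MAX; i++)
                for (int j = 0; j < (numChar - (i - 1)); j++)
                    cout << str.substr(j, i) << endl;
            cout << "\nAgain? (Y/N): ";
            cin >> yn;
            cin.ignore(1000, '\n'); // discard the rest of the line before the next getline
        } while (yn == 'y' || yn == 'Y');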
        return 0;
    }
    Last edited by Shaun32887; 07-02-2008 at 02:22 PM.
