Thread: A quick parser

  1. #1
    Registered User
    Join Date
    Apr 2007
    Posts
    141

    A quick parser

    OK C++ gurus, I'd like a little advice with regards to what tool I should use to generate a quick parser. I need to parse a text file with a bunch of statements of the form
    var1 = 2.5
    var2 = 3
    thisvar = -42
    The values can always be assumed to be doubles and ultimately I will put the values into a hash table. I know that the standard library has a hash table I could use, but what about some lightweight parsing tools? I've done this kind of thing before in C using strtok(), but I assume that the standard library or maybe the boost library has something a little slicker?

    My only requirement is that I don't need to incur brain damage to understand it. This has to be banged out in a couple of hours at most. I wrote something like this in Haskell that was elegant and concise but took me several days to understand the Haskell library that I used.

  2. #2
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    >> I know that the standard library has a hash table I could use
    Actually, the hashed container is not in the standard library, but rather it is in the pseudo-standard TR1 library.

    This seems to be simple enough that a normal stream would work.
    Code:
    while (in >> varname >> equalsign >> doubleval)
    {
        // do work
    }

  3. #3
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    You make it sound like some monumental task...

    Code:
    std::pair<std::string, double> parse_statement(std::istream &in_p)
    {
        std::string var_name;
        in_p >> var_name;
        std::string equal_sign;
        in_p >> equal_sign;
        if(equal_sign != "=")
        {
            throw parse_error();
        }
        double value;
        in_p >> value;
        return std::make_pair(var_name, value);
    }
    ??

  4. #4
    Registered User
    Join Date
    Apr 2007
    Posts
    141
    Quote Originally Posted by brewbuck View Post
    You make it sound like some monumental task...

    Code:
    std::pair<std::string, double> parse_statement(std::istream &in_p)
    {
        std::string var_name;
        in_p >> var_name;
        std::string equal_sign;
        in_p >> equal_sign;
        if(equal_sign != "=")
        {
            throw parse_error();
        }
        double value;
        in_p >> value;
        return std::make_pair(var_name, value);
    }
    ??
    I'm a C++ newbie so please be patient with me. I didn't realize that the standard stream libraries parsed floating point strings. I presume it handles scientific notation and signs correctly as well? I had to write all of that logic myself last time I did it, including the tokenizer, so yeah it did take a little while.

  5. #5
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    There are rules for the format the floating point value must be in, so it depends on what exactly your format is. There are acceptable formats for both scientific notation and negative and positive signs are allowed as well.

    If you are forced to use a format that doesn't fit with what the stream expects, then you may have to do some extra work.

  6. #6
    Registered User
    Join Date
    Apr 2007
    Posts
    141
    Quote Originally Posted by brewbuck View Post
    You make it sound like some monumental task...
    Actually the code you provided doesn't quite work. Parsers are always a tricky business thanks to all the annoying special cases. Where things fail with this code is on lines like this.

    var1=23.0
    When you do something like
    fstr >> svar

    and svar is of type string, it will set svar = "var1=23.0". The tokenizer doesn't use = as a delimiter in the stream library, at least with MS visual studio 2008. I suspect there is a way to fix this in a hopefully straightforward manner, but I will have to research it a bit further.

  7. #7
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    and svar is of type string, it will set svar = "var1=23.0". The tokenizer doesn't use = as a delimiter in the stream library, at least with MS visual studio 2008.
    Formatted input with operator>> skips whitespace only. You could use getline() with '=', or use getline() to read the entire line into a string and then break it on the '='.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  8. #8
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    The format is important. Your original example used whitespace to delimit the tokens, which is why operator>> was the logical solution. If your input format is different, then you might have to use some other solution. Do you know the exact format of the input? Do you know the exact format of the floating point numbers in the input?

  9. #9
    Registered User
    Join Date
    Apr 2007
    Posts
    141
    Quote Originally Posted by laserlight View Post
    Formatted input with operator>> skips whitespace only. You could use getline() with '=', or use getline() to read the entire line into a string and then break it on the '='.
    Yeah that's what I ended up doing actually. I used getline(). Funny thing though about the logic behind the stream state. My basic routine looks like this:

    Code:
    typedef std::map<std::string, double> parmap ;
    typedef enum {VARKEY, NUMVAL, NEWLINE} parstate ;
    parmap parsefile(const std::string filename)
    {
    using namespace std ;
    
    parmap pmap ;
    int linenum = 1 ;
    ifstream fstr(filename.c_str(), ifstream::in) ;
    rassert(fstr.is_open(), " Can't open " + filename) ;
    string vname , dstr ;
    double dval ;
    // initial parser state
    parstate pstate = VARKEY ;
    
    while(fstr.good()) {
    	switch (pstate) {
    		case VARKEY:
    			getline(fstr, vname, '=') ;
    			//rassert(fstr.good(), "line: " + to_string(linenum) + " expecting = ") ;
    			if (fstr.good()) {
    			/* validate variable name */
    			rassert(isalpha(vname.at(0)),"line: " + to_string(linenum) + 
    				" " + vname + " Is not a valid variable name") ;
    			pstate = NUMVAL ;
    			}
    			break ;
    	
    		case NUMVAL:
    			fstr >> dval ;
    			/* insert into map */
    			pmap[vname] = dval ;
    			pstate = NEWLINE  ;
    			break ;
    		case NEWLINE:
    			/* discard the newline */
    			getline(fstr, vname) ;
    			linenum++ ;
    			pstate = VARKEY ;
    			
    			break ;
    		default:
    			rassert(false, " unknown parser state") ;
    	}
    } // end while(fstr)
    fstr.close() ;
    return(pmap) ;
    
    } // end parsefile()
    rassert() is just an assert like routine that throws an exception with a dumb message.
    to_string() just converts the numerical object to a string. I realize I didn't have to set this up as a state machine since the states occur sequentially, however just in case I wanted to extend the parser I thought it was a good idea to use this type of format.

    Now I had originally thought I'd do a little bit of validation of the input text formatting. In particular if I were expecting a variable name, it should be followed by an '=' character, hence the getline(fstr, vname, '=') ; statement.

    My problems began when I realized that my input files have each assignment statement delimitted by a newline, so I decided to make that a requirement, hence the extra getline() in the NEWLINE state. Without this, in fact the getline() in the next VARKEY state would return with an empty variable and I would throw an exception (I think) when I tried to read a double.

    But the trouble occurs at the end of the file. Suppose there is no newline; well the last getline terminates (I thought) at eof. I thought this would set the state of the stream so that it would then kick out of the loop and all would be well. Nope. It requires another getline call to get an error state in the stream. The simple way for me to handle this was to disable the error checking in the VARKEY state and let the loop kick back up to the while conditional after calling getline() one last (IMHO unnecessary) time.

    That's how I ended up with this ugly (but working) code.

  10. #10
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    If there are no characters after the last value then the last getline should cause good() to be false. However, normally there is a newline, so that getline will read it and the loop will continue and only fail during the VARKEY state. You could check for both good() and eof() in your while loop control, but if I remember correctly that won't help in this situation because the getline will stop at the newline and eof() won't be set yet.

    If you didn't use the state machine, you could do this:
    Code:
    while (getline(fstr, vname, '='))
    {
      fstr >> dval;
      getline(fstr, dummy);
    }
    That will break regardless of whether there is a newline at the end of the input file or not.

  11. #11
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by SevenThunders View Post
    Actually the code you provided doesn't quite work. Parsers are always a tricky business thanks to all the annoying special cases.
    Your example of the grammar seemed to indicate that all tokens would be whitespace separated. If not, then obviously my method isn't enough.

  12. #12
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    You can always use boost::spirit to make a parser:

    Code:
    rule<> parser = *anychar_p >> ch_p('=') >> real_p;
    Boom, a parser

  13. #13
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    There, now, brewbuck, that doesn't count. You don't store the data.
    Code:
    typedef map<string, float> valmap;
    typedef valmap::value_type value;
    valmap values;
    value v;
    rule <> parser = ((*anychar_p)[assign_a(v.first)] >> '=' >> real_p[assign_a(v.second)])[insert_at_a(values, v)];
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  14. #14
    Registered User
    Join Date
    Apr 2007
    Posts
    141
    Quote Originally Posted by brewbuck View Post
    You can always use boost::spirit to make a parser:

    Code:
    rule<> parser = *anychar_p >> ch_p('=') >> real_p;
    Boom, a parser

    That's pretty cool. Reminds me of some of the Haskell parser libraries.
    Just out of curiosity, suppose I wanted to support more types other than just double, say int or maybe later string for the variable I am reading. Of course the actual type would be determined dynamically. The Haskell library had a concept of OR ing in possible ways of reading a token. They overloaded an operator for this purpose. Something like,

    Code:
    rule<> parser = *anychar_p >> ch_p('=') >> (real_p || int_p || string_p) ;
    Applying the usual short circuit semantics.

  15. #15
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by SevenThunders View Post
    Code:
    rule<> parser = *anychar_p >> ch_p('=') >> (real_p || int_p || string_p) ;
    Change those '||' to '|' and boost::spirit likes it. spirit also has an '||' operator but it does something else (see the docs for details). Don't be fooled, it still short-circuits, even though '|' is not traditionally a short-circuiting operator. boost::spirit allows other forms of alternation, such as "pick shortest match" or "pick longest match."
    Last edited by brewbuck; 04-02-2008 at 03:24 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. strcmp returning 1...
    By Axel in forum C Programming
    Replies: 12
    Last Post: 09-08-2006, 07:48 PM
  2. Do you know...
    By davejigsaw in forum C++ Programming
    Replies: 1
    Last Post: 05-10-2005, 10:33 AM
  3. Problem with a file parser.
    By Hulag in forum C++ Programming
    Replies: 7
    Last Post: 03-17-2005, 09:54 AM
  4. recursive quick sort - stack overflow
    By Micko in forum C Programming
    Replies: 9
    Last Post: 01-01-2005, 05:51 PM
  5. Parser - need help!
    By thelma in forum C Programming
    Replies: 2
    Last Post: 04-05-2004, 08:06 PM