Thread: How to parse a string with no delimiter?

  1. #1
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657

    How to parse a string with no delimiter?

    I have a little program up and running...with which strings like "x + 2 y = z" can be meaningfully parsed (by using stringstreams )and used.

    But it does not seem very 'nice' that the input has to be riddled with whitespaces separating the tokens. What is the best way of parsing something like "x+2y=z" ?

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Pretty much the same way you're doing it now, I would suspect -- but instead of stopping at one particular character, you stop at "whatever doesn't make sense". So when you're trying to read a number (for instance), 'y' doesn't make sense, so you're done. When you're trying to read a variable name, '=' doesn't make sense. Etc.

  3. #3
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Code:
    x+2y=z
    O_o

    It looks like a fairly standard algebraic expression if you ask me.

    I'm to lazy to look up a full grammar.

    Do you know anything about parsing beyond "look for a delimiter"?

    Soma

    Code:
    variable := {'a', 'b', 'c', ..., 'x', 'y', 'z'}
    operator : = {'+', '-', '/', '*', '='}
    digit := {'0', '1', '2', ..., '8', '9'}
    number := digit*
    exponent := '^' number
    coefficient := {'+', '-'} number
    term := number {variable exponent}* 
    expression := {term operator}* term

  4. #4
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    But a variable name or any identifier can also contain a number and characters ...How would I know if it is a number after it or the name ?

  5. #5
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    hen in that case you have to give up implied multiplication and forbid +, -, *, /, ^ etc in variable names.

  6. #6
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Since I'm forbidding particular kinds of names...would trying to implement the whole set of C/C++ identifier && precedence rules make more sense without becoming too complicated ?

  7. #7
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    How would I know if it is a number after it or the name ?
    A variation on C++ own "The Most Vexing Parse"...

    Basically, without going into a lot of theory, if you have an ambiguous situation and you do not wish to impose syntax on your grammar for the sake of semantics, you formalize on one option over the other assuming that it is what is intended, or you can instead develop an ambiguous grammar relying on higher level facilities to distinguish semantics based on context. (In this example, context might be a list of variable names.)

    In other words, if "x264" (^_^) is a valid variable name and expressions of the form "variable+coefficientvariablecoefficientvariable=v ariable" is a valid expression, you will just have to pick one method over the other by default while allowing people to "force" a given interpretation by grouping or some other logic.

    For example, suppose as above that "x264" names a variable; further, suppose that you allow split coefficients of grouped terms for the sake of simplicity. ("2x2y" versus "4xy").

    With that in mind, consider this example:

    Code:
    x264y+19=z
    You can choose to either parse the expression as "x264" being one variable and "y" being the other or "x" and "y" being variables with "264" being a coefficient of "y". Whichever interpretation you choose, you always use that same rule.

    Alternatively, you allow people to force a particular interpretation by grouping or some other mechanism.

    Code:
    (x264*y)+19=z
    Code:
    (x*264y)+19=z
    Since I'm forbidding particular kinds of names...would trying to implement the whole set of C/C++ identifier && precedence rules make more sense without becoming too complicated ?
    The full weight of C++ identifier naming and precedence would be a burden if you only want to parse algorithmic expression.

    You could do it, but a simply "variable names are composed of only uppercase and lower case letters" would be more than sufficient and far less complicated.

    Soma

  8. #8
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    The full weight of C++ identifier naming and precedence would be a burden if you only want to parse algorithmic expression.
    ..Hmm..I won't do that now...but for the sake of curiosity ....when is it necessary (other than writing compilers! ) ?

  9. #9
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    The idiotic combination of prefix, infix, and postfix grouping, token precedence differences, and imposed syntax for the sake of semantics all in a single declaration?

    All the gods help you if you ever have to for any reason.

    Soma

  10. #10
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Parsers have two "levels" to them.

    The lower level is called a lexer and it is responsible for dividing the input into tokens. The lexer is what would split "2y" in your example into the two tokens "2" and "y". This is separate from the higher level grammar you are trying to parse. At the lexical level you are just looking for where tokens begin and end.

    The higher level is called the parser and it takes the tokens generated by the lexer and applies the grammar rules to them.

    By separating the problem into these two parts it becomes easier to handle.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  11. #11
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Then I already have made the higher level parser part...
    Should I make the lexer completely independent of the parser ?
    My parser handles a 'deque' of token(a class) objects..
    Am I right in saying that the lexer should generate a deque<token> from a std::string ?

  12. #12
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by manasij7479 View Post
    Am I right in saying that the lexer should generate a deque<token> from a std::string ?
    You could, but it's just as easy to generate the tokens on demand (and requires less memory anyway)
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. how to parse a string
    By -EquinoX- in forum C Programming
    Replies: 48
    Last Post: 05-21-2008, 01:57 PM
  2. how to parse a string?
    By kalamram in forum C Programming
    Replies: 3
    Last Post: 01-10-2007, 02:49 PM
  3. How to set a delimiter for each line's string?
    By Mathsniper in forum C Programming
    Replies: 1
    Last Post: 12-14-2006, 10:09 AM
  4. Parse the string
    By swanley007 in forum C++ Programming
    Replies: 4
    Last Post: 11-04-2005, 11:17 AM
  5. String Parse.
    By Coder87C in forum C# Programming
    Replies: 3
    Last Post: 08-16-2005, 02:18 AM

Tags for this Thread