# Thread: How to parse a string with no delimiter?

1. ## How to parse a string with no delimiter?

I have a little program up and running...with which strings like "x + 2 y = z" can be meaningfully parsed (by using stringstreams )and used.

But it does not seem very 'nice' that the input has to be riddled with whitespaces separating the tokens. What is the best way of parsing something like "x+2y=z" ?

2. Pretty much the same way you're doing it now, I would suspect -- but instead of stopping at one particular character, you stop at "whatever doesn't make sense". So when you're trying to read a number (for instance), 'y' doesn't make sense, so you're done. When you're trying to read a variable name, '=' doesn't make sense. Etc.

3. Code:
`x+2y=z`
O_o

It looks like a fairly standard algebraic expression if you ask me.

I'm to lazy to look up a full grammar.

Do you know anything about parsing beyond "look for a delimiter"?

Soma

Code:
```variable := {'a', 'b', 'c', ..., 'x', 'y', 'z'}
operator : = {'+', '-', '/', '*', '='}
digit := {'0', '1', '2', ..., '8', '9'}
number := digit*
exponent := '^' number
coefficient := {'+', '-'} number
term := number {variable exponent}*
expression := {term operator}* term```

4. But a variable name or any identifier can also contain a number and characters ...How would I know if it is a number after it or the name ?

5. hen in that case you have to give up implied multiplication and forbid +, -, *, /, ^ etc in variable names.

6. Since I'm forbidding particular kinds of names...would trying to implement the whole set of C/C++ identifier && precedence rules make more sense without becoming too complicated ?

7. How would I know if it is a number after it or the name ?
A variation on C++ own "The Most Vexing Parse"...

Basically, without going into a lot of theory, if you have an ambiguous situation and you do not wish to impose syntax on your grammar for the sake of semantics, you formalize on one option over the other assuming that it is what is intended, or you can instead develop an ambiguous grammar relying on higher level facilities to distinguish semantics based on context. (In this example, context might be a list of variable names.)

In other words, if "x264" (^_^) is a valid variable name and expressions of the form "variable+coefficientvariablecoefficientvariable=v ariable" is a valid expression, you will just have to pick one method over the other by default while allowing people to "force" a given interpretation by grouping or some other logic.

For example, suppose as above that "x264" names a variable; further, suppose that you allow split coefficients of grouped terms for the sake of simplicity. ("2x2y" versus "4xy").

With that in mind, consider this example:

Code:
`x264y+19=z`
You can choose to either parse the expression as "x264" being one variable and "y" being the other or "x" and "y" being variables with "264" being a coefficient of "y". Whichever interpretation you choose, you always use that same rule.

Alternatively, you allow people to force a particular interpretation by grouping or some other mechanism.

Code:
`(x264*y)+19=z`
Code:
`(x*264y)+19=z`
Since I'm forbidding particular kinds of names...would trying to implement the whole set of C/C++ identifier && precedence rules make more sense without becoming too complicated ?
The full weight of C++ identifier naming and precedence would be a burden if you only want to parse algorithmic expression.

You could do it, but a simply "variable names are composed of only uppercase and lower case letters" would be more than sufficient and far less complicated.

Soma

8. The full weight of C++ identifier naming and precedence would be a burden if you only want to parse algorithmic expression.
..Hmm..I won't do that now...but for the sake of curiosity ....when is it necessary (other than writing compilers! ) ?

9. O_o

The idiotic combination of prefix, infix, and postfix grouping, token precedence differences, and imposed syntax for the sake of semantics all in a single declaration?

All the gods help you if you ever have to for any reason.

Soma

10. Parsers have two "levels" to them.

The lower level is called a lexer and it is responsible for dividing the input into tokens. The lexer is what would split "2y" in your example into the two tokens "2" and "y". This is separate from the higher level grammar you are trying to parse. At the lexical level you are just looking for where tokens begin and end.

The higher level is called the parser and it takes the tokens generated by the lexer and applies the grammar rules to them.

By separating the problem into these two parts it becomes easier to handle.