Thread: Parse trees for dummies?

  1. #1
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277

    Parse trees for dummies?

    I've been thinking about programming languages lately. Here's what I've come up with. A programming language can be said to be "fully bootstrapped in X" if it can compile itself to some other language (or even binary format) X while also being able to interpret itself at runtime. I argue that any language that can do such a thing is as close to ideal as possible.

    So I'd like to try coming up with one that's fully bootstrapped in Javascript. I've finally got the tokenizer working (code here) and now I can move on to parsing. That part shouldn't be too hard except with the part of interpreting the parse trees. Concepts like "shift" and "reduce" come to mind but the truth is I'm a little hazy with how to approach it. Is there a "simple" approach to doing this? I don't care about speed at this point. Just something that works.

  2. #2
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Wow...turns out the tokenizer needed way more work than I had originally thought! It's alright now but one really big issue does remain. Consider the following possible productions:

    (1) j+3 --> `j`, `+`, `3` /* Right */
    (2) j+3 --> `j`, `+3` /* Wrong */

    And:

    (3) j=+3 --> `j`, `=`, `+`, `3` /* Wrong */
    (4) j=+3 --> `j`, `=`, `+3` /* Right */

    It's sort of a tricky ambiguity. Currently the tokenizer would produce (2) and (4) while switching to the old method the result would be (1) and (3). So neither of those approaches are sufficient.

    I guess I could just check if the preceding character is an operator BUT at that phase of processing (tokenizing) there could be numerous spaces, newlines, etc ahead of said operator. Not to mention there may be things other than operators might come up.

    At this point it almost seems like a job to be done at the parsing level. It'll make things a little more complicated for the parser but in the end maybe the cleanest solution?

  3. #3
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    I think you're on the verge of turning your tokenizer into a parser. It does not need to be more complicated than separating tokens from white space. Separating tokens like "j=+3" or "j+3" into parts of your language, like expression, operator, expression, is what the parser does.

    I can not think of a better time to remove excess white space than during tokenizing, either. You want parsing to be as easy as it can be so that it's correct.
    Last edited by whiteflags; 11-17-2019 at 04:27 PM.

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by whiteflags
    Separating tokens like "j=+3" or "j+3" into parts of your language, like expression, operator, expression, is what the parser does.
    Those aren't tokens though. After lexical analysis, the tokens would be "j", "=", "+", "3" for the former. The parser would then parse and understand that the adjacent tokens "+" and "3" are grouped together to form an integer literal.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  5. #5
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Something interesting to think about all this is how a tokenizer is really just like a lower-level parser that itself has received it's own input (characters) from another parser that processes individual bits.

    It's parsers all the way down man!

    Quote Originally Posted by whiteflags View Post
    I think you're on the verge of turning your tokenizer into a parser. It does not need to be more complicated than separating tokens from white space. Separating tokens like "j=+3" or "j+3" into parts of your language, like expression, operator, expression, is what the parser does.

    I can not think of a better time to remove excess white space than during tokenizing, either. You want parsing to be as easy as it can be so that it's correct.
    It can't be that simple though because there isn't always whitespace between each token. Like "foo+=bar;". In that case we should have four tokens: `foo`, `+=`, `bar`, `;`.

    Quote Originally Posted by laserlight View Post
    Those aren't tokens though. After lexical analysis, the tokens would be "j", "=", "+", "3" for the former. The parser would then parse and understand that the adjacent tokens "+" and "3" are grouped together to form an integer literal.
    Precisely. And yes I do agree now that it'd be best to have those operators handled by the parser. Much more straightforward and technically more suitable for that level of context.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Help in implementing N-ary trees or general trees in C.
    By Pro Neon in forum C Programming
    Replies: 3
    Last Post: 09-22-2017, 07:24 AM
  2. c for dummies
    By newprog82 in forum C Programming
    Replies: 3
    Last Post: 11-02-2010, 07:18 AM
  3. Binary Trees MINI MAXING, probability trees
    By curlious in forum General AI Programming
    Replies: 3
    Last Post: 09-30-2005, 10:57 AM
  4. C ++ for dummies...
    By tetraflare in forum C++ Programming
    Replies: 3
    Last Post: 11-15-2002, 09:34 PM
  5. traversing binary trees or partial trees
    By sballew in forum C Programming
    Replies: 4
    Last Post: 12-05-2001, 09:19 PM

Tags for this Thread