Thread: Lex and yacc for parsing a tree

  1. #1
    Registered User
    Join Date
    Dec 2014
    Posts
    1

    Lex and yacc for parsing a tree

    Hi all. I'm dealing with the task of creating a lexer and a parser that would recognize a tree built in this way: If a node doesn't have children it is only identified by itself , otherwise if he has children it will be followed by <children>. For example:

    <1,<2,3,<4,5>>>

    it's a tree that has 1 as root , 2 and 3 as children and 3 has 4 and 5 has children.

    Here's my lex code:


    Code:
    %{
    #include <stdlib.h>
    void yyerror(char *);
    #include "y.tab.h"
    %}
    
    %%
    
    [0-9]+ {return INTEGER;}
    
    [<>,] {return* yytext;}
    
    [ \t] ; /* Skip white spaces */
    
    .   yyerror("invalid character");
    
    %%
    int yywrap(void){
    return 1;
    }
    Now is the problem, with the yacc code , i had some ideas but the grammar doesn't work properly for reconizing the tree. I tought recursivley. A tree structure could be:

    <root,subtree>

    where subtree could be something like:

    <child1,<children1>,child2,child3<children3>>

    where child1 has children1 as children ,child2 has no children an child3 is the same for child1.So the problem is that i don't manage to build an efficient grammar section for the parser to recognize a tree build in that way.

    TY all how give some ideas.

  2. #2
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    I'm not familiar with lex/yacc. I'm wondering about what your grammar would look like in normal form.

    Is this what you are describing?

    Code:
    <tree> ::= <node>
    <node> ::= <number>
             | "<" <number> <child-list> ">"
    <child-list> ::= "," <node>
                   | "," <node> <child-list>
    No, that doesn't seem right. You want
    Code:
    <1,<2,3,<4,5>>>
    to mean "a tree with root 1, having children 2 and 3, where 2 is a leaf and 3 has children 4 and 5, which are both leaves."

    As far as trees are concerned, though, I think it'd be easier to read that as a different tree, without labeling the nodes: "root has two children: the first is leaf 1, while the other is a subtree whose children are leaves 2, 3, and another subtree with leaves 4, 5."

    Anyway, let me give your grammar another shot:
    Code:
    <tree> ::= <children>
    <children> ::= "<" <child-list> ">"
    <child-list> ::= <number>
                   | <number> "," <child-list>
                   | <number> "," <children>
                   | <number> "," <children> "," <child-list>
    That seems better. Not sure about lex/yacc, though.
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  3. #3
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    I'm not sure how the representation <A,B> is supposed to represent a tree.

    I suppose you could have the convention that <A,B> means a binary tree with an anonymous root and two children A, and B, but when you get to the leaves how do you represent it? <1> and <2>?

    It makes more sense something like <root,left_subtree,right_subtree>, at least if you want the inner nodes to be able to carry values.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  4. #4
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    The double role a comma has in your spec -- it is both a delimiter between node name, and a list delimiter between children -- hides a glaring problem in your spec.



    Please, let me show how I'd go about this. I'll use ABNF, which is what is used in internet RFC's, and definitely useful to know if you work with internet stuff.

    Let's start with a simple recursive definition similar to brewbuck and others suggested above, but with a colon instead of the first comma, just to make it a bit more readable.


    node := "<" value [ ":" [ node ] *( "," node ) ] ">"
    value := 1*digit
    digit := "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

    In other words, a digit is a single digit, a value is one or more digits, and a node begins with a < followed by a value. If the node has child nodes, a : follows, and then the comma-separated list of child nodes. Finally, a > closes the node.

    (I'm not assuming binary trees here; the above spec allows any number of leaves per node.)

    A leaf node with a value 5 would be
    <5> or
    <5:>

    If we have node 1 with child nodes 2 and 3, we could specify that with
    <1:<2>,<3>>

    To describe this tree,
    Code:
    │      3
    │     ╱ ╲
    │    ╱   ╲
    │   2     5
    │  ╱     ╱ ╲
    │ 1     4   6
    we could use
    Code:
    <3:<2:<1>>,<5:<4>,<6>>
    A simple recursive definition of a tree will always use pre-order tree traversal. This means that starting at a given node, you emit the node value, then descend into each child starting at the leftmost. Above, we start at 3, descend into 2, then into 1. Since 1 is a leaf node, and 2 only had the left child, we've done the first subtree of node 3. Then we descend to node 5, and from there to leaf node 4. Having completed 4, we only have the second subtree of node 5 to do, and that's node 6.

    The above paragraph is the key. If you don't understand it, compare the tree and its specification, and how the specification builds the tree or vice versa, until you grok it.

    I only used colons above, because it makes it easier to read (and write) the spec correctly. We can further augment the syntax, to not require angle brackets for a leaf node, and to make commas and colons optional and interchangeable:

    node := value / "<" value *( [ ":" / "," ] node ) ">"
    value := 1*digit
    digit := "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

    Using this augmented syntax, we could describe the above tree using
    <3:<2,1>,<5:4,6>> or
    <3,<2,1>,<5,4,6>> or
    <3<2,1>,<5,4,6>> or
    <3<2<1>><5<4><6>>>

    Although the separators make a big difference for human readability, it does not matter for a machine parser. So, which separators and where to use and require or leave optional, should be determined by what makes it easiest for humans to understand the format correctly.

    (Note: Programmers are humans, too. If you thought that "well, since this format is going to be read and written by computers only, I don't need to worry about how easy it is for humans to understand", you'd be wrong: the programmers have to understand the format to be able to implement it. So, human understanding of the format definitely matters, even if none of the users are human.)



    The original post describes this tree:
    Code:
    │    1
    │   ╱ ╲
    │  ╱   ╲
    │ 2     3
    │      ╱ ╲
    │     4   5
    Using the syntaxes defined above, we could describe it using
    <1:<2>,<3:<4>,<5>>> or
    <1,<2>,<3,<4>,<5>>> or
    <1,2,<3,4,5>>

    At this point it should be clear that the syntax OP (nickmenphis) wishes to use is either a logic bug, or requires something other than simply recursive specification: the < and > are used very differently than a simple recursive spec would.

    Indeed, comparing the text <1,<2,3,<4,5>>> to the tree above shows that in OP's spec, < and > are used to denote a level in the tree, filling nodes from right to left.

    Nicmenphis, your spec is not going to be easy to implement, and even if you implement it, it will be frustrating to your users. Can you rethink your spec?

  5. #5
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    If you can pick any syntax you wish, using just digits, <, >, and comma, then I'd suggest

    node := 1*digit [ "<" *( node "," ) node ">" ]
    value := 1*digit
    digit := "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

    For reasoning, consider the following binary search tree:
    Code:
    │      4
    │     ╱ ╲
    │    ╱   ╲
    │   2     6
    │  ╱ ╲   ╱ ╲
    │ 1   3 5   7
    This would be
    <4,<2,<1>,<3>>,<6,<5>,<7>>> or
    <4,<2,1,3>,<6,5,7>>
    using the specification in my previous post, but
    4<2<1,3>,6<5,7>>
    using the spec in this post.

    The first tree in my previous post would be 3<2<1>,5<4,6>> using the spec in this post.

    The tree mentioned in the initial post in this thread would be 1<2,3<4,5>> using the spec in this post.
    Last edited by Nominal Animal; 12-16-2014 at 03:37 AM.

  6. #6
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    *Moved to Tech Board*
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 1
    Last Post: 07-19-2014, 01:50 PM
  2. String parsing(parsing comments out of HTML file)
    By slcjoey in forum C# Programming
    Replies: 0
    Last Post: 07-29-2006, 08:28 PM
  3. draw tree graph of yacc parsing
    By talz13 in forum C Programming
    Replies: 2
    Last Post: 07-23-2006, 01:33 AM
  4. Parsing mathematical function to tree structure
    By Roule in forum C++ Programming
    Replies: 15
    Last Post: 10-04-2004, 03:36 PM
  5. parsing a binary tree
    By Unregistered in forum C Programming
    Replies: 3
    Last Post: 04-24-2002, 11:18 PM

Tags for this Thread