Lex and yacc for parsing a tree

**nickmenphis** · 12-14-2014

Hi all. I need to write a lexer and a parser that recognize a tree written in the following way: a node without children is identified by its value alone; a node with children is followed by <children>. For example:

<1,<2,3,<4,5>>>

This is a tree with 1 as the root, 2 and 3 as its children, and 4 and 5 as the children of 3.

Here's my lex code:

Code:

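%{
#include <stdlib.h>
void yyerror(char *);
#include "y.tab.h"
%}

%%

[0-9]+ { yylval = atoi(yytext); return INTEGER; } /* pass the node value to the parser */

[<>,] { return *yytext; } /* punctuation is returned as its own character code */

[ \t\n] ; /* Skip white space, including newlines */

.   yyerror("invalid character");

%%
int yywrap(void) {
    return 1; /* no further input files */
}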

Now the problem is the yacc code. I had some ideas, but my grammar doesn't recognize the tree properly. I thought about it recursively. A tree structure could be:

<root,subtree>

where subtree could be something like:

<child1,<children1>,child2,child3,<children3>>

where child1 has children1 as its children, child2 has no children, and child3 is like child1. So the problem is that I can't come up with a workable grammar section for a parser that recognizes a tree built this way.

Thanks to anyone who can offer some ideas.

**CodeMonkey** · 12-15-2014

I'm not familiar with lex/yacc. I'm wondering what your grammar would look like in Backus normal form (BNF).

Is this what you are describing?

Code:

<tree> ::= <node>
<node> ::= <number>
         | "<" <number> <child-list> ">"
<child-list> ::= "," <node>
               | "," <node> <child-list>

No, that doesn't seem right. You want

Code:

<1,<2,3,<4,5>>>

to mean "a tree with root 1, having children 2 and 3, where 2 is a leaf and 3 has children 4 and 5, which are both leaves."

As far as trees are concerned, though, I think it'd be easier to read that as a different tree, without labeling the nodes: "root has two children: the first is leaf 1, while the other is a subtree whose children are leaves 2, 3, and another subtree with leaves 4, 5."

Anyway, let me give your grammar another shot:

Code:

<tree> ::= <children>
<children> ::= "<" <child-list> ">"
<child-list> ::= <number>
               | <number> "," <child-list>
               | <number> "," <children>
               | <number> "," <children> "," <child-list>

That seems better. Not sure about lex/yacc, though.
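For what it's worth, the second grammar can be checked without lex/yacc. Below is a minimal recursive-descent sketch in C, assuming spaces and tabs may appear between tokens as in the posted lexer. The names `recognize`, `children`, and `child_list` are made up for illustration, and the code only accepts or rejects the input; it does not build a tree.

```c
#include <ctype.h>
#include <stdbool.h>

/* Cursor into the input text; advanced as tokens are consumed. */
static const char *cur;

static void skip_ws(void) { while (*cur == ' ' || *cur == '\t') cur++; }

/* Look at the next non-blank character without consuming it. */
static char peek(void) { skip_ws(); return *cur; }

/* Consume one expected punctuation character. */
static bool eat(char c) {
    if (peek() != c) return false;
    cur++;
    return true;
}

/* <number>: one or more decimal digits. */
static bool number(void) {
    skip_ws();
    if (!isdigit((unsigned char)*cur)) return false;
    while (isdigit((unsigned char)*cur)) cur++;
    return true;
}

static bool child_list(void);

/* <children> ::= "<" <child-list> ">" */
static bool children(void) {
    return eat('<') && child_list() && eat('>');
}

/* The four <child-list> alternatives, left-factored:
   <number> [ "," <children> ] [ "," <child-list> ] */
static bool child_list(void) {
    if (!number()) return false;
    if (peek() != ',') return true;     /* <number> alone */
    cur++;                              /* consume "," */
    if (peek() == '<') {
        if (!children()) return false;
        if (peek() != ',') return true; /* <number> "," <children> */
        cur++;                          /* consume "," */
    }
    return child_list();                /* remaining siblings */
}

/* <tree> ::= <children>, followed by end of input. */
bool recognize(const char *text) {
    cur = text;
    return children() && peek() == '\0';
}
```

With this sketch, `recognize("<1,<2,3,<4,5>>>")` accepts the example from the first post, while input such as `<1,>` or `<<1>>` is rejected.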

**brewbuck** · 12-15-2014

I'm not sure how the representation <A,B> is supposed to represent a tree.

I suppose you could have the convention that <A,B> means a binary tree with an anonymous root and two children, A and B, but when you get to the leaves, how do you represent them? <1> and <2>?

Something like <root,left_subtree,right_subtree> makes more sense, at least if you want the inner nodes to be able to carry values.

**Nominal Animal** · 12-16-2014

The double role the comma has in your spec (it is both a delimiter after the node name and a list delimiter between children) hides a glaring problem in your spec.

Please, let me show how I'd go about this. I'll use ABNF, which is the notation used in Internet RFCs, and definitely useful to know if you work with Internet stuff.

Let's start with a simple recursive definition similar to what brewbuck and others suggested above, but with a colon instead of the first comma, just to make it a bit more readable.

node := "<" value [ ":" [ node ] *( "," node ) ] ">"
value := 1*digit
digit := "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

In other words, a digit is a single digit, a value is one or more digits, and a node begins with a < followed by a value. If the node has child nodes, a : follows, and then the comma-separated list of child nodes. Finally, a > closes the node.

(I'm not assuming binary trees here; the above spec allows any number of child nodes per node.)
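As a rough illustration of how directly this definition maps to code, here is a minimal recursive-descent sketch in C. It assumes no whitespace in the input, and the names `struct node`, `parse_node`, `count_nodes`, and `free_tree` are invented for this example. Children are stored as a first-child / next-sibling list so that any number of children fits.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* First-child / next-sibling representation of an n-ary tree. */
struct node {
    long value;
    struct node *first_child;
    struct node *next_sibling;
};

/* Free a node, its siblings, and everything below them. */
void free_tree(struct node *n) {
    while (n) {
        struct node *next = n->next_sibling;
        free_tree(n->first_child);
        free(n);
        n = next;
    }
}

/* Count a node, its siblings, and all their descendants. */
int count_nodes(const struct node *n) {
    int total = 0;
    for (; n; n = n->next_sibling)
        total += 1 + count_nodes(n->first_child);
    return total;
}

/* node := "<" value [ ":" [ node ] *( "," node ) ] ">"
   Parses one node at *p and advances *p past it on success.
   Returns NULL on a syntax error or allocation failure. */
struct node *parse_node(const char **p) {
    const char *s = *p;
    if (*s != '<') return NULL;
    s++;
    if (!isdigit((unsigned char)*s)) return NULL;

    struct node *n = calloc(1, sizeof *n);
    if (!n) return NULL;
    char *end;
    n->value = strtol(s, &end, 10);     /* value := 1*digit */
    s = end;

    if (*s == ':') {
        struct node **link = &n->first_child;
        bool first = true;              /* only the first child may omit "," */
        s++;
        while (*s == ',' || (first && *s == '<')) {
            if (*s == ',') s++;
            first = false;
            struct node *child = parse_node(&s);
            if (!child) { free_tree(n); return NULL; }
            *link = child;              /* append to the child list */
            link = &child->next_sibling;
        }
    }
    if (*s != '>') { free_tree(n); return NULL; }
    *p = s + 1;
    return n;
}
```

A caller would pass the address of a `const char *` pointing at the text, check the result for NULL, and release the tree with `free_tree` when done.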

A leaf node with a value 5 would be

<5> or
<5:>

If we have node 1 with child nodes 2 and 3, we could specify that with

<1:<2>,<3>>

To describe this tree,

Code:

│      3
│     ╱ ╲
│    ╱   ╲
│   2     5
│  ╱     ╱ ╲
│ 1     4   6

we could use

Code:

<3:<2:<1>>,<5:<4>,<6>>>

A simple recursive definition of a tree will always use pre-order tree traversal. This means that starting at a given node, you emit the node value, then descend into each child starting at the leftmost. Above, we start at 3, descend into 2, then into 1. Since 1 is a leaf node, and 2 only had the left child, we've done the first subtree of node 3. Then we descend to node 5, and from there to leaf node 4. Having completed 4, we only have the second subtree of node 5 to do, and that's node 6.

The above paragraph is the key. If you don't understand it, compare the tree and its specification, and how the specification builds the tree or vice versa, until you grok it.
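To make the traversal order concrete, here is a small C sketch that writes a tree out in the colon syntax by visiting each node before its children. The names `struct node`, `put`, and `emit` are invented for this example, and the caller is assumed to supply a buffer that already holds an empty string; output is silently truncated if the buffer is too small.

```c
#include <stdio.h>
#include <string.h>

/* First-child / next-sibling representation of an n-ary tree. */
struct node {
    int value;
    const struct node *first_child;
    const struct node *next_sibling;
};

/* Append s to the NUL-terminated string in buf (capacity cap). */
static void put(char *buf, size_t cap, const char *s) {
    size_t len = strlen(buf);
    snprintf(buf + len, cap - len, "%s", s);
}

/* Pre-order: emit the node's own value, then each child, leftmost first. */
void emit(const struct node *n, char *buf, size_t cap) {
    size_t len = strlen(buf);
    snprintf(buf + len, cap - len, "<%d", n->value);

    const char *sep = ":";              /* ":" before the first child */
    for (const struct node *c = n->first_child; c; c = c->next_sibling) {
        put(buf, cap, sep);
        emit(c, buf, cap);              /* descend into the subtree */
        sep = ",";                      /* "," between later children */
    }
    put(buf, cap, ">");
}
```

Called on the root of the example tree (3 with children 2 and 5, where 2 has child 1 and 5 has children 4 and 6), this writes the nodes in the order 3, 2, 1, 5, 4, 6, the same order in which they are described above.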

I only used colons above because they make it easier to read (and write) the spec correctly. We can further augment the syntax to not require angle brackets for a leaf node, and to make commas and colons optional and interchangeable:

node := value / "<" value *( [ ":" / "," ] node ) ">"
value := 1*digit
digit := "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

Using this augmented syntax, we could describe the above tree using

<3:<2,1>,<5:4,6>> or
<3,<2,1>,<5,4,6>> or
<3<2,1>,<5,4,6>> or
<3<2<1>><5<4><6>>>

Although the separators make a big difference for human readability, they do not matter to a machine parser. So the choice of which separators to use, where, and whether to require them or leave them optional should be determined by what makes it easiest for humans to understand the format correctly.
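One way to see that the separators carry no information for the parser is to normalize every spelling to the same text. The C sketch below parses the augmented syntax and rewrites it in the fully bracketed, separator-free form. It assumes no whitespace in the input, and the names `struct out`, `put`, `value`, `node`, and `canonical` are invented for this example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Bounded output buffer that is kept NUL-terminated. */
struct out { char *buf; size_t cap, len; };

static bool put(struct out *o, char c) {
    if (o->len + 1 >= o->cap) return false;     /* keep room for '\0' */
    o->buf[o->len++] = c;
    o->buf[o->len] = '\0';
    return true;
}

/* value := 1*digit, copied through unchanged. */
static bool value(const char **p, struct out *o) {
    if (!isdigit((unsigned char)**p)) return false;
    while (isdigit((unsigned char)**p))
        if (!put(o, *(*p)++)) return false;
    return true;
}

/* node := value / "<" value *( [ ":" / "," ] node ) ">" */
static bool node(const char **p, struct out *o) {
    if (!put(o, '<')) return false;
    if (**p != '<')                             /* bare leaf: just a value */
        return value(p, o) && put(o, '>');
    ++*p;
    if (!value(p, o)) return false;
    while (**p != '>' && **p != '\0') {
        if (**p == ':' || **p == ',') ++*p;     /* separator is optional */
        if (!node(p, o)) return false;
    }
    if (**p != '>') return false;
    ++*p;
    return put(o, '>');
}

/* Rewrite text into bracket-only form; false on a syntax error
   or if buf is too small. */
bool canonical(const char *text, char *buf, size_t cap) {
    if (cap == 0) return false;
    struct out o = { buf, cap, 0 };
    buf[0] = '\0';
    return node(&text, &o) && *text == '\0';
}
```

All four spellings listed above normalize to `<3<2<1>><5<4><6>>>`, which is itself the last of the four.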

(Note: Programmers are humans, too. If you thought that "well, since this format is going to be read and written by computers only, I don't need to worry about how easy it is for humans to understand", you'd be wrong: the programmers have to understand the format to be able to implement it. So, human understanding of the format definitely matters, even if none of the users are human.)

The original post describes this tree:

Code:

│    1
│   ╱ ╲
│  ╱   ╲
│ 2     3
│      ╱ ╲
│     4   5

Using the syntaxes defined above, we could describe it using

<1:<2>,<3:<4>,<5>>> or
<1,<2>,<3,<4>,<5>>> or
<1,2,<3,4,5>>

At this point it should be clear that the syntax OP (nickmenphis) wishes to use is either a logic bug, or requires something other than a simple recursive specification: the < and > are used very differently than a simple recursive spec would use them.

Indeed, comparing the text <1,<2,3,<4,5>>> to the tree above shows that in OP's spec, < and > are used to denote a level in the tree, filling nodes from right to left.

nickmenphis, your spec is not going to be easy to implement, and even if you implement it, it will be frustrating to your users. Can you rethink your spec?

**Nominal Animal** · 12-16-2014

If you can pick any syntax you wish, using just digits, <, >, and comma, then I'd suggest

node := value [ "<" *( node "," ) node ">" ]
value := 1*digit
digit := "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

For reasoning, consider the following binary search tree:

Code:

│      4
│     ╱ ╲
│    ╱   ╲
│   2     6
│  ╱ ╲   ╱ ╲
│ 1   3 5   7

This would be

<4,<2,<1>,<3>>,<6,<5>,<7>>> or
<4,<2,1,3>,<6,5,7>>

using the specification in my previous post, but

4<2<1,3>,6<5,7>>

using the spec in this post.

The first tree in my previous post would be 3<2<1>,5<4,6>> using the spec in this post.

The tree mentioned in the initial post in this thread would be 1<2,3<4,5>> using the spec in this post.
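A recognizer for this last spec is short. The following is a minimal C sketch, assuming no whitespace in the input; the names `struct stats`, `node`, and `tree_stats` are invented for this example. Instead of building a tree, it just counts nodes and records the deepest level reached, which is enough to sanity-check the strings above.

```c
#include <ctype.h>
#include <stdbool.h>

/* Summary of a parsed tree: total node count and number of levels. */
struct stats { int nodes; int depth; };

/* node := value [ "<" node *( "," node ) ">" ]
   (the same language as "<" *( node "," ) node ">") */
static bool node(const char **p, int level, struct stats *st) {
    if (!isdigit((unsigned char)**p)) return false;
    while (isdigit((unsigned char)**p)) ++*p;   /* value := 1*digit */
    st->nodes++;
    if (level > st->depth) st->depth = level;

    if (**p != '<') return true;                /* leaf: no child list */
    do {
        ++*p;                                   /* skip "<" or "," */
        if (!node(p, level + 1, st)) return false;
    } while (**p == ',');
    if (**p != '>') return false;
    ++*p;
    return true;
}

/* Parse the whole text; false on any syntax error. */
bool tree_stats(const char *text, struct stats *st) {
    st->nodes = 0;
    st->depth = 0;
    return node(&text, 1, st) && *text == '\0';
}
```

For `4<2<1,3>,6<5,7>>` this reports 7 nodes on 3 levels, and for `1<2,3<4,5>>` it reports 5 nodes on 3 levels. An empty child list such as `1<>` is rejected, since the spec requires at least one node between the brackets.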

**laserlight** · 12-16-2014

*Moved to Tech Board*
