A C program to parse a C program

**Dino** · 4 Days Ago

I'm giving myself an exercise in writing a basic parser for reading C programs. My logic involves state switches (in_string, inside_parens, in_block_comment, etc.) and looking at individual characters, as I'm using fgetc() for input.

I'm asking about the proper approach for parsing the file. I've written several hacked-out parsers for various needs over the years, but those were mostly short and sweet and disposable. My objective for writing a parser this time is to isolate all the function calls, standalone or nested.

I usually take a divide-and-conquer approach, first removing the low-hanging fruit that I don't care about: comments, the contents of strings, trailing blanks, etc.

Once I start reading characters, I'm setting states and keeping stats. How many characters read, how many "words" read (which, depending on your definition of a word, could mean many things), lines read, blah blah.

Once I find an interesting character, like a double quote, a single quote, or a backslash inside a pair of apostrophes, I change states.

Should my parser loop, as it progresses with each new character, give priority to the state I'm in over the character I just read, or should the character drive the state logic?

For example

Code:

while ((c = fgetc(fp)) != EOF) {
   if (in_string) { /* ... */ }
}

or

Code:

while ((c = fgetc(fp)) != EOF) {
   if (c == '"') { /* ... */ }
}

What's your opinion?
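For comparison, here is a minimal sketch of the state-first variant: the outer switch is on the current state, and the character is only examined inside each case. The names (`scan_state`, `blank_noise`) are hypothetical and not from the post, and the sketch assumes the input is already in memory as a string rather than read with fgetc(), purely to keep it short. It blanks out comment bodies and literal contents so that later passes only see code.

```c
#include <stddef.h>

/* Hypothetical state names, chosen for this sketch only. */
enum scan_state { CODE, STRING, CHAR_LIT, BLOCK_COMMENT, LINE_COMMENT };

/* Copies src to dst, replacing comments and the contents of string and
   character literals with spaces. Newlines are kept so line numbers still
   match. dst must have room for strlen(src) + 1 bytes. */
void blank_noise(const char *src, char *dst)
{
    enum scan_state state = CODE;
    size_t i;

    for (i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        char next = src[i + 1];   /* at worst this is the terminating '\0' */

        switch (state) {          /* the state decides how c is interpreted */
        case CODE:
            if (c == '"') { state = STRING; dst[i] = c; }
            else if (c == '\'') { state = CHAR_LIT; dst[i] = c; }
            else if (c == '/' && next == '*') {
                state = BLOCK_COMMENT;
                dst[i] = ' '; dst[++i] = ' ';   /* consume both characters */
            } else if (c == '/' && next == '/') {
                state = LINE_COMMENT;
                dst[i] = ' '; dst[++i] = ' ';
            } else dst[i] = c;
            break;
        case STRING:
        case CHAR_LIT:
            if (c == '\\' && next != '\0') {    /* escape: skip the next char */
                dst[i] = ' ';
                i++;
                dst[i] = (next == '\n') ? '\n' : ' ';
            } else if ((state == STRING && c == '"') ||
                       (state == CHAR_LIT && c == '\'')) {
                state = CODE; dst[i] = c;
            } else dst[i] = ' ';
            break;
        case BLOCK_COMMENT:
            if (c == '*' && next == '/') {
                state = CODE;
                dst[i] = ' '; dst[++i] = ' ';
            } else dst[i] = (c == '\n') ? '\n' : ' ';
            break;
        case LINE_COMMENT:
            if (c == '\n') { state = CODE; dst[i] = '\n'; }
            else dst[i] = ' ';
            break;
        }
    }
    dst[i] = '\0';
}
```

One argument for this layout is that a character such as `"` means different things in different states, so dispatching on the state first keeps each case small. It does not handle trigraphs or backslash-continued line comments.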

I'm finding it convenient to keep track of the previous character. I think it would also be handy to keep track of the next character. For example...

previous is '/'
current is '*'
next is '/'

Right now, if I'm working with current, an asterisk, I can see that previous is a slash. If I'm not in a string, then this is the start of a block comment. When I advance to the next iteration, previous is now '*' and current is '/'. Judging by current and previous alone, I'd appear to be at the end of a block comment, but that's not actually the case, because the '*' already belongs to the opening '/*'. So having 'next' would be handy here.
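With fgetc(), one way to get a 'next' character is to read it and push it back with ungetc(), which the C standard guarantees for at least one character. The sketch below uses a hypothetical `peek_char` helper and a hypothetical `count_block_comments` function; it ignores strings for brevity. Consuming both characters of `/*` as a unit is what stops the `/*/` case from being read as an immediate close.

```c
#include <stdio.h>

/* Hypothetical helper: returns the next character without consuming it. */
int peek_char(FILE *fp)
{
    int c = fgetc(fp);
    if (c != EOF)
        ungetc(c, fp);   /* one character of pushback is always available */
    return c;
}

/* Counts block comments in a stream. String literals are ignored here,
   so this is only an illustration of the lookahead, not a full scanner. */
int count_block_comments(FILE *fp)
{
    int c;
    int count = 0;
    int in_comment = 0;

    while ((c = fgetc(fp)) != EOF) {
        if (!in_comment && c == '/' && peek_char(fp) == '*') {
            fgetc(fp);          /* consume '*' so it cannot also close */
            in_comment = 1;
            count++;
        } else if (in_comment && c == '*' && peek_char(fp) == '/') {
            fgetc(fp);          /* consume the closing '/' */
            in_comment = 0;
        }
    }
    return count;
}
```

With this approach there is no need to track 'previous' at all for comment delimiters, since each two-character delimiter is recognized from its first character.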

**Dino** · 3 Days Ago

Y'all can ignore this. I've decided that for the purposes of just finding all the function calls, I don't need this level of parsing (tokenizing). I can probably just set up a regex to find function names once I de-comment it and isolate the strings, and I have both of those working already.

**aghast** · 2 Days Ago

Actually, you cannot.

It is well-known that C is ambiguous with respect to declarations vs. expressions, like:

A * B() = C;

In order to decide if that is a declaration or an expression statement, you must have a symbol table with type information. Regular expressions are not powerful enough for this.
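As a rough illustration of what such a symbol table involves, the sketch below keeps a list of names known to be types and classifies a statement by its leading identifier. Everything here (`type_table`, `add_type`, `is_type`, `classify`) is hypothetical and deliberately simplistic: it assumes a flat table with no scoping, whereas real C lets an inner-scope variable shadow a typedef name.

```c
#include <string.h>

#define MAX_TYPES 64

/* Hypothetical fixed-size table of typedef names seen so far. */
struct type_table {
    const char *names[MAX_TYPES];
    int count;
};

/* Records a typedef name; returns 0 if the table is full. */
int add_type(struct type_table *t, const char *name)
{
    if (t->count >= MAX_TYPES)
        return 0;
    t->names[t->count++] = name;
    return 1;
}

/* Returns 1 if name was previously recorded as a type. */
int is_type(const struct type_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return 1;
    return 0;
}

enum stmt_kind { DECLARATION, EXPRESSION };

/* Classifies a statement that begins "X * ..." by looking up X:
   a known type name means a pointer declaration, anything else
   means a multiplication expression. */
enum stmt_kind classify(const struct type_table *t, const char *leading_ident)
{
    return is_type(t, leading_ident) ? DECLARATION : EXPRESSION;
}
```

The same text is classified differently depending on whether a `typedef ... A;` was seen earlier, which is information a regular expression over a single line does not have.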

You may be able to use a regex and coding standards to get 95% of the way there. In fact, that is what C programmers historically did. But with the advent of ANSI C and more and more things involving parentheses, the remaining 5% has started to get bigger and bigger.

If you write a lexer to do tokenization, plus the rudimentary symbol tracking necessary to recognize types, you will find yourself able to accomplish a surprising amount. And it will be useful code you might be able to re-use, as opposed to a snake-pit of regular expressions you won't understand three weeks from now.
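A minimal sketch of the tokenizing half, under the assumption that comments and string contents have already been removed: scan for an identifier followed by `(` and skip keywords that take parentheses. The function name `next_call_candidate` and the keyword list are hypothetical, and the list is partial. The sketch reports candidates only; without type tracking it cannot tell a call from a function declaration or definition such as `int main(void)`.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Keywords that may be followed by '(' but are not calls (partial list). */
static int is_keyword(const char *s)
{
    static const char *const kw[] = {
        "if", "while", "for", "switch", "return", "sizeof"
    };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strcmp(s, kw[i]) == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: scans forward from *cursor for an identifier that is
   followed by '('. On success copies it into name (capacity cap), leaves
   *cursor just past the identifier, and returns 1. Returns 0 at end of
   input. Identifiers too long for name are skipped. */
int next_call_candidate(const char **cursor, char *name, size_t cap)
{
    const char *p = *cursor;

    while (*p != '\0') {
        if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            size_t len = (size_t)(p - start);

            const char *q = p;              /* look past whitespace for '(' */
            while (isspace((unsigned char)*q))
                q++;

            if (*q == '(' && len < cap) {
                memcpy(name, start, len);
                name[len] = '\0';
                if (!is_keyword(name)) {
                    *cursor = p;
                    return 1;
                }
            }
        } else if (isdigit((unsigned char)*p)) {
            /* Skip numeric literals such as 0x1f so their letters
               are not mistaken for identifiers. */
            while (isalnum((unsigned char)*p) || *p == '.')
                p++;
        } else {
            p++;
        }
    }
    *cursor = p;
    return 0;
}
```

Because the cursor stops right after the identifier rather than after the closing parenthesis, nested calls such as `foo(bar(1), baz(2))` are reported one by one as `foo`, `bar`, `baz`.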
