I'm giving myself an exercise in writing a basic parser for C programs. My logic uses state flags (in_string, inside_parens, in_block_comment, etc.) and examines one character at a time, since I'm using fgetc() for input.
I'm asking about the proper approach for parsing the file. I've written several hacked-out parsers for various needs over the years, but those were mostly short and sweet and disposable. My objective for writing a parser this time is to isolate all the function calls, standalone or nested.
I usually take a divide-and-conquer approach, first removing the low-hanging fruit that I don't care about: comments, the contents of strings, trailing blanks, etc.
Once I start reading characters, I set states and keep stats: how many characters read, how many "words" read (which could mean many things, depending on your definition of a word), lines read, and so on.
Once I find an interesting character, like a double quote, a single quote, or a backslash inside a pair of apostrophes, I change states.
Should my parser loop, as it progresses with each new character, consider the state I'm in over the character I read, or should the character drive the state logic?
For example:
Code:
while (get a char) { if (in_string) { .... }
or
Code:
while (get a char) { if (c == '"') { .... }
What's your opinion?
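To make the first option concrete, the state-first variant could look roughly like the sketch below. The state names (ST_CODE, ST_STRING, ...) and the helpers step() and count_code_chars() are made up for illustration, and the sketch reads from a string rather than fgetc() to keep it short. It only tracks string and character literals, not comments or parentheses.

```c
#include <stddef.h>

/* Hypothetical state names, chosen for this sketch only. */
enum lex_state { ST_CODE, ST_STRING, ST_STRING_ESC, ST_CHAR, ST_CHAR_ESC };

/* State-first dispatch: the outer switch asks "where am I?",
   and only then is the character examined. */
static enum lex_state step(enum lex_state st, int c)
{
    switch (st) {
    case ST_CODE:
        if (c == '"')  return ST_STRING;
        if (c == '\'') return ST_CHAR;
        return ST_CODE;
    case ST_STRING:
        if (c == '\\') return ST_STRING_ESC;  /* next char is escaped */
        if (c == '"')  return ST_CODE;
        return ST_STRING;
    case ST_STRING_ESC:
        return ST_STRING;                     /* escaped char ends nothing */
    case ST_CHAR:
        if (c == '\\') return ST_CHAR_ESC;
        if (c == '\'') return ST_CODE;
        return ST_CHAR;
    case ST_CHAR_ESC:
        return ST_CHAR;
    }
    return st;
}

/* Counts characters that lie outside string and character literals.
   The opening and closing quotes themselves are not counted. */
static size_t count_code_chars(const char *src)
{
    enum lex_state st = ST_CODE;
    size_t n = 0;

    for (; *src != '\0'; ++src) {
        enum lex_state next = step(st, (unsigned char)*src);
        if (st == ST_CODE && next == ST_CODE)
            n++;
        st = next;
    }
    return n;
}
```

With this layout a double quote is never looked at in isolation: the same character opens a string in ST_CODE, closes one in ST_STRING, and is ignored in ST_STRING_ESC or ST_CHAR. A loop driven by fgetc() could call step() the same way, once per character read.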
I'm finding it convenient to keep track of the previous character. I think it would also be handy to keep track of the next character. For example...
previous is '/'
current is '*'
next is '/'
Right now, if I'm working with current, an asterisk, I can see that previous is a slash. If I'm not in a string, then this is the start of a block comment. When I advance to the next iteration, previous is now '*' and current is '/', and if I consider only current and previous, it looks like I'm at the end of a block comment, but that's not really the case. So having 'next' would be handy here.
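One way to get a 'next' character while still reading with fgetc() is to read one character ahead and push it back with ungetc(), which the C standard guarantees for at least one character. The sketch below is only an illustration under that assumption. peek_char() and strip_block_comments() are hypothetical names, and string and character literals are deliberately ignored to keep it short.

```c
#include <stdio.h>

/* Hypothetical helper: return the next character without consuming it. */
static int peek_char(FILE *in)
{
    int c = fgetc(in);
    if (c != EOF)
        ungetc(c, in);   /* one character of pushback is guaranteed */
    return c;
}

/* Copy in to out, replacing each block comment with a single space.
   String and character literals are NOT handled in this sketch. */
static void strip_block_comments(FILE *in, FILE *out)
{
    int in_comment = 0;
    int c;

    while ((c = fgetc(in)) != EOF) {
        if (!in_comment) {
            if (c == '/' && peek_char(in) == '*') {
                fgetc(in);        /* consume the '*' so it cannot pair with a later '/' */
                in_comment = 1;
            } else {
                fputc(c, out);
            }
        } else if (c == '*' && peek_char(in) == '/') {
            fgetc(in);            /* consume the '/' that closes the comment */
            in_comment = 0;
            fputc(' ', out);
        }
    }
}
```

Because both characters of the opening delimiter are consumed together, the '*' is never left behind as "previous", so an input such as a/*/b*/c comes out as "a c" instead of ending the comment after three characters.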