Yes, parsing C programs is slow. There are really two reasons for parsing C. One is "compilation." This is the one everybody focuses on, but that's wrong. The second reason, which occurs far more often than compilation, is parsing for development -- syntax highlighting, code completion, etc.
Consider how C is described as being parsed by the standard: first, there is a "preprocessor" phase, which can make entire segments of code appear (#include) or disappear (#ifdef). It can also expand some bits of code into entire paragraphs of text (#define) which may change the nesting level of parens or braces or brackets. Then comes the "standard" C grammar.
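To make the #define point concrete, here's a small sketch. The BEGIN/END macros and the sum_to function are made up for illustration, not taken from any real codebase. The function compiles fine, but a tool reading the raw text never sees a single brace in its body:

```c
#include <assert.h>

/* Hypothetical macros, in the style of old Pascal-flavored C code. */
#define BEGIN {
#define END   }

/* Sums 1..n. The body below contains no literal braces; they only
   appear after the preprocessor has expanded BEGIN and END. */
int sum_to(int n)
BEGIN
    int total = 0;
    assert(n >= 0);
    for (int i = 1; i <= n; i++)
    BEGIN
        total += i;
    END
    return total;
END
```

An editor that matches braces on the raw text has nothing to match here, even though the compiler sees two properly nested blocks.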
Standard C syntax was built by someone with a fascination for the latest thing. At the time, the latest thing was parser generators. As a result, C is not friendly to top-down recursive descent parsing. Instead, it is aimed directly at bottom-up stack-based parsing.
Sadly, because the promise of parser generators was never realized, most C compilers to this day use hand-built top-down recursive-descent parsers. Not because of performance, but in order to provide better error handling. The primary job of a present-day compiler is not to compile code, but to detect syntax errors and point them out to the user.
So as someone who wants to "parse" C code, you now have two problems: you don't know what the code really says until the preprocessor gets done, and it's hard to write a "smart" parser because the C grammar is optimized for "dumb" parsers.
Consider:
Code:
#ifdef __STDC__
size_t
#else
int
#endif
strlen(
#ifdef __STDC__
const
#endif
char * s)
Now, is that a function declaration or a function definition?
[SPOILER]
You don't know! Because the answer depends on the next character.
[/SPOILER]
The most unappreciated developers in the world are the ones trying to implement "syntax highlighting" in text editors.
If you look at "modern" programming languages -- anything written since the 1980s or so -- you'll discover that almost all of them have a "leading keyword" approach to declaring/defining identifiers. (This is also true of languages like Pascal from before the 1980s. Those were the languages that were implemented using top-down recursive-descent (TDRD) parsing!)
In C, you might say:
Code:
int foo(int n, const char * s);
The problem with this is that until you reach the ';' you don't know what's happening. Is it a declaration or a definition?
In Julia the keyword 'function' appears as the first word:
Code:
function foo(n, s)
...
end
In Go, the keyword 'func' appears:
Code:
func foo(n int, s string) {
...
}
In Ruby, the key words are 'def' and 'end', and there are no types:
Code:
def foo(n, s)
...
end
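In all three of those languages, a tool can tell it is looking at a function from the very first token. As a rough sketch of how little work that takes (starts_with_keyword is a name I made up for this post, and it assumes the keyword is followed by a single space):

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `line` begins with the keyword `kw` followed by a space,
   0 otherwise. Hypothetical helper, for illustration only. */
int starts_with_keyword(const char *line, const char *kw)
{
    size_t n;
    assert(line != NULL && kw != NULL);
    n = strlen(kw);
    /* If the first n characters match, line[n] is safe to read:
       it is either the next character or the terminating NUL. */
    return strncmp(line, kw, n) == 0 && line[n] == ' ';
}
```

One comparison against the start of the line and you're done. There is no equivalent check for C, because a C function starts with an arbitrary type name.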
Why does this matter? Because if you have a key word right at the start that tells you what is happening, the language gets easier to parse! Compare this with C, reading one token at a time. Say all we have seen so far is 'int':
What's coming next? You don't know! It's a type name, but it could be: a (declaration | definition) of a (global | static) (variable | function). Eight possible things!
How about now, after 'int foo'? Well, there's no static keyword, so we can eliminate that possibility. But we don't know if it's a declaration, definition, function, or variable. Four possible things.
Next comes 'int foo('. You and I reading this, we know this is a function declaration or definition. Two possible things! But any kind of parser is stuck down in a loop scanning for complex nested declarators, so it will be a while before it tells us the same thing.
Code:
int foo(int n, const char * s)
Now? Nothing. We know the parameter list, but still don't know if it's a declaration or definition. Two possible things.
Code:
int foo(int n, const char *s) {
Finally! The open-curly means this is a definition! Hooray! Only one possible thing now!
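Here's a minimal sketch of that "keep scanning until something settles it" loop. The names classify and decl_kind are mine, and it only makes sense for function-style declarators: it ignores comments, string literals, initializers, and everything the preprocessor might do.

```c
#include <assert.h>
#include <stddef.h>

enum decl_kind { KIND_UNKNOWN, KIND_DECLARATION, KIND_DEFINITION };

/* Scans `src` for the first ';' or '{' that is outside any parentheses.
   ';' means declaration, '{' means definition. Running out of input
   first means we cannot tell yet. */
enum decl_kind classify(const char *src)
{
    int depth = 0;  /* current parenthesis nesting level */
    assert(src != NULL);
    for (const char *p = src; *p != '\0'; p++) {
        if (*p == '(')
            depth++;
        else if (*p == ')')
            depth--;
        else if (depth == 0 && *p == ';')
            return KIND_DECLARATION;
        else if (depth == 0 && *p == '{')
            return KIND_DEFINITION;
    }
    return KIND_UNKNOWN;
}
```

Feed it the three versions of foo from this post: the one ending in ';' comes back as KIND_DECLARATION, the one ending in '{' as KIND_DEFINITION, and the one that stops at the closing paren as KIND_UNKNOWN. Note that the answer always arrives at the last character, never earlier.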
Now let's do this one:
Code:
void (*signal(int sig, void (*func)(int)))(int);