Thread: Is parsing C's syntax slow?

  1. #1
    Registered User
    Join Date
    Oct 2021
    Posts
    138

    Is parsing C's syntax slow?

    I've heard people saying that parsing the syntax of the C programming language is slow. This statement was brought up when I was talking with some people about the syntax of more "modern" languages like Go, Rust, Kotlin etc. and some said that one of the reasons that we are moving away from C-like syntax is the parsing speed because parsing C syntax is slower.

    Does anyone knows if this statement is true because I can't see how we can really see important speed up by changing to a different syntax. Well there are things we can change and this is mostly to remove some stuff to not have to check for more cases (for example remove some operators like "++" and "--") but I can't think about anything else.

    I'm actually not only asking out of curiosity but because I want to create a transpiler to C (or modify the source code of tcc if I'm not lazy enough to read and understand it) and I want to learn how I can create a fast parser cause compilation speed are level 0 of importance! So yeah, someone that has some experience with parsing languages and can explain some things to me would be very much appreciated!

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    That's probably the most BS reason for choosing a language I've ever heard.

    > I'm actually not only asking out of curiosity but because I want to create a transpiler to C
    Blimey, you changed your mind quickly.
    Transpile to C or create a full compiler (with an assembler and linker)?
    Thanks tho I don't want to create my own language anymore.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Oct 2021
    Posts
    138
    > That's probably the most BS reason for choosing a language I've ever heard.

    I mean when it comes to creating a language tho, I do care about compile times so if there are things that I don't see and can give a 20-30% improvement then imo it's worth it!

    > <ALL THE REST>
    Yeah yeah you will probably think that I'm crazy and I don't know what I want (fun fact: that's true) but well.... Welcome to my world! The truth is that the language I was interested, while it is amazing, It is still in Alpha and not stable and productive ready and I'm always finding bugs that I report and because I want to write code ASAP, I cannot wait for it to be stable.

    Keep in mind that making my own language was always the ideal way but there are A LOT of problems and difficulties that I will have to consider before doing that so If I can find something that is ready and stable (which was the problem with the last language like I said) I will choose that. However, day by day I realize that making my own is the only way.

    The problem is that I'm lazy AS ........!!! Well I just give up when things become taught and difficult to understand. That's something I'm slowly fixing however, the last 3 years, I'm mostly depressing and I don't have motivation to do things that seem to hard because even the little hard things look like a chore. Some times it's better some times is worse. But it's mostly worse all this time. I never learnt to have goal and try as hard as it needs to achieve it. I will give up as soon as things start to look to hard for me and I'm start getting stressed out. Because I don't enjoy the rest of my day (with the exception of maybe 1-2 hours I'm with some friends), I think about at least relaxing and chilling for the rest of my free time so I can at least enjoy something out . So trying to learn how to create a compiler means learning and FULLY understanding assembly and then read the source of TCC and figure things out (and C and C++ source code seems extremely hard/annoying for some ........ing reason) and by the first time that seems to complicated, I give up.

    Anyway, the past is in the past and I decided that I will fix this problem and I will try my best. I'm currently reading the source code of TCC and we'll see. Even if I give up again, I know I will come back after some days (hopefully not more) and I will try again and again making a small progress at time. And maybe in some years I will be able to make something, lol!! Anyway, no more, you are not my psychiatrist so you shouldn't go though all this.

    I don't know if you are still reading but in any case, you have answer all of the questions I made in this forum so thanks a lot! And you have the best profile pic EVER!!!

  4. #4
    Registered User
    Join Date
    Apr 2021
    Posts
    140
    Quote Originally Posted by rempas View Post
    I've heard people saying that parsing the syntax of the C programming language is slow. This statement was brought up when I was talking with some people about the syntax of more "modern" languages like Go, Rust, Kotlin etc. and some said that one of the reasons that we are moving away from C-like syntax is the parsing speed because parsing C syntax is slower.

    Does anyone knows if this statement is true because I can't see how we can really see important speed up by changing to a different syntax. Well there are things we can change and this is mostly to remove some stuff to not have to check for more cases (for example remove some operators like "++" and "--") but I can't think about anything else.
    Yes, parsing C programs is slow. There are really two reasons for parsing C. One is "compilation." This is the one everybody focuses on, but that's wrong. The second reason, which occurs far more often than compilation, is parsing for development -- syntax highlighting, code completion, etc.

    Consider how C is described as being parsed by the standard: first, there is a "preprocessor" phase, which can make entire segments of code appear (#include) or disappear (#ifdef). It can also expand some bits of code into entire paragraphs of text (#define) which may change the nesting level of parens or braces or brackets. Then comes the "standard" C grammar.

    Standard C syntax was built by someone with a fascination for the latest thing. At the time, the latest thing was parser generators. As a result, C is not friendly to top-down recursive descent parsing. Instead, it is aimed directly at bottom-up stack-based parsing.

    Sadly, because the promise of parser generators was never realized, most C compilers to this day use hand-built top-down recursive-descent parsers. Not because of performance, but in order to provide better error handling. The primary job of a present-day compiler is not to compile code, but to detect syntax errors and point them out to the user.

    So as someone who wants to "parse" C code, you now have two problems: you don't know what the code really says until the preprocessor gets done, and it's hard to write a "smart" parser because the C grammar is optimized for "dumb" parsers.

    Consider:

    Code:
        #ifdef __STDC__
        size_t
        #else
        int
        #endif
        strlen(
        #   ifdef __STDC__
            const
        #endif
            char * s)
    Now, is that a function declaration or a function definition?

    [SPOILER]
    You don't know! Because the answer depends on the next character.
    [/SPOILER]

    The most unappreciated developers in the world are the ones trying to implement "syntax highlighting" in text editors.

    If you look at "modern" programming languages -- anything written since the 1980s or so -- you'll discover that almost all of them have a "leading keyword" approach to declaring/defining identifiers. (This is also true of languages like Pascal from before the 1980's. Those were the languages that were implemented using TDRD parsing!)

    In C, you might say:

    Code:
        int foo(int n, const char * s);
    The problem with this is that until you reach the ';' you don't know what's happening. Is it a declaration or a definition?

    In Julia the keyword 'function' appears as the first word:
    Code:
    function foo(n, s)
        ...
    end
    In Go, the keyword 'func' appears:
    Code:
    func foo(n int, s string) void {
        ...
    }
    In Ruby, the key words are 'def' and 'end', and there are no types:
    Code:
    def foo(n, s)
        ...
    end
    Why does this matter? Because if you have a key word right at the start that tells you what is happening, the language gets easier to parse! Compare this with C:

    Code:
    int
    What's coming next? You don't know! It's a type name, but it could be: a (declaration | definition) of a (global | static) (variable | function). Eight possible things!

    Code:
    int foo
    How about now? Well, there's no static keyword, so we can eliminate that possibility. But we don't know if it's a declaration, definition, function, or variable. Four possible things.

    Code:
    int foo(
    You and I reading this, we know this is a function declaration or definition. Two possible things! But any kind of parser is stuck down in a loop scanning for complex nested declarators, so it will be a while before it tells us the same thing.

    Code:
    int foo(int n, const char * s)
    Now? Nothing. We know the parameter list, but still don't know if it's a declaration or definition. Two possible things.

    Code:
    int foo(int n, const char *s) {
    Finally! The open-curly means this is a definition! Hooray! Only one possible thing now!

    Now let's do this one:

    Code:
    void (*signal(int sig, void (*func)(int)))(int);
    Last edited by aghast; 12-05-2021 at 03:01 PM. Reason: Add more stuff

  5. #5
    Registered User
    Join Date
    Sep 2020
    Posts
    150

    Thumbs up

    @aghast,
    that's really a good and detailed explanation.

  6. #6
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    Quote Originally Posted by aghast View Post
    Yes, parsing C programs is slow. There are really two reasons for parsing C. One is "compilation." This is the one everybody focuses on, but that's wrong. The second reason, which occurs far more often than compilation, is parsing for development -- syntax highlighting, code completion, etc.

    Consider how C is described as being parsed by the standard: first, there is a "preprocessor" phase, which can make entire segments of code appear (#include) or disappear (#ifdef). It can also expand some bits of code into entire paragraphs of text (#define) which may change the nesting level of parens or braces or brackets. Then comes the "standard" C grammar.

    Standard C syntax was built by someone with a fascination for the latest thing. At the time, the latest thing was parser generators. As a result, C is not friendly to top-down recursive descent parsing. Instead, it is aimed directly at bottom-up stack-based parsing.

    Sadly, because the promise of parser generators was never realized, most C compilers to this day use hand-built top-down recursive-descent parsers. Not because of performance, but in order to provide better error handling. The primary job of a present-day compiler is not to compile code, but to detect syntax errors and point them out to the user.

    So as someone who wants to "parse" C code, you now have two problems: you don't know what the code really says until the preprocessor gets done, and it's hard to write a "smart" parser because the C grammar is optimized for "dumb" parsers.

    Consider:

    Code:
        #ifdef __STDC__
        size_t
        #else
        int
        #endif
        strlen(
        #   ifdef __STDC__
            const
        #endif
            char * s)
    Now, is that a function declaration or a function definition?

    [SPOILER]
    You don't know! Because the answer depends on the next character.
    [/SPOILER]

    The most unappreciated developers in the world are the ones trying to implement "syntax highlighting" in text editors.

    If you look at "modern" programming languages -- anything written since the 1980s or so -- you'll discover that almost all of them have a "leading keyword" approach to declaring/defining identifiers. (This is also true of languages like Pascal from before the 1980's. Those were the languages that were implemented using TDRD parsing!)

    In C, you might say:

    Code:
        int foo(int n, const char * s);
    The problem with this is that until you reach the ';' you don't know what's happening. Is it a declaration or a definition?

    In Julia the keyword 'function' appears as the first word:
    Code:
    function foo(n, s)
        ...
    end
    In Go, the keyword 'func' appears:
    Code:
    func foo(n int, s string) void {
        ...
    }
    In Ruby, the key words are 'def' and 'end', and there are no types:
    Code:
    def foo(n, s)
        ...
    end
    Why does this matter? Because if you have a key word right at the start that tells you what is happening, the language gets easier to parse! Compare this with C:

    Code:
    int
    What's coming next? You don't know! It's a type name, but it could be: a (declaration | definition) of a (global | static) (variable | function). Eight possible things!

    Code:
    int foo
    How about now? Well, there's no static keyword, so we can eliminate that possibility. But we don't know if it's a declaration, definition, function, or variable. Four possible things.

    Code:
    int foo(
    You and I reading this, we know this is a function declaration or definition. Two possible things! But any kind of parser is stuck down in a loop scanning for complex nested declarators, so it will be a while before it tells us the same thing.

    Code:
    int foo(int n, const char * s)
    Now? Nothing. We know the parameter list, but still don't know if it's a declaration or definition. Two possible things.

    Code:
    int foo(int n, const char *s) {
    Finally! The open-curly means this is a definition! Hooray! Only one possible thing now!

    Now let's do this one:

    Code:
    void (*signal(int sig, void (*func)(int)))(int);
    I feel like it's actually possible to go bottom up with C/C++ and actually compile faster as a result, for example when ever you encounter and #endif you would expect either a #if or #ifdef by the time you get back to line 0 in header 0, when encountering symbol usage you can just generate a declaration and mark it as implicit, should you find a declaration you simply update the implicit declaration, later when you actually go through the generated objects you can then check if the pointed declarations are implicit or explicit and if their usage matches the explicit declaration, for the contents of parsed #endif blocks you can just check the pointed #if/#ifdef block to see if it should be kept or disposed of before continuing to the next block. Heck maybe this method might be faster, I'll see if I can implement such a parser after I'm done trying to write up a custom signal based pseudo mutex/semaphore system for my threads (it's to ensure I have a fallback for things like the PS3 where you had to write your own thread system, even if I don't plan to program for those systems doesn't mean someone in the future who's branched my code won't).

  7. #7
    Registered User
    Join Date
    Oct 2021
    Posts
    138
    Quote Originally Posted by aghast View Post
    Yes, parsing C programs is slow....
    You explained the best way possible really! Thanks a lot man, I'm really grateful!

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Array Syntax Versus Pointer Syntax
    By LyTning94 in forum C++ Programming
    Replies: 6
    Last Post: 12-06-2011, 10:56 AM
  2. QT GUI is very slow.
    By System_159 in forum C++ Programming
    Replies: 1
    Last Post: 04-08-2011, 11:49 PM
  3. Syntax parsing and Dictonaries
    By @nthony in forum C Programming
    Replies: 2
    Last Post: 05-26-2007, 12:05 PM
  4. String parsing(parsing comments out of HTML file)
    By slcjoey in forum C# Programming
    Replies: 0
    Last Post: 07-29-2006, 08:28 PM
  5. Parsing C and making syntax highlights, probelm with comments
    By Lynux-Penguin in forum C++ Programming
    Replies: 1
    Last Post: 07-22-2003, 11:20 AM

Tags for this Thread