Thread: Verifying absence of brackets and parentheses

  1. #16
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by phantomotap View Post
    I think conditions like "single quote" are not consumed/discarded properly with respect to updating the state because only the characters representing the start of such a sequence is consumed/discarded before transition back into a state where `actives' may be updated.
    There is a bug on line 114, where there is an extra else, that caused NORMAL_CODE to fall through to SINGLE_QUOTED. Remove the extra else, and the code seems to work.

    You see, the loop iteration always begins by reading the next character. The first ' in NORMAL_CODE state causes the state to change to SINGLE_QUOTED, but by the time you get to the relevant case statement, the c now refers to a new character because it was read in the loop condition. Since there is no transition to NORMAL_CODE unless there is an end of input (which ends the loop) or a ', single-quoted input is consumed witihin that state.

    Note that this code is only intended to scan, not fix, any input.

  2. #17
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    I decided to confirm my assumption, but on reading your post again I guess you didn't aim in that direction?

    You don't want to do nesting coherence for string and character literals at all?

    I'm only confused because it does look like you put effort in that direction for the other bits.

    [Edit]
    To see what I'm talking about put a single '\'' character on line 151 (That would be before applying the `else' fix which I did for consistency.). The error you get is unrelated to the '\'' character because the code treats several lines as being part of the character literal.
    [/Edit]

    Soma

  3. #18
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by phantomotap View Post
    I decided to confirm my assumption, but on reading your post again I guess you didn't aim in that direction?
    You're absolutely right, sorry!

    I took off on the tangent thames stated in post #6, but neglected to mention it. Ouch.

    Before post #6, the code and logic I've posted targets the problem thames stated in the original post in this thread.

    After post #6, my code and logic targets the K&R problem thames mentioned: "Write a program to check a C program for rudimentary syntax errors like unbalanced parentheses, brackets and braces. Don't forget about quotes, both single and double, escape sequences, and comments. "

    Apologies for derailing the thread.

    Quote Originally Posted by phantomotap View Post
    To see what I'm talking about put a single '\'' character on line 151 (That would be before applying the `else' fix which I did for consistency.). The error you get is unrelated to the '\'' character because the code treats several lines as being part of the character literal.
    Yes. There are two reasons why that happens.

    First, the state machine progresses through the code, and assumes the code thus far is correct. When it encounters a single quote, it must assume it is the start of a character literal.

    Second, the state machine treats string literals and character literals exactly the same way; it does not make any assumptions about the length or contents of the character literal. This is truly a bug, because string literals are not allowed to contain unescaped newlines, whereas string literals are.

    Fixing the bug does seem to make it yield correct output. Adding fflush(stdout); just before outputting any error messages will also help pinpoint the location of the error, since the error message should appear right after the character that made the machine to recognize the error was output.


    Let me also say a big thank you. Not only just because you've pointed at least two bugs in my code already, but you have made me think about the state machine from different perspective, not just from the viewpoints that come to me naturally. I do my best work in a team where that happens and is encouraged; where discovering holes and bugs and weaknesses in ones implementation is genuinely appreciated. I definitely do, my code can always benefit from another pair of sharp eyes.

    If you or anybody else have the time and interest to test it, here is the latest version of my code with both fixes and the enhancements included:
    Code:
    #include <stdio.h>
    
    #define MAX_ACTIVES 1024
    
    enum comment_states {
        NORMAL_CODE = 0,
        SINGLE_QUOTED,
        DOUBLE_QUOTED,
        AFTER_SLASH,
        CPP_COMMENT,
        C_COMMENT,
        C_COMMENT_ASTERISK
    };
    
    static unsigned long line = 1UL;
    
    static inline int next(void)
    {
        int  c;
    
        c = getc(stdin);
    
        if (c == '\n') {
            line++;
            c = getc(stdin);
            if (c != '\r') {
                ungetc(c, stdin);
                fputc('\n', stdout);
            } else
                fputs("\n\r", stdout);
            return '\n';
    
        } else
        if (c == '\r') {
            line++;
            c = getc(stdin);
            if (c != '\n') {
                ungetc(c, stdin);
                fputc('\r', stdout);
            } else
                fputs("\r\n", stdout);
            return '\n';
    
        } else
        if (c != EOF) {
            fputc(c, stdout);
            return c;
        }
    
        return EOF;
    }
    
    
    static inline int pair(const int c)
    {
        switch (c) {
        case '(': return ')';
        case ')': return '(';
        case '[': return ']';
        case ']': return '[';
        case '{': return '}';
        case '}': return '{';
        default:  return '\0';
        }
    }
    
    
    int main(void)
    {
        enum comment_states  state = NORMAL_CODE;
        char                 active[MAX_ACTIVES];
        int                  actives = 0;
        int  c;
    
        while (EOF != (c = next()))
            switch (state) {
    
            case AFTER_SLASH:
                if (c == '/') {
                    state = CPP_COMMENT;
                    break;
                } else
                if (c == '*') {
                    state = C_COMMENT;
                    break;
                }
                state = NORMAL_CODE;
    
            case NORMAL_CODE:
                if (c == '/')
                    state = AFTER_SLASH;
                else
                if (c == '"')
                    state = DOUBLE_QUOTED;
                else
                if (c == '\'')
                    state = SINGLE_QUOTED;
                else
                if (c == '(' || c == '[' || c == '{') {
                    if (actives >= MAX_ACTIVES) {
                        fflush(stdout);
                        fprintf(stderr, "Line %lu: Too deep nesting.\n", line);
                        return 1;
                    }
                    active[actives++] = c;
                } else
                if (c == ')' || c == ']' || c == '}') {
                    if (actives < 1) {
                        fflush(stdout);
                        fprintf(stderr, "Line %lu: '%c' without a prior '%c'.\n", line, c, pair(c));
                    } else
                    if (active[actives - 1] != pair(c)) {
                        fflush(stdout);
                        fprintf(stderr, "Line %lu: '%c', but expected '%c'.\n", line, c, pair(active[actives - 1]));
                    } else
                        actives--;
                }
                break;
    
            case SINGLE_QUOTED:
                if (c == '\\')
                    next();
                else
                if (c == '\'')
                    state = NORMAL_CODE;
                else
                if (c == '\n') {
                    fflush(stdout);
                    fprintf(stderr, "Line %lu: Missing terminating ' character.\n", line - 1UL);
                    state = NORMAL_CODE;
                }
                break;
    
            case DOUBLE_QUOTED:
                if (c == '\\')
                    next();
                else
                if (c == '\"')
                    state = NORMAL_CODE;
                break;                
    
            case CPP_COMMENT:
                if (c == '\n')
                    state = NORMAL_CODE;
                break;
    
            case C_COMMENT:
                if (c == '*')
                    state = C_COMMENT_ASTERISK;
                break;
    
            case C_COMMENT_ASTERISK:
                if (c == '/')
                    state = NORMAL_CODE;
                else
                if (c != '*')
                    state = C_COMMENT;
                break;
            }
    
        fflush(stdout);
    
        if (state == C_COMMENT || state == C_COMMENT_ASTERISK)
            fprintf(stderr, "Line %lu: Expected end of comment */ before end of input.\n", line);
        else
        if (state == DOUBLE_QUOTED)
            fprintf(stderr, "Line %lu: Expected '\"' before end of input.\n", line);
        else
        if (state == SINGLE_QUOTED)
            fprintf(stderr, "Line %lu: Expected '\'' before end of input.\n", line);
    
        while (actives > 0)
            fprintf(stderr, "Line %lu: Expected '%c' before end of input.\n", line, pair(active[--actives]));
    
        return 0;
    }
    While the code is nowhere near a C lexer (scanner), to me it is surprisingly compact compared to what it achieves.

    Considering the C syntax, I think that if you wanted to add new features like detecting missing semicolons, it would be easier to overhaul the state machine, and make it into a genuine C parser/scanner instead. The state map would become completely different, but the basic structure should not be that much different.

    C syntax is complex enough that I'd actually turn to a parser generator. For example, GNU Bison can generate the C, C++, or Java code needed, and you can find syntax files for Bison to generate a C parser on the net.

  4. #19
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    You're absolutely right, sorry!
    Apologies for derailing the thread.
    *shrug*

    No foul here.

    I also rather doubt the original poster is much put off considering that he could have learned a bit from the discussion.

    Let me also say a big thank you.
    Cheers.

    Soma

  5. #20
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193
    Quote Originally Posted by phantomotap View Post
    I also rather doubt the original poster is much put off considering that he could have learned a bit from the discussion.
    Indeed. Every bit of information is appreciated.
    I was wondering about portability. Has the program to be ansi compliant to be portable? if yes, does inline make the program unportable?
    Last edited by thames; 11-04-2012 at 08:46 AM.

  6. #21
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by thames View Post
    Has the program to be ansi compliant to be portable?
    All the environments I personally care about have C99 support. In fact, I'm more concerned with C++98/C++08/C++11 compatibility nowadays.

    ANSI C refers to ANSI X3.159-1989, also known as C89, also known as ISO/IEC 9899:1990, also known as C90. (Yes, C89 and C90 are the exact same version.)

    C99 provides "static inline" with the exact same semantics as it is used in C++, and flexible array members (final member in a structure with length determined at compile time). I consider these features essential for C programming. I often use the __thread keyword (GNU extension for thread-local storage, _Thread_local in C11) and compiler-provided atomic and vector built-ins (provided by GCC, ICC, and Pathscale compilers at least).

    Microsoft is the only vendor whose C compiler does not support C99, to my knowledge. I stopped using their products seven years ago, so I wouldn't really know for sure, though. However, their compiler is, I understand, a combined C/C++ compiler, which means it is likely to support at least static inline (since it is C++).

    (Then there are also some ancient environments; you might one day need to program an existing factory robot or something, that only has an ANSI C compiler available.)

    So, the answer depends entirely on the gamut of systems you consider. I limit myself to sane POSIX systems, so to me, the code is perfectly portable. Windows programmers might disagree.

    Quote Originally Posted by thames View Post
    if yes, does inline make the program unportable?
    As it is just an optimization feature, more used to indicate programmer intent than anything else, you can always drop it, and make the code plain C89.

  7. #22
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193
    Quote Originally Posted by Nominal Animal View Post
    Windows programmers might disagree. As it is just an optimization feature, more used to indicate programmer intent than anything else, you can always drop it, and make the code plain C89.
    -_- I prefer to stick to C99.

    Edit: I didn't know ANSI refers to C89.
    Last edited by thames; 11-04-2012 at 10:44 AM.

  8. #23
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193
    Code:
    active[actives]++;
    .
    .
    .
    if (active[actives - 1] != pair(c)) {
                        fflush(stdout);
                        fprintf(stderr, "Line %lu: '%c', but expected '%c'.\n", line, c, pair(active[actives - 1]));
                    } else
                        actives--;
    Nominal, I didn't understand this section. Let's say I put inside the array:

    Code:
    active[0] = '(' 
    active[1] = '}' 
    active[2] = ...
    how can

    Code:
     active[actives-1]
    be parentheses if the array stopped at index 2? To me it would be curly brackets in consequence of

    Code:
     active[actives++] = c
    is it because

    Code:
     actives--
    ? if it is, I still don't understand since the routine only decrements if the pairs coincide.

  9. #24
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by thames View Post
    Nominal, I didn't understand this section.
    No, I think you got it correct.

    Initially, actives = 0. When the code encounters an open parens, brace, or bracket, (c == '(' || c == '[' || c == brace), it does
    Code:
        active[actives++] = c;
    The post-increment occurs after the assignment; in other words, the above does the exact same thing as
    Code:
        active[actives] = c;
        actives++;
    Therefore, if actives > 0, there is at least one open parens/brace/bracket, and the innermost one (latest encountered) is active[actives - 1]. Due to the increment happening after the assignment.

    If we look at the code you pointed out:
    Code:
                if (c == ')' || c == ']' || c == '}') {
                    if (actives < 1) {
                        fflush(stdout);
                        fprintf(stderr, "Line %lu: '%c' without a prior '%c'.\n", line, c, pair(c));
                    } else
                    if (active[actives - 1] != pair(c)) {
                        fflush(stdout);
                        fprintf(stderr, "Line %lu: '%c', but expected '%c'.\n", line, c, pair(active[actives - 1]));
                    } else
                        actives--;
                }
    The first inner if statement just verifies there is at least one open parens/brace/bracket, so we don't try to access a char before the actual array (that would be a buffer underrun bug).

    The second inner if statement can be rewritten as
    Code:
                    if (active[actives - 1] == pair(c))
                        actives--;
                    else /* error */
    As I said above, because actives gets incremented after the open parens/brace/bracket is added to the array, active[actives - 1] is always the innermost open one.

    Quote Originally Posted by thames View Post
    I still don't understand since the routine only decrements if the pairs coincide.
    Exactly!

    Remember, this code is the one that verifies C quotes, parentheses, brackets, and braces, the logic you mentioned in post #6. It does not implement the logic you specified in your initial post and what I described in post #3.

    I'm sorry I confused everyone; I just got engrossed with the tangential, and lost the main thread here.

    The code I tested my implementation of the logic in post #3 is below. It is very sparsely commented, since I did not intend to show it to anyone at any point, only quickly threw it together to test the logic I described. I haven't even checked it for any bugs, either. It is also limited to 1023 characters per content sequence.
    Code:
    #include <stdio.h>
    
    int main(void)
    {
        char        buffer[1024];
        char *const q = buffer - 1 + sizeof buffer;
        char       *p = buffer;
        int         head, tail, c;
    
        c = getc(stdin);
    
        while (1) {
    
            /* Whitespace state. */
            while (c == '\t' || c == '\n' || c == '\v' ||
                   c == '\f' || c == '\r' || c == ' ') {
                putc(c, stdout);
                c = getc(stdin);
            }
    
            /* No more input? */
            if (c == EOF)
                break;
    
            /* Head delimiter? */
            if (c == '(') {
                head = '(';
                tail = ')';
                c = getc(stdin);
    
            } else
            if (c == '{') {
                head = '{';
                tail = '}';
                c = getc(stdin);
    
            } else
            if (c == '[') {
                head = '[';
                tail = ']';
                c = getc(stdin);
    
            } else
            if (c == '<') {
                head = '<';
                tail = '>';
                c = getc(stdin);
    
            } else {
                head = EOF;
                tail = EOF; /* Unknown -- must not match any real characters! */
                /* We have the first content char in c already,
                 * so no c = getc(stdin) here. */
            }
    
            /* Clear buffer. */
            p = buffer;
    
            /* Content state. */
            while (c != EOF  && c != tail &&
                   c != '('  && c != '{'  && c != '[' && c != '<' &&
                   c != ')'  && c != '}'  && c != ']' && c != '>' &&
                   c != '\t' && c != '\n' && c != '\v' &&
                   c != '\f' && c != '\r' && c != ' ') {
    
                /* Add to buffer, if there is room. */
                if (p < q)
                    *(p++) = c;
    
                c = getc(stdin);
            }
    
            /* Add terminating NUL byte to the buffer. */
            *p = '\0';
    
            /* Did we expect a specific tail delimiter? */
            if (tail != EOF) {
                /* Yes. If we got it, consume it. */
                if (c == tail)
                    c = getc(stdin);
    
            } else {
                /* No, we don't know the delimiters yet.
                 * See if the tail delimiter defines the pair. */
                if (c == ')') {
                    head = '(';
                    tail = ')';
                    c = getc(stdin);
                } else
                if (c == ']') {
                    head = '[';
                    tail = ']';
                    c = getc(stdin);
                } else
                if (c == '}') {
                    head = '{';
                    tail = '}';
                    c = getc(stdin);
                } else
                if (c == '<') {
                    head = '<';
                    tail = '>';
                    c = getc(stdin);
                } else {
                    /* No delimiters. Use [ ] by default. */
                    head = '[';
                    tail = ']';
                    /* Since c is not a delimeter, we do not
                     * consume it here. */
                }
            }
    
            /* Return to whitespace state. */
            putc(head, stdout);
            fputs(buffer, stdout);
            putc(tail, stdout);
        }
    
        return 0;
    }
    In fact, if I knew I'd show it at some point, I would have written at least two helper functions: one to tell if a character is a non-content character, and another to return the paired delimiter (or none if not a paired delimiter); that should shorten the loop quite a bit, and make the code much easier to read.

  10. #25
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    This version of my program that implements the original logic described in this thread is probably easier to follow:
    Code:
    #include <stdio.h>
    
    /* Return nonzero (true) if c is an ASCII whitespace character.
    */
    static inline int is_whitespace(const int c)
    {
        return (c == '\t' || c == '\n' || c == '\v' ||
                c == '\f' || c == '\r' || c == ' ');
    }
    
    /* Return nonzero (true) if c is an open parens/brace/whatever.
    */
    static inline int is_opener(const int c)
    {
        return (c == '(' || c == '<' || c == '[' || c == '{');
    }
    
    /* Return nonzero (true) if c is an close parens/brace/whatever.
    */
    static inline int is_closer(const int c)
    {
        return (c == ')' || c == '>' || c == ']' || c == '}');
    }
    
    /* Return zero (false) if c is EOF, whitespace, opener, or closer.
    */
    static inline int is_content(const int c)
    {
        return (c != EOF &&
                !is_whitespace(c) &&
                !is_opener(c) &&
                !is_closer(c));
    }
    
    /* Return the close parens/brace/whatever that corresponds to c.
    */
    static inline int closer_of(const int c)
    {
        switch (c) {
        case '(': return ')';
        case '<': return '>';
        case '[': return ']';
        case '{': return '}';
        default:  return '\0';
        }
    }
    
    /* Return the open parens/brace/whatever that corresponds to c.
    */
    static inline int opener_of(const int c)
    {
        switch (c) {
        case ')': return '(';
        case '>': return '<';
        case ']': return '[';
        case '}': return '{';
        default:  return '\0';
        }
    }
    
    int main(void)
    {
        char        buffer[1024];
        char *const q = buffer - 1 + sizeof buffer;
        char       *p = buffer;
        int         head, tail, c;
    
        c = getc(stdin);
    
        while (1) {
    
            /*
             * State: Between content
            */
    
            while (is_whitespace(c)) {
                putc(c, stdout);
                c = getc(stdin);
            }
    
            /* No more input? */
            if (c == EOF)
                break;
    
            /*
             * Transition to state: Content
            */
    
            if (is_opener(c)) {
                /* Yes, save and consume. */
                head = c;
                tail = closer_of(c);
                c = getc(stdin);
    
            } else {
                /* We do not know the opener yet. */
                head = EOF;
                tail = EOF;
            }
    
            /* Clear buffer. */
            p = buffer;
    
            /*
             * State: Content
            */
    
            while (is_content(c)) {
    
                /* Add to buffer, if there is room. */
                if (p < q)
                    *(p++) = c;
    
                c = getc(stdin);
            }
    
            /*
             * Transition to state: Between content
            */
    
            /* Add terminating NUL byte to the buffer. */
            *p = '\0';
    
            /* Did we expect a specific tail delimiter? */
            if (is_closer(tail)) {
                /* Yes. If we got it, consume it. */
                if (c == tail)
                    c = getc(stdin);
    
            } else {
                /* No, we don't know the delimiters yet.
                 * See if the tail delimiter defines the pair. */
                if (is_closer(c)) {
                    /* It is a closer, so yes, it defines. */
                    head = opener_of(c);
                    tail = c;
                    c = getc(stdin);
                } else {
                    /* No, use [ ] by default. */
                    head = '[';
                    tail = ']';
                    /* Since c is not a delimeter, we do not
                     * consume it here. */
                }
            }
    
            /* Print the head delimiter, */
            putc(head, stdout);
    
            /* the content (if any; it may be empty), */
            fputs(buffer, stdout);
    
            /* and the tail delimiter. */
            putc(tail, stdout);
        }
    
        return 0;
    }
    If you read it side-by-side with the logic I described in post #3, it should match.

  11. #26
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193
    Remember, this code is the one that verifies C quotes, parentheses, brackets, and braces, the logic you mentioned in post #6. It does not implement the logic you specified in your initial post and what I described in post #3.

    I'm sorry I confused everyone; I just got engrossed with the tangential, and lost the main thread here.
    You don't need to feel sorry; I knew what the code was doing because I tested it.
    I will also analyse the easier program. Many thanks for your effort people.

  12. #27
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193
    char *const q = buffer - 1 + sizeof buffer;
    are you assigning to q the buffer minus the last index (\0') ?
    Last edited by thames; 11-04-2012 at 07:30 PM.

  13. #28
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193
    Code:
     p = buffer
    cleaning buffer ... I thought you were assigning again the address of buffer to the p pointer.

  14. #29
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by thames View Post
    are you assigning to q the buffer minus the last index (\0') ?
    Yes, q is a pointer to the last char in the buffer. I use it to avoid buffer overrun, while keeping one char reserved for the final '\0'.

    Quote Originally Posted by thames View Post
    Code:
     p = buffer
    cleaning buffer ... I thought you were assigning again the address of buffer to the p pointer.
    Exactly, but because p points to the end of the content in the buffer, it "clears" the buffer in the logical sense.

    Remember, comments should state the intent behind the code. Anyone can read the code and see what it does, but comments should state why.

    That assignment resets our buffer pointer (current end of content pointer) to the start of the buffer, thus "clearing" it, even though it does not modify the buffer contents. It does not have to modify anything, because there is an unconditional *p = '\0' before we use the buffer contents as a string.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. dsa signing and verifying
    By kakashi316 in forum C Programming
    Replies: 4
    Last Post: 05-06-2012, 02:21 PM
  2. Need help with a Sudoku verifying program
    By Skeeter in forum C Programming
    Replies: 3
    Last Post: 10-30-2009, 08:15 PM
  3. verifying image compression
    By elninio in forum C++ Programming
    Replies: 2
    Last Post: 06-17-2008, 07:36 PM
  4. Verifying int with while(!(cin >> num))
    By motarded in forum C++ Programming
    Replies: 3
    Last Post: 02-26-2006, 10:37 PM
  5. Verifying single digit input
    By Syked4 in forum C Programming
    Replies: 8
    Last Post: 05-31-2005, 07:11 PM

Tags for this Thread