Originally Posted by
phantomotap
I decided to confirm my assumption, but on reading your post again I guess you didn't aim in that direction?
You're absolutely right, sorry!
I took off on the tangent thames stated in post #6, but neglected to mention it. Ouch.
Before post #6, the code and logic I posted target the problem thames stated in the original post in this thread.
After post #6, my code and logic target the K&R problem thames mentioned: "Write a program to check a C program for rudimentary syntax errors like unbalanced parentheses, brackets and braces. Don't forget about quotes, both single and double, escape sequences, and comments."
Apologies for derailing the thread.
Originally Posted by
phantomotap
To see what I'm talking about put a single '\'' character on line 151 (That would be before applying the `else' fix which I did for consistency.). The error you get is unrelated to the '\'' character because the code treats several lines as being part of the character literal.
Yes. There are two reasons why that happens.
First, the state machine progresses through the code, and assumes the code thus far is correct. When it encounters a single quote, it must assume it is the start of a character literal.
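Second, the state machine treats string literals and character literals exactly the same way; it does not make any assumptions about the length or contents of the character literal. This is truly a bug, because character literals are not allowed to contain unescaped newlines. (Standard C does not allow them in string literals either, although some compilers have accepted that as an extension; the fix in the code below only adds the check for character literals.)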
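To illustrate the second point, here is a minimal sketch, separate from the program below. The helper literal_length() and its stop_at_newline flag are hypothetical names made up for this example. It measures how much input a quoted literal swallows, with and without the newline check:

```c
#include <stddef.h>

/* Hypothetical helper, not part of the posted program.
 * src[0] must be the opening quote. Returns the number of characters
 * consumed by the literal, including the closing quote when one is found.
 * If stop_at_newline is nonzero, an unescaped newline ends the literal
 * early; *terminated is set to 1 only when a closing quote was found. */
static size_t literal_length(const char *src, int stop_at_newline, int *terminated)
{
    const char quote = src[0];
    size_t i = 1;

    *terminated = 0;
    while (src[i] != '\0') {
        if (src[i] == '\\' && src[i + 1] != '\0') {
            i += 2;             /* skip the escaped character */
            continue;
        }
        if (src[i] == quote) {
            *terminated = 1;
            return i + 1;       /* include the closing quote */
        }
        if (src[i] == '\n' && stop_at_newline)
            return i;           /* give up at the end of the line */
        i++;
    }
    return i;
}
```

Given a stray quote followed by two more lines of code, the call without the newline check runs on until the next quote it finds on a later line, which is why the reported error ends up far from the real mistake. With the check, the literal is abandoned at the end of its own line.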
Fixing the bug does seem to make it yield correct output. Adding fflush(stdout); just before outputting any error message also helps pinpoint the location of the error, since the message should then appear right after the character that made the machine recognize the error.
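The reason the flush matters is that stdout is normally buffered while stderr is not, so without it the diagnostic can show up before the echoed source text that triggered it. A small sketch of the pattern follows; report_error() is a hypothetical wrapper for illustration, whereas the program below simply calls fflush() and fprintf() inline:

```c
#include <stdio.h>

/* Hypothetical helper, not part of the posted program.
 * Flushes the echoed source text first, then prints the diagnostic,
 * so that on a terminal the message follows the offending character.
 * Returns the fprintf() result, or -1 if the flush failed. */
static int report_error(FILE *echo, FILE *diag, unsigned long line, const char *message)
{
    if (fflush(echo) == EOF)
        return -1;
    return fprintf(diag, "Line %lu: %s\n", line, message);
}
```

A call such as report_error(stdout, stderr, line, "Too deep nesting."); would stand in for each fflush/fprintf pair, assuming the message needs no extra format arguments.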
Let me also say a big thank you. Not only have you pointed out at least two bugs in my code already, but you have made me think about the state machine from a different perspective, not just from the viewpoints that come to me naturally. I do my best work in a team where that happens and is encouraged; where discovering holes, bugs, and weaknesses in one's implementation is genuinely appreciated. I certainly appreciate it; my code can always benefit from another pair of sharp eyes.
If you or anybody else has the time and interest to test it, here is the latest version of my code with both fixes and the enhancements included:
Code:
#include <stdio.h>

#define MAX_ACTIVES 1024

enum comment_states {
    NORMAL_CODE = 0,
    SINGLE_QUOTED,
    DOUBLE_QUOTED,
    AFTER_SLASH,
    CPP_COMMENT,
    C_COMMENT,
    C_COMMENT_ASTERISK
};

static unsigned long line = 1UL;

/* Read one character from stdin, echo it to stdout, and count lines.
 * All newline conventions (\n, \r, \n\r, \r\n) are returned as '\n'. */
static inline int next(void)
{
    int c;

    c = getc(stdin);
    if (c == '\n') {
        line++;
        c = getc(stdin);
        if (c != '\r') {
            ungetc(c, stdin);
            fputc('\n', stdout);
        } else
            fputs("\n\r", stdout);
        return '\n';
    } else
    if (c == '\r') {
        line++;
        c = getc(stdin);
        if (c != '\n') {
            ungetc(c, stdin);
            fputc('\r', stdout);
        } else
            fputs("\r\n", stdout);
        return '\n';
    } else
    if (c != EOF) {
        fputc(c, stdout);
        return c;
    }
    return EOF;
}

/* Return the bracket that matches c, or '\0' if c is not a bracket. */
static inline int pair(const int c)
{
    switch (c) {
    case '(': return ')';
    case ')': return '(';
    case '[': return ']';
    case ']': return '[';
    case '{': return '}';
    case '}': return '{';
    default:  return '\0';
    }
}

int main(void)
{
    enum comment_states state = NORMAL_CODE;
    char active[MAX_ACTIVES];   /* stack of currently open brackets */
    int actives = 0;
    int c;

    while (EOF != (c = next()))
        switch (state) {

        case AFTER_SLASH:
            if (c == '/') {
                state = CPP_COMMENT;
                break;
            } else
            if (c == '*') {
                state = C_COMMENT;
                break;
            }
            state = NORMAL_CODE;
            /* fall through: the slash was ordinary code, so handle c normally */

        case NORMAL_CODE:
            if (c == '/')
                state = AFTER_SLASH;
            else
            if (c == '"')
                state = DOUBLE_QUOTED;
            else
            if (c == '\'')
                state = SINGLE_QUOTED;
            else
            if (c == '(' || c == '[' || c == '{') {
                if (actives >= MAX_ACTIVES) {
                    fflush(stdout);
                    fprintf(stderr, "Line %lu: Too deep nesting.\n", line);
                    return 1;
                }
                active[actives++] = c;
            } else
            if (c == ')' || c == ']' || c == '}') {
                if (actives < 1) {
                    fflush(stdout);
                    fprintf(stderr, "Line %lu: '%c' without a prior '%c'.\n", line, c, pair(c));
                } else
                if (active[actives - 1] != pair(c)) {
                    fflush(stdout);
                    fprintf(stderr, "Line %lu: '%c', but expected '%c'.\n", line, c, pair(active[actives - 1]));
                } else
                    actives--;
            }
            break;

        case SINGLE_QUOTED:
            if (c == '\\')
                next();     /* skip the escaped character */
            else
            if (c == '\'')
                state = NORMAL_CODE;
            else
            if (c == '\n') {
                /* next() has already counted this newline, hence line - 1 */
                fflush(stdout);
                fprintf(stderr, "Line %lu: Missing terminating ' character.\n", line - 1UL);
                state = NORMAL_CODE;
            }
            break;

        case DOUBLE_QUOTED:
            if (c == '\\')
                next();     /* skip the escaped character */
            else
            if (c == '\"')
                state = NORMAL_CODE;
            break;

        case CPP_COMMENT:
            if (c == '\n')
                state = NORMAL_CODE;
            break;

        case C_COMMENT:
            if (c == '*')
                state = C_COMMENT_ASTERISK;
            break;

        case C_COMMENT_ASTERISK:
            if (c == '/')
                state = NORMAL_CODE;
            else
            if (c != '*')
                state = C_COMMENT;
            break;
        }

    fflush(stdout);

    if (state == C_COMMENT || state == C_COMMENT_ASTERISK)
        fprintf(stderr, "Line %lu: Expected end of comment */ before end of input.\n", line);
    else
    if (state == DOUBLE_QUOTED)
        fprintf(stderr, "Line %lu: Expected '\"' before end of input.\n", line);
    else
    if (state == SINGLE_QUOTED)
        fprintf(stderr, "Line %lu: Expected '\'' before end of input.\n", line);

    /* Report every bracket that is still open, innermost first. */
    while (actives > 0)
        fprintf(stderr, "Line %lu: Expected '%c' before end of input.\n", line, pair(active[--actives]));

    return 0;
}
While the code is nowhere near a C lexer (scanner), to me it is surprisingly compact for what it achieves.
Considering the C syntax, I think that if you wanted to add new features like detecting missing semicolons, it would be easier to overhaul the state machine, and make it into a genuine C parser/scanner instead. The state map would become completely different, but the basic structure should not be that much different.
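As a rough idea of what the first step of such an overhaul might look like, here is a minimal sketch of a token-level scanner. The enum token_kind and the function next_token() are hypothetical names for this illustration only; comments, literals, and multi-character operators are deliberately left out, so this is nowhere near a usable C scanner:

```c
#include <ctype.h>

/* Token kinds for this illustration only; a real C scanner needs many more. */
enum token_kind { TOK_END, TOK_IDENT, TOK_NUMBER, TOK_PUNCT };

/* Hypothetical helper: skips whitespace, classifies the token that starts
 * at *p, advances *p past it, and returns the token kind. */
static enum token_kind next_token(const char **p)
{
    const char *s = *p;

    while (isspace((unsigned char)*s))
        s++;
    if (*s == '\0') {
        *p = s;
        return TOK_END;
    }
    if (isalpha((unsigned char)*s) || *s == '_') {
        while (isalnum((unsigned char)*s) || *s == '_')
            s++;
        *p = s;
        return TOK_IDENT;
    }
    if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s))
            s++;
        *p = s;
        return TOK_NUMBER;
    }
    *p = s + 1;             /* any other character is one punctuation token */
    return TOK_PUNCT;
}
```

Once the input is a stream of tokens rather than characters, a check like "a statement must end in a semicolon" becomes a question about token sequences, which is where a proper grammar starts to pay off.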
C syntax is complex enough that I'd actually turn to a parser generator. For example, GNU Bison can generate the C, C++, or Java code needed, and you can find syntax files for Bison to generate a C parser on the net.