Removing comments from a source stream is easier to do if you write a very simple finite-state machine: read each input character one by one, decide what to do based on the state of your machine (usually just a loop), including what state to change to.
If you consider C and C++ style comments, then you have the following states, and changes to a new state based on the character seen:
Code:
NORMAL_CODE: Within normal non-comment code
/ ⇒ AFTER_SLASH
" ⇒ DOUBLE_QUOTED, note (A)
' ⇒ SINGLE_QUOTED, note (A)
others ⇒ NORMAL_CODE, note (A)
SINGLE_QUOTED: A substate of NORMAL_CODE, single-quoted strings
' ⇒ NORMAL_CODE, note(A)
others ⇒ SINGLE_QUOTED, note (A)
DOUBLE_QUOTED: A substate of NORMAL_CODE, double-quoted strings
" ⇒ NORMAL_CODE, note(A)
others ⇒ DOUBLE_QUOTED, note (A)
AFTER_SLASH: After a single slash in normal code
/ ⇒ CPP_COMMENT
* ⇒ C_COMMENT
others ⇒ NORMAL_CODE, note (B)
CPP_COMMENT: Within a // comment
newline ⇒ NORMAL_CODE
* ⇒ C_COMMENT
others ⇒ CPP_COMMENT, note (C)
C_COMMENT: Within a /* comment
* ⇒ C_COMMENT_ASTERISK
others ⇒ C_COMMENT
C_COMMENT_ASTERISK: After a * within a /* comment
* ⇒ C_COMMENT_ASTERISK
/ ⇒ NORMAL_CODE
others ⇒ C_COMMENT, note (D)
Note (A): Output the current character before the transition,
so that you keep (don't filter out) NORMAL_CODE.
Note (B): You need to output a slash before the transition, because
the slash that caused the original transition from NORMAL_CODE
to AFTER_SLASH was not output.
Note (C): You should output a newline before the transition.
The newline is not part of the comment, really; the newline
ends the comment, and the line the comment was on.
Note (D): No output needed. The asterisk was part of the comment,
and you skip comments.
The arrows define transitions to ne states (when a specific character is seen).
To understand how the code works, open up a example input in a second window, keeping one finger in the state above. Whenever you look at the next character of input, look below the state to see which state you need to move your finger to. The notes tell you if there are any side effects you also need to do. Then you just keep doing that until there is no more input!
One way to implement the above is a very straightforward loop. In pseudocode:
Code:
state = NORMAL_CODE
Loop reading new char from input, until no more input:
if state == NORMAL_CODE:
if char == /:
state = AFTER_SLASH
else if char == ':
output '
state = SINGLE_QUOTED
else if char == ":
output "
state = DOUBLE_QUOTED
else:
output char
if state == SINGLE_QUOTED:
output char
if char == ':
state = NORMAL_CODE
if state == DOUBLE_QUOTED:
output char
if char == ":
state = NORMAL_CODE
else if state == AFTER_SLASH:
if char == /:
state = CPP_COMMENT
else if char == *:
state = C_COMMENT
else if char == ':
output /
output '
state = SINGLE_QUOTED
else if char == ":
output /
output "
state = DOUBLE_QUOTED
else:
output /
output char
state = NORMAL_CODE
else if state == CPP_COMMENT:
if char == newline:
state = NORMAL_CODE
else if state == C_COMMENT:
if char == *:
state = C_COMMENT_ASTERISK
else if state == C_COMMENT_ASTERISK
if char == /:
state = NORMAL_CODE
else if char == *:
state = C_COMMENT_ASTERISK
else:
state = C_COMMENT
else:
state has an invalid value; abort
End loop
Note that SINGLE_QUOTED and DOUBLE_QUOTED are just special "sub-states" of NORMAL_CODE.
AFTER_SLASH is such a "sub-state" of NORMAL_CODE too; all except the asterisk and slash character cases are shared with NORMAL_CODE. For simplicity, I duplicated the tests.
In case you do understand the logic above, but are unsure on how to start, start with this:
Code:
#include <stdio.h>
enum comment_states {
NORMAL_CODE = 0,
AFTER_SLASH,
CPP_COMMENT,
C_COMMENT,
C_COMMENT_ASTERISK
};
int main(void)
{
enum comment_states state = NORMAL_CODE;
int c;
while (EOF != (c = fgetc(stdin)))
switch (state) {
case AFTER_SLASH:
if (c == '/') {
state = CPP_COMMENT;
break;
} else
if (c == '*') {
state = C_COMMENT;
break;
}
fputc('/', stdout);
state = NORMAL_CODE;
/* Fall through to NORMAL_CODE */
case NORMAL_CODE:
if (c == '/')
state = AFTER_SLASH;
else {
fputc(c, stdout);
if (c == '"')
state = DOUBLE_QUOTED;
else
if (c == '\'')
state = SINGLE_QUOTED;
}
break;
/* TODO: Other five cases */
}
return 0;
}
The above saves some code, because instead of duplicating the quoted cases in AFTER_SLASH, the above shares the same code by falling through to NORMAL_CODE in those cases.
While the above is very hard to understand at the first go, the entire comment-removing program, including the ability to ignore slashes and asterisks in string constants, is just 80 lines of code (nicely formatted, not compacted, no tricks)!
Finite state machines are a very useful tool for any programmer. I do recommend reading about them, even if you cannot grasp them at first. They are fundamentally simple, and most people use relaxed versions of them in their real life instinctively, all the time. It just tends to be difficult at first to wrap your mind around the concepts.
The hardest part is learning how to design a good state machine. The key, I think, is being very anal retentive about considering all cases, then switch to being as lazy as possible so you can trim out unneeded cases, often by folding them into existing ones.