Thread: Removing Comments

  1. #1
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193

    Removing Comments

    Good afternoon. I'm trying to remove comments from an input:


    Code:
    #include <stdio.h> 
    
    #define MAXLINE 100
    
    int getLine(char*, int);
    void removeComment(char*);
    
    int main() 
    { 
      char str[MAXLINE];
      while(getLine(str,MAXLINE) > 0)
        removeComment(str);  
      printf("String without comments: %s", str);   
      return 0;    
    }     
    
    int getLine(char* s, int lim)
    { 
       int i;
       char c;
       for(i = 0; i < lim - 1 && (c = getchar()) != EOF && c != '\n'; i++) 
        s[i] = c;
        
       if(c == '\n')
         s[i++] = c; 
       
       s[i] = '\0'; 
       return i; 
    }
    
    void removeComment(char* str) 
    { 
      int i;
      for(i = 0; *str != '\0'; i++)           
      { 
        if( (str[i] == '/' && str[i+1] == '*') || (str[i] == '*' && str[i+1] == '/') )
        { 
          *(str + i) = ' ';
          *(str + i + 1) = ' ';
        }            
      } 
    }
    Then I get:

    Code:
    Starting program: /home/thames/C/removecomments 
    /* this is a comment */   
    
    Breakpoint 1, removeComment (str=0x7fffffffe670 "/* this is a comment */\n") at removecomments.c:38
    38          *(str + i) = ' ';
    (gdb) cont
    Continuing.
    
    Breakpoint 2, removeComment (str=0x7fffffffe670 " * this is a comment */\n") at removecomments.c:39
    39          *(str + i + 1) = ' ';
    (gdb) cont
    Continuing.
    
    Breakpoint 1, removeComment (str=0x7fffffffe670 "   this is a comment */\n") at removecomments.c:38
    38          *(str + i) = ' ';
    (gdb) cont
    Continuing.
    
    Breakpoint 2, removeComment (str=0x7fffffffe670 "   this is a comment  /\n") at removecomments.c:39
    39          *(str + i + 1) = ' ';
    (gdb) cont
    Continuing.
    
    Program received signal SIGSEGV, Segmentation fault.
    0x00000000004006b0 in removeComment (str=0x7fffffffe670 "   this is a comment   \n") at removecomments.c:36
    36        if( (str[i] == '/' && str[i+1] == '*') || (str[i] == '*' && str[i+1] == '/') )
    (gdb)
    Why is what I'm doing illegal?

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I notice that your for loop condition is *str != '\0', but you never actually change str. Consequently, str[i] eventually goes out of bounds. Perhaps you intended to compare with str[i] != '\0' instead, or something like that.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193
    Code:
    void removeComment(char* str) 
    { 
      int i = 0;
      while(*str++ != '\0')
      { 
        if( (str[i] == '/' && str[i+1] == '*') || (str[i] == '*' && str[i+1] == '/') )
        { 
          *(str + i) = ' ';
          *(str + i + 1) = ' ';
        }            
      } 
    }
    now the string disappeared :/

    Code:
    String without comments:

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I think you need to decide: do you want to iterate over the string with a pointer or with an index? If you want to use a pointer, then get rid of the i index. If you want to use an index, then you should not be changing the value of str. (Well, you can do both, but you need to be extra careful, and there's no reason to complicate matters here.)
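
    As a rough illustration of that choice, here is the same marker-blanking loop written once with an index only and once with a pointer only. The names blankMarkersIndex and blankMarkersPointer are made up for this sketch. Both versions only blank the comment markers, as in the code posted so far; neither removes the comment text.

```c
#include <stddef.h>

/* Index version: str itself never changes, only i moves. */
void blankMarkersIndex(char *str)
{
    size_t i;

    for (i = 0; str[i] != '\0'; i++) {
        /* Reading str[i + 1] is safe: at worst it is the terminating '\0'. */
        if ((str[i] == '/' && str[i + 1] == '*') ||
            (str[i] == '*' && str[i + 1] == '/')) {
            str[i] = ' ';
            str[i + 1] = ' ';
        }
    }
}

/* Pointer version: no index at all, the pointer p walks the string. */
void blankMarkersPointer(char *str)
{
    char *p;

    for (p = str; *p != '\0'; p++) {
        if ((p[0] == '/' && p[1] == '*') ||
            (p[0] == '*' && p[1] == '/')) {
            p[0] = ' ';
            p[1] = ' ';
        }
    }
}
```

    Mixing the two, as in while(*str++ != '\0') combined with str[i], moves the base pointer and then indexes from the moved position, which is why it is so easy to get wrong.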

  5. #5
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Man, I think we really need a program to add comments.
    Mainframe assembler programmer by trade. C coder when I can.

  6. #6
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193

    Question

    Quote Originally Posted by laserlight
    and there's no reason to complicate matters here.)
    I was doing it for training.

    Code:
    void removeComment(char* str) 
    { 
      int i;
      for(i = 0; str[i] != '\0'; i++)
      { 
        if( (str[i] == '/' && str[i+1] == '*') || (str[i] == '*' && str[i+1] == '/') )
        { 
          *(str + i) = ' ';
          *(str + i + 1) = ' ';
        }            
      } 
    }
    Why is the output blank?

  7. #7
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by thames
    Why is the output blank?
    My guess is that the output is blank because you read line by line, and the last line is a blank line (or contained a comment marker that was removed). Notice that you only print the last line.
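
    A small sketch of that point, assuming the goal is to print every line. The loop in main ends exactly when getLine returns 0, and by then str has been overwritten with an empty string, so a printf placed after the loop has nothing left to show. Printing inside the loop avoids this. The helper name processLines and the FILE* parameters are assumptions, added only so the loop can be shown apart from main. getLine is adapted to read from a FILE* and to keep the character in an int so that EOF compares reliably. removeComment is still the marker-blanking version from the thread.

```c
#include <stdio.h>

#define MAXLINE 100

/* Read one line (including '\n') into s; return its length, 0 at EOF. */
static int getLine(FILE *in, char *s, int lim)
{
    int i = 0;
    int c = EOF;   /* int, not char, so EOF stays distinguishable */

    while (i < lim - 1 && (c = fgetc(in)) != EOF && c != '\n')
        s[i++] = (char)c;
    if (c == '\n')
        s[i++] = (char)c;
    s[i] = '\0';
    return i;
}

/* Marker-blanking version from the thread (does not remove comment text). */
static void removeComment(char *str)
{
    int i;

    for (i = 0; str[i] != '\0'; i++) {
        if ((str[i] == '/' && str[i + 1] == '*') ||
            (str[i] == '*' && str[i + 1] == '/')) {
            str[i] = ' ';
            str[i + 1] = ' ';
        }
    }
}

/* Clean every line and print each one as soon as it is ready. */
void processLines(FILE *in, FILE *out)
{
    char str[MAXLINE];

    while (getLine(in, str, MAXLINE) > 0) {
        removeComment(str);
        fputs(str, out);   /* print inside the loop, not after it */
    }
}
```

    Calling processLines(stdin, stdout) from main would give the behaviour of the original program, except that every line is printed.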

    Take a look at your function:
    Code:
    void removeComment(char* str)
    {
      int i;
      for(i = 0; str[i] != '\0'; i++)
      {
        if( (str[i] == '/' && str[i + 1] == '*') || (str[i] == '*' && str[i + 1] == '/') )
        {
          str[i] = ' ';
          str[i + 1] = ' ';
        }
      }
    }
    I have taken the liberty of converting to use array index notation. Now, the problem here is that your function does not remove comments. Rather, it removes comment markers by replacing them with spaces.

    When I think about the process of removing comments, I tend to think in terms of a state machine: at first, we begin scanning in a "non-comment" state: in this state, whatever input we get is kept as output, until an opening comment marker is detected. When an opening comment marker is detected, we enter a "comment" state: in this state, whatever input we get is discarded (or replaced by a space), until a closing comment marker is detected. When a closing comment marker is detected, we re-enter the "non-comment" state.

    One thing to note is that the state must persist between calls of the function since the opening and closing comment markers may be on different lines.
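
    One way the two-state idea could look when the program keeps reading line by line is sketched here. The extra inComment parameter is an assumption of this sketch, not part of the original function: the caller owns the flag, so the state survives between calls. Keeping newlines that fall inside a comment is also a choice made here to preserve the line structure. String literals are ignored, so a comment marker inside quotes is still treated as a comment.

```c
#include <stddef.h>

/*
 * Remove C comments from one line, in place.
 * *inComment carries the state across calls: nonzero means the
 * previous line ended inside a comment that is still open.
 */
void removeComment(char *str, int *inComment)
{
    size_t r = 0;   /* read position */
    size_t w = 0;   /* write position, never ahead of r */

    while (str[r] != '\0') {
        if (*inComment) {
            if (str[r] == '*' && str[r + 1] == '/') {
                *inComment = 0;        /* closing marker: back to non-comment */
                r += 2;
            } else {
                if (str[r] == '\n')
                    str[w++] = '\n';   /* keep line breaks inside comments */
                r++;                   /* discard everything else */
            }
        } else if (str[r] == '/' && str[r + 1] == '*') {
            *inComment = 1;            /* opening marker: enter comment state */
            r += 2;
        } else {
            str[w++] = str[r++];       /* non-comment state: keep the character */
        }
    }
    str[w] = '\0';
}
```

    The caller would declare int inComment = 0; once, before the reading loop, and pass &inComment on every call.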

  8. #8
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Removing comments from a source stream is easier to do if you write a very simple finite-state machine: read each input character one by one, decide what to do based on the state of your machine (usually just a loop), including what state to change to.

    If you consider C and C++ style comments, then you have the following states, and changes to a new state based on the character seen:
    Code:
        NORMAL_CODE: Within normal non-comment code
                / ⇒ AFTER_SLASH
                " ⇒ DOUBLE_QUOTED, note (A)
                ' ⇒ SINGLE_QUOTED, note (A)
           others ⇒ NORMAL_CODE, note (A)
    
        SINGLE_QUOTED: A substate of NORMAL_CODE, single-quoted strings
                ' ⇒ NORMAL_CODE, note(A)
           others ⇒ SINGLE_QUOTED, note (A)
    
        DOUBLE_QUOTED: A substate of NORMAL_CODE, double-quoted strings
                " ⇒ NORMAL_CODE, note(A)
           others ⇒ DOUBLE_QUOTED, note (A)
    
        AFTER_SLASH: After a single slash in normal code
                / ⇒ CPP_COMMENT
                * ⇒ C_COMMENT
           others ⇒ NORMAL_CODE, note (B)
    
        CPP_COMMENT: Within a // comment
          newline ⇒ NORMAL_CODE, note (C)
           others ⇒ CPP_COMMENT
          
        C_COMMENT: Within a /* comment
                * ⇒ C_COMMENT_ASTERISK
           others ⇒ C_COMMENT
    
        C_COMMENT_ASTERISK: After a * within a /* comment
                * ⇒ C_COMMENT_ASTERISK
                / ⇒ NORMAL_CODE
           others ⇒ C_COMMENT, note (D)
    
    
        Note (A): Output the current character before the transition,
                  so that you keep (don't filter out) NORMAL_CODE.
    
        Note (B): You need to output a slash before the transition, because
                  the slash that caused the original transition from NORMAL_CODE
                  to AFTER_SLASH was not output.
    
        Note (C): You should output a newline before the transition.
                  The newline is not part of the comment, really; the newline
                  ends the comment, and the line the comment was on.
    
        Note (D): No output needed. The asterisk was part of the comment,
                  and you skip comments.
    The arrows define transitions to new states (when a specific character is seen).

    To understand how the code works, open up an example input in a second window, keeping one finger on the current state above. Whenever you look at the next character of input, look below the state to see which state you need to move your finger to. The notes tell you if there are any side effects you also need to perform. Then you just keep doing that until there is no more input!

    One way to implement the above is a very straightforward loop. In pseudocode:
    Code:
    state = NORMAL_CODE
    
    Loop reading new char from input, until no more input:
    
        if state == NORMAL_CODE:
            if char == /:
                state = AFTER_SLASH
            else if char == ':
                output '
                state = SINGLE_QUOTED
            else if char == ":
                output "
                state = DOUBLE_QUOTED
            else:
                output char
    
        else if state == SINGLE_QUOTED:
            output char
            if char == ':
                state = NORMAL_CODE
    
        else if state == DOUBLE_QUOTED:
            output char
            if char == ":
                state = NORMAL_CODE
    
        else if state == AFTER_SLASH:
            if char == /:
                state = CPP_COMMENT
            else if char == *:
                state = C_COMMENT
            else if char == ':
                output /
                output '
                state = SINGLE_QUOTED
            else if char == ":
                output /
                output "
                state = DOUBLE_QUOTED
            else:
                output /
                output char
                state = NORMAL_CODE
    
        else if state == CPP_COMMENT:
            if char == newline:
                output newline
                state = NORMAL_CODE
    
        else if state == C_COMMENT:
            if char == *:
                state = C_COMMENT_ASTERISK
    
        else if state == C_COMMENT_ASTERISK:
            if char == /:
                state = NORMAL_CODE
            else if char == *:
                state = C_COMMENT_ASTERISK
            else:
                state = C_COMMENT
    
        else:
             state has an invalid value; abort
    
    End loop
    Note that SINGLE_QUOTED and DOUBLE_QUOTED are just special "sub-states" of NORMAL_CODE.

    AFTER_SLASH is such a "sub-state" of NORMAL_CODE too; all except the asterisk and slash character cases are shared with NORMAL_CODE. For simplicity, I duplicated the tests.

    In case you do understand the logic above, but are unsure on how to start, start with this:
    Code:
    #include <stdio.h>
    
    enum comment_states {
        NORMAL_CODE = 0,
        SINGLE_QUOTED,
        DOUBLE_QUOTED,
        AFTER_SLASH,
        CPP_COMMENT,
        C_COMMENT,
        C_COMMENT_ASTERISK
    };
    
    int main(void)
    {
        enum comment_states  state = NORMAL_CODE;
        int  c;
    
        while (EOF != (c = fgetc(stdin)))
            switch (state) {
    
            case AFTER_SLASH:
                if (c == '/') {
                    state = CPP_COMMENT;
                    break;
                } else
                if (c == '*') {
                    state = C_COMMENT;
                    break;
                }
                fputc('/', stdout);
                state = NORMAL_CODE;
                /* Fall through to NORMAL_CODE */
    
            case NORMAL_CODE:
                if (c == '/')
                    state = AFTER_SLASH;
                else {
                    fputc(c, stdout);
                    if (c == '"')
                        state = DOUBLE_QUOTED;
                    else
                    if (c == '\'')
                        state = SINGLE_QUOTED;
                }
                break;
    
            /* TODO: Other five cases */
    
            }
    
        return 0;
    }
    The above saves some code, because instead of duplicating the quoted cases in AFTER_SLASH, the above shares the same code by falling through to NORMAL_CODE in those cases.

    While the above is very hard to understand at the first go, the entire comment-removing program, including the ability to ignore slashes and asterisks in string constants, is just 80 lines of code (nicely formatted, not compacted, no tricks)!
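
    For comparison with your own attempt, one possible completion of the seven states is sketched here. It is not the author's 80-line program. To keep it easy to test, it works on strings instead of stdin/stdout, and the name strip_comments and its signature are assumptions of this sketch. Like the state table, it does not handle backslash-escaped quotes inside literals or backslash line continuations.

```c
enum comment_states {
    NORMAL_CODE = 0,
    SINGLE_QUOTED,
    DOUBLE_QUOTED,
    AFTER_SLASH,
    CPP_COMMENT,
    C_COMMENT,
    C_COMMENT_ASTERISK
};

/* Copy src to dst without comments; dst must hold strlen(src) + 1 bytes. */
void strip_comments(const char *src, char *dst)
{
    enum comment_states state = NORMAL_CODE;
    char c;

    while ((c = *src++) != '\0') {
        switch (state) {

        case AFTER_SLASH:
            if (c == '/') { state = CPP_COMMENT; break; }
            if (c == '*') { state = C_COMMENT; break; }
            *dst++ = '/';              /* note (B): emit the held-back slash */
            state = NORMAL_CODE;
            /* Fall through to NORMAL_CODE */

        case NORMAL_CODE:
            if (c == '/') {
                state = AFTER_SLASH;   /* hold the slash until we know more */
            } else {
                *dst++ = c;
                if (c == '"')
                    state = DOUBLE_QUOTED;
                else if (c == '\'')
                    state = SINGLE_QUOTED;
            }
            break;

        case SINGLE_QUOTED:
            *dst++ = c;
            if (c == '\'')
                state = NORMAL_CODE;
            break;

        case DOUBLE_QUOTED:
            *dst++ = c;
            if (c == '"')
                state = NORMAL_CODE;
            break;

        case CPP_COMMENT:
            if (c == '\n') {           /* note (C): the newline is kept */
                *dst++ = c;
                state = NORMAL_CODE;
            }
            break;

        case C_COMMENT:
            if (c == '*')
                state = C_COMMENT_ASTERISK;
            break;

        case C_COMMENT_ASTERISK:
            if (c == '/')
                state = NORMAL_CODE;
            else if (c != '*')
                state = C_COMMENT;     /* note (D): nothing to output */
            break;
        }
    }
    if (state == AFTER_SLASH)
        *dst++ = '/';                  /* input ended right after a lone slash */
    *dst = '\0';
}
```

    The same switch works unchanged inside a fgetc/fputc loop; only the reading of c and the writes to dst need to be swapped for stream calls.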

    Finite state machines are a very useful tool for any programmer. I do recommend reading about them, even if you cannot grasp them at first. They are fundamentally simple, and most people use relaxed versions of them in their real life instinctively, all the time. It just tends to be difficult at first to wrap your mind around the concepts.

    The hardest part is learning how to design a good state machine. The key, I think, is being very anal retentive about considering all cases, then switching to being as lazy as possible so you can trim out unneeded cases, often by folding them into existing ones.

  9. #9
    Tears of the stars thames's Avatar
    Join Date
    Oct 2012
    Location
    Rio, Brazil
    Posts
    193
    Thanks for explaining to me what a state machine is and what I can do with it. I understand that I can control the flow according to the state I'm in. Also, I can keep printing as I go thanks to the streams provided by stdin and stdout. I'll follow laserlight's advice about replacing the comments with blanks or discarding the input. By the way, how can I discard it? With fflush(stdin)? Many thanks.
    Last edited by thames; 10-30-2012 at 07:05 PM.

  10. #10
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,907
    fflush(stdin)
    Absolutely not. fflush is only defined for output streams; calling it on an input stream such as stdin is undefined behavior in standard C.

    When you are in the state where you are reading the comment section, don't save/print it. Just check to see if the end of the comment is found. Once it is found, go back to the other state.

    Remember that you still have to read the input in comment mode to see if the */ is found.
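
    In code, "discarding" can simply mean reading characters and not writing them anywhere. A minimal sketch, assuming the opening /* has already been consumed; the name skipComment is hypothetical:

```c
#include <stdio.h>

/*
 * Read and throw away characters up to and including the closing
 * comment marker. Returns 1 if the marker was found, 0 if the input
 * ended first.
 */
int skipComment(FILE *in)
{
    int c;
    int prev = 0;   /* previous character, 0 before the first read */

    while ((c = fgetc(in)) != EOF) {
        if (prev == '*' && c == '/')
            return 1;   /* end of comment: the caller resumes copying */
        prev = c;
    }
    return 0;
}
```

    Nothing is flushed or rewound; the next fgetc after a successful call simply returns the first character after the comment.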
    Fact - Beethoven wrote his first symphony in C
