Thread: [C] remove comments

  1. #31
    Registered User
    Join Date
    Jul 2009
    Location
    Croatia
    Posts
    272
    Ok. So I added this to the program:

    --dquote state in which comments do not get removed -- \n ends the dquote state, a backslash skips the next character (stays in dquote)
    --squote state in which comments do not get removed -- \n switches it back to normal state
    --replaces \r\n with a \n and \r with a \n
    --trigraph replacement before the main while loop: replaces all the trigraphs with the appropriate characters. The way I handled it was to replace each sequence, for example ??< with __{ --> 2 blanks and then {. Should I approach this problem differently, or is this fine?

    What I didn't really understand about Aksel's post is the line merging.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    #define MAX_SIZE 10000
    
    int main(int argc, char *argv[])
    {
        
         /* tmp:
            K&R2 1-23: Write a program to remove all comments from a C program.
            Don't forget to handle quoted strings and character constants
            properly. C comments do not nest.
         */
          
         int c;
         int x=0;                  
         char array[MAX_SIZE];
         enum states { normal,
                       comment,
                       dquote, dquote_escape,
                       squote };
         int state = normal;
         int valid=1;
             
         /*
         FILE *f;
         f = fopen("zad88.c", "r");
         if (f == NULL) return 1;
         */ 
            
         
         /*while((c=fgetc(f))!=EOF) */
         while((c=getchar())!=EOF)
         if(x<MAX_SIZE-2) /* leave room for the '\n' and '\0' appended below */
         array[x++]=c;
         else
         return 1;
         
         array[x]='\n';
         array[++x]='\0';
         
         int pom=x;
         x=0;
         
         
         /* trigraph replacement */
         while(x!=pom)
         {
            if(array[x] == '\?' && array[x+1]=='\?') 
            {
               x+=2;
               valid=1;
               
               switch(array[x])
               {
                   
                   case '=':
                   array[x]='#';  break;
                   
                   case '/':
                   array[x]='\\'; break;
                   
                   case '\'':
                   array[x]='^'; break;
                   
                   case '(':
                   array[x]='['; break;
                   
                   case ')':
                   array[x]=']'; break;
                   
                   case '!':
                   array[x]='|'; break;
                   
                   case '<':
                   array[x]='{'; break;
                   
                   case '>':
                   array[x]='}'; break;
                   
                   case '-':
                   array[x]='~'; break;
                   
                   default:
                   valid=0;
                   x-=2;
               }
               
               if(valid==1)
               array[x-1]=array[x-2]=' ';
            }
            x++;
         }
         
         x=0;
         
         while(x!=pom)
         {
            
            if(array[x]=='/' && array[x+1] == '*' && state == normal) { array[x]=' '; state = comment; x++;}
            else if(array[x]=='*' && array[x+1] == '/' && state == comment) { array[x]=array[x+1]=' '; state = normal; x++; }
         
            else if(array[x]=='"' && state == normal) { state = dquote; }
            else if(array[x]=='\\' && state == dquote) { state = dquote_escape; }
            else if(state == dquote_escape) { state = dquote; }
            else if(state == dquote && array[x]=='\n') { state = normal; }
            else if(array[x]=='"' && state == dquote) { state = normal; }
            
            else if(array[x] == '\'' && state == normal) { state = squote; }
            else if(array[x] == '\n' && state == squote) { state = normal; }
            else if(array[x] == '\'' && state == squote) { state = normal; }
            
            else if(array[x] == '\r' && array[x+1]=='\n') { array[x]='\n'; array[x+1]=' '; } /* must be tested before the lone \r case */
            else if(array[x] == '\r') { array[x]='\n'; }
                    
            
            if(state == comment)
            array[x]=' ';
            
            x++;
         } 
            
         
         printf("%s\n", array);   
         
         
         
     
      printf("Press any key to continue.\n");	
      getchar();
      return 0;
    }
    Last edited by Tool; 11-18-2009 at 02:03 PM.

  2. #32
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Quote Originally Posted by Tool View Post
    --trigraph replacement before the main while loop: replaces all the trigraphs with the appropriate characters. The way I handled it was to replace each sequence, for example ??< with __{ --> 2 blanks and then {. Should I approach this problem differently, or is this fine?
    Nope, this won't work. It does not work in literal strings. It does not work when ??=??= or #??= is used in Macros. It does not work when ??!??! or |??! is used for boolean or.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  3. #33
    Registered User
    Join Date
    Jul 2009
    Location
    Croatia
    Posts
    272
    Is there a site that has all the conditions when trigraphs work, and when not? Like some standard?

  4. #34
    Registered User
    Join Date
    Nov 2009
    Posts
    7
    The standard:

    http://www.open-std.org/jtc1/sc22/wg...docs/n1336.pdf

    Compilers often leave trigraphs off unless explicitly enabled. Why? Because they are annoying and rarely used.

    Trigraphs belong to translation phase 1, so they come before pretty much everything else.

    I do not recall ever having seen source code containing trigraphs (unless deliberately obfuscated in general).
    Last edited by laserlight; 11-19-2009 at 06:50 AM. Reason: Fixed link URL.

  5. #35
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Aksel
    The standard:
    A draft of the C standard, actually.

    Quote Originally Posted by Aksel
    I do not recall ever having seen source code containing trigraphs (unless deliberately obfuscated in general).
    Same here. In fact, I would have to look them up if I did see them in source code, assuming that I recognised them as such.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  6. #36
    Registered User
    Join Date
    Jul 2009
    Location
    Croatia
    Posts
    272
    Code:
    5.2.1.1 Trigraph sequences
    1 Before any other processing takes place, each occurrence of one of the following
    sequences of three characters (called trigraph sequences) is replaced with the
    corresponding single character.
    ??= #
    ??( [
    ??/ \
    ??) ]
    ??' ^
    ??< {
    ??! |
    ??> }
    ??- ~
    No other trigraph sequences exist. Each ? that does not begin one of the trigraphs listed
    above is not changed.
    2 EXAMPLE 1
    ??=define arraycheck(a, b) a??(b??) ??!??! b??(a??)
    becomes
    #define arraycheck(a, b) a[b] || b[a]
    3 EXAMPLE 2 The following source line
    printf("Eh???/n");
    becomes (after replacement of the trigraph sequence ??/)
    printf("Eh?\n");
    I have added a trigraph check and changed it a bit. It now works just as intended, according to the standard above.

    Other than this, I don't know what else I could add to the program. Any ideas?
    Last edited by Tool; 11-19-2009 at 07:06 AM.
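
    The replacement table quoted above can be applied in a single pass over the buffer. The following is only a sketch under assumptions: the function names replace_trigraphs and trigraph_char are made up for illustration, and the whole input is assumed to already sit in a NUL-terminated array. Unlike padding with blanks, each trigraph shrinks from three characters to one, so sequences such as two adjacent trigraphs end up adjacent in the output as well.

```c
#include <stddef.h>

/* Map the third character of a candidate trigraph to its replacement.
   Returns 0 when the sequence is not one of the nine trigraphs. */
static char trigraph_char(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Replace trigraphs in the NUL-terminated string s, in place.
   The write index w never overtakes the read index r, because every
   replacement makes the text shorter. Returns the new length. */
size_t replace_trigraphs(char *s)
{
    size_t r = 0, w = 0;

    while (s[r] != '\0') {
        char rep = 0;

        /* s[r + 2] is only inspected when s[r + 1] is a '?',
           so the terminating NUL is never read past. */
        if (s[r] == '?' && s[r + 1] == '?')
            rep = trigraph_char(s[r + 2]);

        if (rep != 0) {
            s[w++] = rep;   /* three characters become one */
            r += 3;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
    return w;
}
```

    Calling replace_trigraphs(array) once, before the comment-removal loop, keeps the two passes separate. A question mark that does not start a trigraph is copied unchanged, as the quoted paragraph requires.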

  7. #37
    Registered User
    Join Date
    Nov 2009
    Posts
    7
    Quote Originally Posted by laserlight View Post
    A draft of the C standard, actually.
    True, I could not find a link for the real deal, probably because it costs $$$. I haven't got it either. I am a cheapskate :-).

    Quote Originally Posted by laserlight View Post
    Same here. In fact, I would have to look them up if I did see them in source code, assuming that I recognised them as such.
    Exactly. In C++ you also have the digraphs to keep the maintainers on their toes.

  8. #38
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Aksel
    True, I could not find a link for the real deal, probably because it costs $$$. I haven't got it either. I am a cheapskate :-).
    Well, if you really do want the real deal legally, you can purchase a PDF copy from the ANSI online store. But generally a draft will suffice for Tool's purposes (and the draft that you linked to is more recent than the most recent published version of the standard).

  9. #39
    Registered User
    Join Date
    Nov 2009
    Posts
    7
    Yes, if only I could figure out which one to buy. If I base my decision on price, it would be this one:

    American National Standards Institute - ANSI eStandards Store

    But there is also this one:

    BS ISO/IEC 9899:1999 Programming languages. C

    I wonder what the difference is?

    Nevermind, the draft is good enough for me.
    Last edited by Aksel; 11-19-2009 at 08:13 AM.

  10. #40
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Aksel
    I wonder what the difference is?
    The latter is a hardcopy. Save the trees!

  11. #41
    Registered User
    Join Date
    Nov 2009
    Posts
    7
    They do not sell hard copies at all, so it must be something else.

  12. #42
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Trigraphs are processed at the lowest level, before tokenization. Thus, trigraph translation occurs even inside string literals and character literals. Digraphs, on the other hand, are only translated within contexts where a macro expansion could occur -- that does NOT include string literals and character literals.

    The best design would probably be a fundamental GetNextChar() function which performs trigraph expansion (this will involve some lookahead and ungetting). Layered on top of this would be a tokenizer which grabs tokens (handling digraphs, character and string literals appropriately) as well as the whitespace itself (so that you can precisely reproduce the original whitespace). The tokenizer would return the source token verbatim, unless it is a comment, in which case it would just continue to the next token.
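
    The lowest layer described above might look roughly like this. It is a sketch, not a finished design: struct reader, raw_get, raw_unget and get_next_char are names invented here, and a small private pushback stack is used instead of ungetc because the standard only guarantees one character of pushback per stream, while a failed trigraph match has to give back two.

```c
#include <stdio.h>

/* Hypothetical reader: an input stream plus a two-slot pushback stack. */
struct reader {
    FILE *in;
    int pushed[2];
    int npushed;
};

/* Next raw character: pushed-back characters first, then the stream. */
static int raw_get(struct reader *rd)
{
    if (rd->npushed > 0)
        return rd->pushed[--rd->npushed];
    return getc(rd->in);
}

/* Give a character back. EOF is never stored. */
static void raw_unget(struct reader *rd, int c)
{
    if (c != EOF && rd->npushed < 2)
        rd->pushed[rd->npushed++] = c;
}

/* Third character of a candidate trigraph -> replacement, or 0. */
static int trigraph_char(int third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Return the next character with trigraphs already expanded. */
int get_next_char(struct reader *rd)
{
    int c1, c2, c3, rep;

    c1 = raw_get(rd);
    if (c1 != '?')
        return c1;

    c2 = raw_get(rd);
    if (c2 != '?') {
        raw_unget(rd, c2);
        return c1;
    }

    c3 = raw_get(rd);
    rep = trigraph_char(c3);
    if (rep != 0)
        return rep;

    /* Not a trigraph: push back in reverse order so c2 is read next. */
    raw_unget(rd, c3);
    raw_unget(rd, c2);
    return c1;
}
```

    A caller would set up struct reader rd = { stdin, {0, 0}, 0 }; and then loop on get_next_char(&rd) exactly as it would on getchar(). The tokenizer layer never sees a raw trigraph, so it needs no special cases for them inside string or character literals.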
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  13. #43
    Registered User
    Join Date
    Jul 2009
    Location
    Croatia
    Posts
    272
    I haven't really read anything about the ungetc function, so I'm not really familiar with it.

    I did the trigraph replacement through an array, not with while((c=getchar())!=EOF), and it works just as the standard suggests.

    Would you mind writing some pseudocode for this part?
    "The best design would probably be a fundamental GetNextChar() function"

  14. #44
    Registered User
    Join Date
    Jul 2009
    Location
    Croatia
    Posts
    272
    I think I'm done with everything in this exercise; I just didn't add the code for octal/hexadecimal character constants.

    Looks pretty much perfected to me...

    If anyone feels like testing it and trying to find an error, here's the code:
    (I would appreciate any feedback from a pro.)
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    int main(int argc, char *argv[])
    {
        
         int c;
         enum states { normal,
                       squote, squote2, squote3, squote4, squote_escape,
                       comment_entry, comment, comment_exit, comment_exit2, comment2,
                       dquote, dquote_escape,
                       octal, hexa};
         int state = normal; 
         int second; /* stores the tmp character after slash */
         int flag=0; /* if 1, puts the character stored in second */
         
         while((c=getchar())!=EOF)
         {
             if(state == normal && c=='\'') { state = squote; }
             else if(state == normal && c=='"') { state = dquote; }
             else if(state == normal && c=='/')
             { 
                  second=getchar(); 
                  if(second=='*') 
                  {
                     ungetc(c, stdin); 
                     state = comment; 
                  }
                  
                  else if(second=='/') 
                  {
                     ungetc(c, stdin);
                     state = comment2; 
                  }
                  
                  else if(second=='\'') { state = squote; flag=1; } /* flag=1 so the quote itself is still printed */
                  else if(second=='"') { state = dquote; flag=1; }
                  
                  else flag=1;
             }    
          
            
             else if(state == comment && c=='*') { state = comment_exit; }
             else if(state == comment) {}
             
             else if(state == comment_exit && c=='*') {}
             else if(state == comment_exit && c=='/') { putchar(' '); state = comment_exit2; }
             else if(state == comment_exit) { state = comment; }
             
             else if(state == comment2 && c=='\n') { state = normal; }
             else if(state == comment2) { }         
             
             else if(state == squote && c=='\n') {  state = normal; }
             else if(state == squote && c=='\t') { state = squote4; }
             else if(state == squote && c=='\'') { state = normal; }
             else if(state == squote && c=='\\') { state = squote_escape; }
             else if(state == squote) { state = squote2; }
             
             else if(state == squote_escape && c=='\n') { state = normal; }
             else if(state == squote_escape && c=='\t') { state = squote4; }
             else if(state == squote_escape && (c>='0' && c<='7')) { state = octal;  }
             else if(state == squote_escape && c=='x') { state = hexa; }
             else if(state == squote_escape) { state = squote2; }
             
             else if(state == squote2 && c=='\'') { state = normal; }
             else if(state == squote2 && c=='\n') { state = normal; }
             else if(state == squote2 && c=='\t') { state = squote4; }
             else if(state == squote2 && c=='\\') { state = squote3; }
             else if(state == squote2) { state = squote4; }
             
             else if(state == squote3 && c=='\n') { state = normal; }
             else if(state == squote3 && c=='\t') { state = squote4; }
             else if(state == squote3) { state = squote4; }
             
             else if(state == squote4 && c=='\\') { state = squote3; }
             else if(state == squote4 && c=='\n') { state = normal; }
             else if(state == squote4 && c=='\t') { }
             else if(state == squote4 && c=='\'') { state = normal; }
             else if(state == squote4) { }
             
             else if(state == dquote && c=='\n') { state = normal; }
             else if(state == dquote && c=='"') { state = normal; }
             else if(state == dquote && c=='\\') { state = dquote_escape; }
             else if(state == dquote) { }
             
             else if(state ==dquote_escape) { state = dquote; }
             
             if(state == normal || 
             state == squote || state == squote2 || state == squote3 || state == squote4 || state == squote_escape ||
             state == dquote || state == dquote_escape || state == octal || state == hexa && 
             state!=comment_exit2) 
             putchar(c); 
             
             if(state==comment_exit2) { state = normal; }     
             if(flag==1) { flag=0; putchar(second); }
             
          }
         
                        
        printf("Press any key to continue");   
        getchar();	
        return 0;
    }
    Last edited by Tool; 11-21-2009 at 11:38 AM.
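
    For comparison, the same idea can be written with fewer states and a switch. This is a sketch only, with assumptions worth stating: strip_comments and the state names are invented for illustration, the input is a NUL-terminated string rather than stdin, each block comment is replaced by one space, and the newline that ends a // comment is kept. Like the program above, it does not handle trigraphs or backslash-newline.

```c
#include <stddef.h>

enum state { CODE, SLASH, BLOCK, BLOCK_STAR, LINE,
             STR, STR_ESC, CHR, CHR_ESC };

/* Copy src to dst with comments removed.
   dst must have room for strlen(src) + 1 bytes. */
void strip_comments(const char *src, char *dst)
{
    enum state st = CODE;

    for (; *src != '\0'; src++) {
        char c = *src;

        switch (st) {
        case CODE:
            if (c == '/') {
                st = SLASH;             /* may start a comment */
            } else {
                if (c == '"') st = STR;
                else if (c == '\'') st = CHR;
                *dst++ = c;
            }
            break;
        case SLASH:
            if (c == '*') st = BLOCK;
            else if (c == '/') st = LINE;
            else {
                *dst++ = '/';           /* it was a plain slash */
                st = CODE;
                src--;                  /* look at c again as code */
            }
            break;
        case BLOCK:
            if (c == '*') st = BLOCK_STAR;
            break;
        case BLOCK_STAR:
            if (c == '/') { *dst++ = ' '; st = CODE; }
            else if (c != '*') st = BLOCK;
            break;
        case LINE:
            if (c == '\n') { *dst++ = c; st = CODE; }
            break;
        case STR:
            *dst++ = c;
            if (c == '\\') st = STR_ESC;
            else if (c == '"') st = CODE;
            break;
        case STR_ESC:
            *dst++ = c;                 /* escaped character, copied as is */
            st = STR;
            break;
        case CHR:
            *dst++ = c;
            if (c == '\\') st = CHR_ESC;
            else if (c == '\'') st = CODE;
            break;
        case CHR_ESC:
            *dst++ = c;
            st = CHR;
            break;
        }
    }
    if (st == SLASH)
        *dst++ = '/';                   /* input ended right after a slash */
    *dst = '\0';
}
```

    Because any escaped character inside a literal is simply copied, octal and hexadecimal escapes need no states of their own here: the digits that follow the backslash are ordinary literal characters as far as comment removal is concerned.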

  15. #45
    Registered User
    Join Date
    Nov 2009
    Posts
    7
    Quote Originally Posted by Tool View Post
    I think I'm done with everything in this exercise; I just didn't add the code for octal/hexadecimal character constants.

    Looks pretty much perfected to me...
    It has improved a lot. It does not handle this contrived situation:

    Code:
    // Look, I am being clever, using line-continuation to turn a single-line comment\
    into a multi-line comment.
    GCC will warn you about such multi-line comments if you use "-Wall".
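
    One way to cover that case is to do the line merging as its own pass before looking for comments, which is what the standard does in translation phase 2. The sketch below assumes the source is already in a NUL-terminated buffer and that \r\n has been normalised to \n. splice_lines is a made-up name.

```c
#include <stddef.h>

/* Delete every backslash that is immediately followed by a newline,
   together with that newline, so that a continued line becomes one
   logical line. Works in place and returns the new length. */
size_t splice_lines(char *s)
{
    size_t r = 0, w = 0;

    while (s[r] != '\0') {
        if (s[r] == '\\' && s[r + 1] == '\n')
            r += 2;             /* drop the pair */
        else
            s[w++] = s[r++];
    }
    s[w] = '\0';
    return w;
}
```

    After this pass the continued // comment sits on a single line, so the ordinary "skip until newline" rule removes all of it. The trade-off is that the output no longer has the same line breaks as the input. If that matters, the dropped newlines would have to be counted and re-emitted after the logical line.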

    If you use -Wall, GCC will also warn you about

    Code:
        if(state == normal || 
        state == squote || state == squote2 || state == squote3 || state == squote4 || state == squote_escape ||
        state == dquote || state == dquote_escape || state == octal || state == hexa && 
        state!=comment_exit2)
    I think it is safe to remove the last part "&& state != comment_exit2": since && binds tighter than ||, it only qualifies state == hexa, where it is always true.
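
    To make the grouping concrete, here is a small illustration with explicit parentheses. The two function names are invented for this example. The first shows how an expression of the form a || b && c is parsed, the second shows what the line layout in the quoted condition might suggest.

```c
#include <stdbool.h>

/* How the compiler groups "a || b && c": && binds tighter than ||. */
bool grouped_by_compiler(bool a, bool b, bool c)
{
    return a || (b && c);
}

/* What the layout of the condition seems to intend. */
bool grouped_by_layout(bool a, bool b, bool c)
{
    return (a || b) && c;
}
```

    With a true and both b and c false, the first function returns true and the second returns false, so the two readings really are different expressions. Writing the parentheses explicitly also silences the GCC warning.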
