[C] remove comments

**Tool** · 11-18-2009

OK, so I added this to the program:

--dquote: state in which comments do not get removed; \n ends the dquote state, a backslash skips the next character (stays in dquote)
--squote: state in which comments do not get removed; \n switches it back to the normal state
--replaces \r\n with \n and \r with \n
--trigraph replacement before the main while loop: replaces all the trigraphs with the appropriate characters. The way I handled this is to replace a sequence such as ??< with __{ (two blanks and then {). Should I approach this problem differently, or is this fine?

What I didn't really understand about Aksel's post is the line merging.

Code:

#include <stdio.h>
#include <stdlib.h>

#define MAX_SIZE 10000

int main(int argc, char *argv[])
{
    
     /* tmp:
        K&R2 1-23: Write a program to remove all comments from a C program.
        Don't forget to handle quoted strings and character constants
        properly. C comments do not nest.
     */
      
     int c;
     int x=0;                  
     char array[MAX_SIZE];
     enum states { normal,
                   comment,
                   dquote, dquote_escape,
                   squote };
     int state = normal;
     int valid=1;
         
     /*
     FILE *f;
     f = fopen("zad88.c", "r");
     if (f == NULL) return 1;
     */ 
        
     
     /*while((c=fgetc(f))!=EOF) */
     while((c=getchar())!=EOF)
     {
        /* leave room for the trailing '\n' and '\0' added below */
        if(x<MAX_SIZE-2)
           array[x++]=c;
        else
           return 1;
     }
     
     array[x]='\n';
     array[++x]='\0';
     
     int pom=x;
     x=0;
     
     
     /* trigraph replacement */
     while(x!=pom)
     {
        if(array[x] == '\?' && array[x+1]=='\?') 
        {
           x+=2;
           valid=1;
           
           switch(array[x])
           {
               
               case '=':
               array[x]='#';  break;
               
               case '/':
               array[x]='\\'; break;
               
               case '\'':
               array[x]='^'; break;
               
               case '(':
               array[x]='['; break;
               
               case ')':
               array[x]=']'; break;
               
               case '!':
               array[x]='|'; break;
               
               case '<':
               array[x]='{'; break;
               
               case '>':
               array[x]='}'; break;
               
               case '-':
               array[x]='~'; break;
               
               default:
               valid=0;
               x-=2;
           }
           
           if(valid==1)
           array[x-1]=array[x-2]=' ';
        }
        x++;
     }
     
     x=0;
     
     while(x!=pom)
     {
        
        if(array[x]=='/' && array[x+1] == '*' && state == normal) { array[x]=' '; state = comment; x++;}
        else if(array[x]=='*' && array[x+1] == '/' && state == comment) { array[x]=array[x+1]=' '; state = normal; x++; }
     
        else if(array[x]=='"' && state == normal) { state = dquote; }
        else if(array[x]=='\\' && state == dquote) { state = dquote_escape; }
        else if(state == dquote_escape) { state = dquote; }
        else if(state == dquote && array[x]=='\n') { state = normal; }
        else if(array[x]=='"' && state == dquote) { state = normal; }
        
        else if(array[x] == '\'' && state == normal) { state = squote; }
        else if(array[x] == '\n' && state == squote) { state = normal; }
        else if(array[x] == '\'' && state == squote) { state = normal; }
        
        /* test "\r\n" first; the lone '\r' test would otherwise shadow it */
        else if(array[x] == '\r' && array[x+1]=='\n') { array[x]='\n'; array[x+1]=' '; }
        else if(array[x] == '\r') { array[x]='\n'; }
                
        
        if(state == comment)
        array[x]=' ';
        
        x++;
     } 
        
     
     printf("%s\n", array);   
     
     
     
 
  printf("Press any key to continue.\n");	
  getchar();
  return 0;
}

**King Mir** · 11-18-2009

Originally Posted by Tool

--trigraph replacement before the main while loop: replaces all the trigraphs with the appropriate characters. The way I handled this is to replace a sequence such as ??< with __{ (two blanks and then {). Should I approach this problem differently, or is this fine?

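Nope, this won't work. Padding with blanks does not work in string literals, because the blanks end up inside the string. It does not work when ??=??= or #??= is used in macros, because the blanks split the ## operator. It does not work when ??!??! or |??! is used for boolean or, because the blanks split the || operator.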
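
One way around the blank-padding problem is to write the single replacement character and close the gap, so nothing extra ends up inside string literals or between the two halves of ## and ||. A minimal sketch, assuming the whole input already sits in a NUL-terminated buffer as in the program above; replace_trigraphs and trigraph_char are hypothetical names, not something from the thread:

```c
#include <stddef.h>

/* Hypothetical helper: maps the third character of a candidate trigraph
   to its replacement, or returns 0 if the sequence is not a trigraph. */
static char trigraph_char(char c)
{
    switch (c) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Rewrites buf in place: each trigraph becomes one character and the
   rest of the text is shifted left. buf must be NUL-terminated.
   Returns the new length. */
size_t replace_trigraphs(char *buf)
{
    size_t in = 0, out = 0;

    while (buf[in] != '\0') {
        char r = 0;

        /* buf[in + 2] is only read when buf[in + 1] is '?', so the
           terminating NUL is never skipped over. */
        if (buf[in] == '?' && buf[in + 1] == '?')
            r = trigraph_char(buf[in + 2]);

        if (r != 0) {
            buf[out++] = r;
            in += 3;
        } else {
            buf[out++] = buf[in++];
        }
    }
    buf[out] = '\0';
    return out;
}
```

Because the write position never overtakes the read position, no second buffer is needed. The output is shorter than the input, so any saved length has to be updated from the return value.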

**Tool** · 11-19-2009

Is there a site that has all the conditions when trigraphs work, and when not? Like some standard?

**Aksel** · 11-19-2009

The standard:

http://www.open-std.org/jtc1/sc22/wg...docs/n1336.pdf

Compilers often leave trigraphs off unless explicitly enabled. Why? Because they are annoying and rarely used.

Trigraphs belong to translation phase 1, so they come before pretty much everything else.

I do not recall ever having seen source code containing trigraphs (unless deliberately obfuscated in general).

**laserlight** · 11-19-2009

Originally Posted by Aksel

The standard:

A draft of the C standard, actually.

Originally Posted by Aksel

I do not recall ever having seen source code containing trigraphs (unless deliberately obfuscated in general).

Same here. In fact, I would have to look them up if I did see them in source code, assuming that I recognised them as such.

**Tool** · 11-19-2009

Code:

5.2.1.1 Trigraph sequences
1 Before any other processing takes place, each occurrence of one of the following
sequences of three characters (called trigraph sequences) is replaced with the
corresponding single character.
??= #
??( [
??/ \
??) ]
??' ^
??< {
??! |
??> }
??- ~
No other trigraph sequences exist. Each ? that does not begin one of the trigraphs listed
above is not changed.
2 EXAMPLE 1
??=define arraycheck(a, b) a??(b??) ??!??! b??(a??)
becomes
#define arraycheck(a, b) a[b] || b[a]
3 EXAMPLE 2 The following source line
printf("Eh???/n");
becomes (after replacement of the trigraph sequence ??/)
printf("Eh?\n");

I have added a trigraph check and changed it a bit. It now works just as intended, according to the standard above.

Other than this, I don't know what else I could add to the program. Any ideas?

**Aksel** · 11-19-2009

Originally Posted by laserlight

A draft of the C standard, actually.

True, I could not find a link for the real deal, probably because it costs $$$. I haven't got it either. I am a cheapskate :-).

Originally Posted by laserlight

Same here. In fact, I would have to look them up if I did see them in source code, assuming that I recognised them as such.

Exactly. In C++ you also have the digraphs to keep the maintainers on their toes.

**laserlight** · 11-19-2009

Originally Posted by Aksel

True, I could not find a link for the real deal, probably because it costs $$$. I haven't got it either. I am a cheapskate :-).

Well, if you really do want the real deal legally, you can purchase a PDF copy from the ANSI online store. But generally a draft will suffice for Tool's purposes (and the draft that you linked to is more recent than the most recent published version of the standard).

**Aksel** · 11-19-2009

ANSI online store

Yes, if only I could figure out which one to buy. If I base my decision on price, it would be this one:

American National Standards Institute - ANSI eStandards Store

But there is also this one:

BS ISO/IEC 9899:1999 Programming languages. C

I wonder what the difference is?

Never mind, the draft is good enough for me.

**laserlight** · 11-19-2009

Originally Posted by Aksel

I wonder what the difference is?

The latter is a hardcopy. Save the trees!

**Aksel** · 11-19-2009

They do not sell hard copies at all, so it must be something else.

**brewbuck** · 11-19-2009

Trigraphs are processed at the lowest level, before tokenization. Thus, trigraph translation occurs even inside string literals and character literals. Digraphs, on the other hand, are only translated within contexts where a macro expansion could occur -- that does NOT include string literals and character literals.

The best design would probably be a fundamental GetNextChar() function which performs trigraph expansion (this will involve some lookahead and ungetting). Layered on top of this would be a tokenizer which grabs tokens (handling digraphs, character and string literals appropriately) as well as the whitespace itself (so that you can precisely reproduce the original whitespace). The tokenizer would return the source token verbatim, unless it is a comment, in which case it would just continue to the next token.
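
A rough sketch of such a GetNextChar() layer, under the assumption that input comes from a FILE *. Standard ungetc() only guarantees one character of pushback, while recognising a trigraph needs two characters of lookahead, so the sketch keeps its own small pushback stack. The names struct reader, raw_get, raw_unget, trigraph_char and get_next_char are made up for illustration:

```c
#include <stdio.h>

/* Hypothetical reader: an input stream plus a two-slot pushback stack. */
struct reader {
    FILE *in;
    int pushed[2];
    int npushed;
};

/* Reads one raw character, preferring pushed-back characters. */
static int raw_get(struct reader *r)
{
    return r->npushed > 0 ? r->pushed[--r->npushed] : getc(r->in);
}

/* Pushes one character back; at most two are ever pending. */
static void raw_unget(struct reader *r, int c)
{
    r->pushed[r->npushed++] = c;
}

/* Maps the third character of a candidate trigraph to its replacement,
   or returns 0 if it does not complete a trigraph. */
static int trigraph_char(int c)
{
    switch (c) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Returns the next character after trigraph replacement, or EOF. */
int get_next_char(struct reader *r)
{
    int c1, c2, c3, t;

    c1 = raw_get(r);
    if (c1 != '?')
        return c1;

    c2 = raw_get(r);
    if (c2 != '?') {
        raw_unget(r, c2);
        return '?';
    }

    c3 = raw_get(r);
    t = trigraph_char(c3);
    if (t != 0)
        return t;

    /* Not a trigraph: push the lookahead back in reverse order so it is
       read again as c2, then c3. */
    raw_unget(r, c3);
    raw_unget(r, c2);
    return '?';
}
```

A tokenizer layered on top would call get_next_char() instead of getchar() and would never see a trigraph, including inside string and character literals.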

**Tool** · 11-19-2009

I haven't really read anything about the ungetc function, so I'm not really familiar with it.

I did the trigraph replacement through an array rather than with while((c=getchar())!=EOF), and it works just as the standard suggests.

Would you mind writing some pseudocode for this part?

The best design would probably be a fundamental GetNextChar() function

**Tool** · 11-21-2009

I think I'm done with everything in this exercise; I just didn't add the code for octal/hexadecimal character constants.

Looks pretty much perfected to me...

If anyone feels like testing it and trying to find an error, here's the code (I would appreciate any feedback from a pro):

Code:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    
     int c;
     enum states { normal,
                   squote, squote2, squote3, squote4, squote_escape,
                   comment_entry, comment, comment_exit, comment_exit2, comment2,
                   dquote, dquote_escape,
                   octal, hexa};
     int state = normal; 
     int second; /* stores the tmp character after slash */
     int flag=0; /* if 1, puts the character stored in second */
     
     while((c=getchar())!=EOF)
     {
         if(state == normal && c=='\'') { state = squote; }
         else if(state == normal && c=='"') { state = dquote; }
         else if(state == normal && c=='/')
         { 
              second=getchar(); 
              if(second=='*') 
              {
                 ungetc(c, stdin); 
                 state = comment; 
              }
              
              else if(second=='/') 
              {
                 ungetc(c, stdin);
                 state = comment2; 
              }
              
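              /* the quote held in 'second' must be printed too, so set the flag */
              else if(second=='\'') { state = squote; flag=1; }
              else if(second=='"') { state = dquote; flag=1; }
              
              /* do not print EOF as if it were a character */
              else if(second!=EOF) flag=1;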
         }    
      
        
         else if(state == comment && c=='*') { state = comment_exit; }
         else if(state == comment) {}
         
         else if(state == comment_exit && c=='*') {}
         else if(state == comment_exit && c=='/') { putchar(' '); state = comment_exit2; }
         else if(state == comment_exit) { state = comment; }
         
         else if(state == comment2 && c=='\n') { state = normal; }
         else if(state == comment2) { }         
         
         else if(state == squote && c=='\n') {  state = normal; }
         else if(state == squote && c=='\t') { state = squote4; }
         else if(state == squote && c=='\'') { state = normal; }
         else if(state == squote && c=='\\') { state = squote_escape; }
         else if(state == squote) { state = squote2; }
         
         else if(state == squote_escape && c=='\n') { state = normal; }
         else if(state == squote_escape && c=='\t') { state = squote4; }
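         else if(state == squote_escape && (c>='0' && c<='7')) { state = octal;  }
         else if(state == squote_escape && c=='x') { state = hexa; }
         else if(state == squote_escape) { state = squote2; }
         
         /* minimal exit from octal/hexa so the machine cannot get stuck there;
            the digits themselves are not validated */
         else if((state == octal || state == hexa) && (c=='\'' || c=='\n')) { state = normal; }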
         
         else if(state == squote2 && c=='\'') { state = normal; }
         else if(state == squote2 && c=='\n') { state = normal; }
         else if(state == squote2 && c=='\t') { state = squote4; }
         else if(state == squote2 && c=='\\') { state = squote3; }
         else if(state == squote2) { state = squote4; }
         
         else if(state == squote3 && c=='\n') { state = normal; }
         else if(state == squote3 && c=='\t') { state = squote4; }
         else if(state == squote3) { state = squote4; }
         
         else if(state == squote4 && c=='\\') { state = squote3; }
         else if(state == squote4 && c=='\n') { state = normal; }
         else if(state == squote4 && c=='\t') { }
         else if(state == squote4 && c=='\'') { state = normal; }
         else if(state == squote4) { }
         
         else if(state == dquote && c=='\n') { state = normal; }
         else if(state == dquote && c=='"') { state = normal; }
         else if(state == dquote && c=='\\') { state = dquote_escape; }
         else if(state == dquote) { }
         
         else if(state ==dquote_escape) { state = dquote; }
         
         if(state == normal || 
         state == squote || state == squote2 || state == squote3 || state == squote4 || state == squote_escape ||
         state == dquote || state == dquote_escape || state == octal || state == hexa && 
         state!=comment_exit2) 
         putchar(c); 
         
         if(state==comment_exit2) { state = normal; }     
         if(flag==1) { flag=0; putchar(second); }
         
      }
     
                    
    printf("Press any key to continue");   
    getchar();	
    return 0;
}

**Aksel** · 11-22-2009

Originally Posted by Tool

I think I'm done with everything in this exercise; I just didn't add the code for octal/hexadecimal character constants.

Looks pretty much perfected to me...

It has improved a lot. It does not handle this contrived situation:

Code:

// Look, I am being clever, using line-continuation to turn a single-line comment\
into a multi-line comment.

GCC will warn you about such multi-line comments if you use "-Wall".
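
Line continuation is translation phase 2: each backslash immediately followed by a newline is deleted, joining the two physical lines, before comments are recognised. A minimal sketch of that step on a NUL-terminated buffer (splice_lines is a hypothetical name); running something like it before the comment state machine would turn the example above into a single // comment:

```c
#include <stddef.h>

/* Deletes every backslash that is immediately followed by a newline,
   together with that newline, shifting the rest of the text left.
   buf must be NUL-terminated. Returns the new length. */
size_t splice_lines(char *buf)
{
    size_t in = 0, out = 0;

    while (buf[in] != '\0') {
        if (buf[in] == '\\' && buf[in + 1] == '\n')
            in += 2;                  /* drop the pair, join the lines */
        else
            buf[out++] = buf[in++];
    }
    buf[out] = '\0';
    return out;
}
```

One side effect of this simple approach is that the output has fewer lines than the input, so line numbers no longer match the original file.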

If you use -Wall, GCC will also warn you about

Code:

    if(state == normal || 
    state == squote || state == squote2 || state == squote3 || state == squote4 || state == squote_escape ||
    state == dquote || state == dquote_escape || state == octal || state == hexa && 
    state!=comment_exit2)

I think it is safe to remove the last part, "&& state!=comment_exit2", which is a given: && binds tighter than ||, so it only applies to state == hexa anyway.
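
To illustrate the grouping, here is a cut-down sketch with only three of the states; the enum and both function names are hypothetical and only mirror the shape of the condition in question. The first function spells out with explicit parentheses how the compiler groups the unparenthesised test, the second drops the redundant part:

```c
/* Reduced, hypothetical copy of the state list. */
enum state { normal, hexa, comment_exit2 };

/* How "s == normal || s == hexa && s != comment_exit2" is grouped:
   the && applies to the hexa test only. */
int should_print_as_parsed(enum state s)
{
    return s == normal || (s == hexa && s != comment_exit2);
}

/* If s == hexa holds, s != comment_exit2 is always true, so the
   second operand of && can be dropped without changing the result. */
int should_print_simplified(enum state s)
{
    return s == normal || s == hexa;
}
```

Both functions return the same value for every state, which is why removing the trailing test does not change what the program prints.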