Program drops ws and other tokens


  1. #1
    Caution: Wet Floor
    Join Date
    May 2006
    Posts
    55

    Program drops ws and other tokens

    This is supposed to strip C- and C++-style comments from a "program" read from standard input; the input need not be a working program. The output should be whatever was passed to stdin, minus the comments.

    The program recognizes comments, but it also drops other tokens and discards whitespace:

    Code:
    #include <iostream>
    #include <string>
    using namespace std;
    
    enum Token_value {
    
      CL, CR, CPP,
      S='/', A='*', Q ='"', N='\n',
      END
    
    };
    
    Token_value curr_tok = END;
    
    bool cppcomment;
    bool comment;
    bool literal;
    
    string buf;
    
    Token_value get_token();
    
    void remove() {
    
      curr_tok = get_token();
    
      switch(curr_tok) {
      case S:
        if(curr_tok == get_token() && !comment && !literal) {
          curr_tok = CPP;
          cppcomment = true;
          comment = true;
        }
        if(curr_tok == A && !comment && !literal){
          curr_tok = CL;
          comment = true;
        }
        break;
      case A:
        get_token();
        if(curr_tok == S && !comment && !literal) {
          curr_tok = CR;
          comment = false;
        }
        break;
      case Q:
        if(literal) literal = false;
        else literal = true;
        break;
    
      case N:
        if(cppcomment) {
          cppcomment = false;
          comment = false;
        }
      default:
        break;
      }
    
    }
    
    
    Token_value get_token() {
    
      char ch = 0;
    
      do {
        if(!cin.get(ch)) return curr_tok = END;
      } while (ch != '\n' && isspace(ch));
    
      switch(ch) {
    
      case '/':
        return curr_tok=S;
      case '*':
        return curr_tok=A;
      case '"':
        return curr_tok=Q;
      case '\n':
        return curr_tok=N;
    
     default:
        if(!comment) buf.push_back(ch);
      }
    }
    
    int main() {
    
      cin.unsetf(ios::skipws);
    
      while(cin) {
        remove();
        if(curr_tok==END) break;
      }
      cout << buf << '\n';
    }
    Example session
    Code:
    example% ./strip
    int main() {
    
    // This is a comment
    "This is /*not*/ a //comment"
    whitespace should be preserved here
    }
    
    output:
    intmain(){Thisisnotacommentwhitespaceshouldbepreservedhere}
    Other problems:

    -- Selection statements have confusing conditions;
    -- too many global variables for this program.

    Thanks ; )

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677


    Are you INTENTIONALLY making this hard to read by using enum, or is it your belief that this is more readable?

    I would much prefer to see the following (as an example, similar changes should be done throughout the code):
    Code:
     switch(curr_tok) {
      case '/':
        if(curr_tok == get_token() && !comment && !literal) {
          curr_tok = CPP;
          cppcomment = true;
          comment = true;
        }
        if(curr_tok == '*' && !comment && !literal){
          curr_tok = CL;
          comment = true;
        }
        break;
      case '*':
        get_token();
        if(curr_tok == '/' && !comment && !literal) {
          curr_tok = CR;
          comment = false;
        }
        break;
    ....
    Token_value get_token() {
    
      char ch = 0;
    
      do {
        if(!cin.get(ch)) return curr_tok = END;
      } while (ch != '\n' && isspace(ch));
    
      switch(ch) {
    
      case '/':
      case '*':
      case '"':
      case '\n':
        curr_tok = ch;
        return ch;
    
     default:
        if(!comment) buf.push_back(ch);
      }
    }
    If you feel that you MUST use symbols, at the very least make them understandable, e.g. "Quote" instead of "Q", "Slash" instead of "S", etc.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Caution: Wet Floor
    Join Date
    May 2006
    Posts
    55
    Sorry.

    I'm just curious why this thing doesn't preserve the symbols ('\n', '*', '/', etc.) when writing to stdout.

  4. #4
    The larch
    Join Date
    May 2006
    Posts
    3,573
    That is easy:
    Code:
      switch(ch) {
    
      case '/':
        return curr_tok=S;
      case '*':
        return curr_tok=A;
      case '"':
        return curr_tok=Q;
      case '\n':
        return curr_tok=N;
    
     default:
        if(!comment) buf.push_back(ch);
      }
    Which characters end up in buf?

    As to why whitespace is not preserved... Well, this loop is designed to throw away whitespace:
    Code:
      do {
        if(!cin.get(ch)) return curr_tok = END;
      } while (ch != '\n' && isspace(ch));
    Edit:
    This program is broken in many more ways. For example, get_token doesn't return anything in the default case, which is undefined behavior for a non-void function. The output seems to depend a lot on what gets returned in this case (e.g., returning curr_tok is not good).
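
    A minimal sketch of a get_token that avoids both problems, under some assumptions: the descriptive enum names, the extra `ch` out-parameter, and the `echo` driver are hypothetical and not from the original program. It does not skip whitespace, returns a value on every path, and leaves storing the character to the caller, so symbols are no longer lost.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Descriptive names are an assumption; the original enum used S, A, Q, N.
enum Token_value {
  Other,
  Slash = '/',
  Star = '*',
  Quote = '"',
  Newline = '\n',
  End = 256  // outside the range of any char value
};

// Reads one character into ch and classifies it. Whitespace is not
// skipped, and every path returns a value.
Token_value get_token(std::istream& in, char& ch) {
  if (!in.get(ch)) return End;
  switch (ch) {
  case '/':  return Slash;
  case '*':  return Star;
  case '"':  return Quote;
  case '\n': return Newline;
  default:   return Other;
  }
}

// Hypothetical driver: copies the input unchanged, to show that no
// character is dropped when the caller does the storing.
std::string echo(const std::string& text) {
  std::istringstream in(text);
  std::string out;
  char ch = 0;
  while (get_token(in, ch) != End) {
    out.push_back(ch);  // every character is kept, symbols included
  }
  return out;
}
```

    The comment-stripping logic would then decide, per token, whether to keep `ch`; it is left out here to keep the sketch short.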
    Last edited by anon; 12-13-2007 at 12:36 PM.
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  5. #5
    Caution: Wet Floor
    Join Date
    May 2006
    Posts
    55
    Thanks. I threw the first version out and started over, but ended up with something just as awkward:

    Code:
    #include <iostream>
    #include <string>
    using namespace std;
    
    #define BACK '\177' // backspace
    
    char curr_tok = 0;
    char prev_tok = 0;
    
    struct Flags {
      bool c;
      bool cpp;
    };
    
    string buf="";
    
    int main() {
    
      char ch = 0;
      int level_c = 0;
      bool quote = false;
    
      Flags comment = {false, false};
    
      while(cin.get(ch)) {
    
        curr_tok = ch;
        if(!comment.c && !comment.cpp) {
          buf.push_back(curr_tok);
        }
    
        switch(ch) {
    
        case '/':
          if(curr_tok == prev_tok && quote == false) {
            comment.cpp = true;
            buf.push_back(BACK);
            buf.push_back(BACK);
          }
    
          if(prev_tok == '*' && quote == false) {
            comment.c = false;
    
          }
          break;
        case '*':
          if(prev_tok == '/' && quote == false) {
            comment.c = true;
            buf.push_back(BACK);
            buf.push_back(BACK);
          }
          break;
    
        case '"':
          if(!quote && prev_tok != '\\') {
            quote = true;
          }
         else if(prev_tok != '\\') quote = false;
    
        case '\n':
          if(comment.cpp) buf.push_back(curr_tok);
          comment.cpp = false;
    
        }
    
        if(quote == true) {
          comment.c = false;
          comment.cpp = false;
        }
        prev_tok = curr_tok;
      }
    
      cout << buf;
    }
    What could be cleaned up here?

  6. #6
    The larch
    Join Date
    May 2006
    Posts
    3,573
    You are putting backspaces into your output rather than just erasing unwanted characters?
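
    A small sketch of erasing from the buffer instead, assuming a hypothetical helper named `drop_last` that would replace each pair of `buf.push_back(BACK)` calls:

```cpp
#include <cstddef>
#include <string>

// Removes up to n characters from the end of buf.
// drop_last is a hypothetical helper, not part of the posted program.
void drop_last(std::string& buf, std::size_t n) {
  if (n > buf.size()) n = buf.size();  // never erase more than is stored
  buf.erase(buf.size() - n);           // erase from that position to the end
}
```

    In the posted loop both characters of `//` or `/*` are already in buf when the comment is detected, so the call would be `drop_last(buf, 2)`.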

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by anon View Post
    You are putting backspaces into your output rather than just erasing unwanted characters?
    No, he's putting DELETE into the output stream - that may or may not do the same thing as a backspace - more likely NOT.

    I would probably use some sort of "lookahead" to identify the multi-character tokens [//, /*, */], and use cin.putback() to "undo" the read if it turns out not to be what I wanted. That's traditionally how these things work.


  8. #8
    C++まいる!Cをこわせ! Elysia
    Join Date
    Oct 2007
    Posts
    22,414
    An easier way would be to read the input line by line and search each line for // and /*. If neither is found, print the line. If // is found, copy everything before it to the buffer/screen. If /* is found, copy everything before it to the buffer/screen, then search for */ and resume from there.
    I'll leave it up to you to make good on that suggestion.
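
    One possible reading of that suggestion, as a sketch only: the names `strip_by_line` and `strip_by_line_to_string` are hypothetical. It assumes comment markers never appear inside string literals (the searches would treat them as comments), and it writes a newline after every input line, including lines that lie wholly inside a block comment.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Reads `in` line by line and writes each line to `out` with // and
// /* */ comments removed. String literals are NOT recognized.
void strip_by_line(std::istream& in, std::ostream& out) {
  const std::string::size_type npos = std::string::npos;
  std::string line;
  bool in_block = false;  // true while inside an unclosed /* ... */
  while (std::getline(in, line)) {
    std::string kept;
    std::string::size_type pos = 0;
    while (pos < line.size()) {
      if (in_block) {
        std::string::size_type end = line.find("*/", pos);
        if (end == npos) break;  // comment continues on the next line
        in_block = false;
        pos = end + 2;           // resume after the closing */
      } else {
        std::string::size_type line_c = line.find("//", pos);
        std::string::size_type block_c = line.find("/*", pos);
        if (line_c == npos && block_c == npos) {
          kept.append(line, pos, npos);  // no comment left on this line
          break;
        }
        if (line_c < block_c) {
          kept.append(line, pos, line_c - pos);  // drop the rest of the line
          break;
        }
        kept.append(line, pos, block_c - pos);
        in_block = true;
        pos = block_c + 2;  // skip the opening /*
      }
    }
    out << kept << '\n';
  }
}

// Convenience wrapper for trying the function on a string.
std::string strip_by_line_to_string(const std::string& text) {
  std::istringstream in(text);
  std::ostringstream out;
  strip_by_line(in, out);
  return out.str();
}
```

    This is shorter than a character-level scanner, but the string-literal limitation means the example session from the first post would lose part of its quoted line.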
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.
