Thread: C++ parser/ simplified lexical analyzer

  1. #1
    Registered User
    Join Date
    Jul 2008
    Posts
    72

    C++ parser/ simplified lexical analyzer

    I'm looking for a C++ parser/simplified lexical analyzer that pulls the data from one file and puts it into another text file. The input files look similar to this one:

    Code:
    class A {
    int a[11],x,y,z;
    char *oneString;
    public:
    A() { oneString= new char[100]; }
    ~A() { delete oneString; }
    void f();
    };
    void A::f() {
    int temp;
    x=y+1;
    z=x+2; z= x*x;
    }
    Does anybody know where I can find an example? I haven't had any luck so far at all.

    P.S. What I want to pull out are the C++ reserved words, variable and function names, etc. So the list would look like this:

    int
    char
    a
    x
    class

    etc.
    Last edited by XodoX; 09-07-2010 at 02:54 PM.

  2. #2
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    So you're just trying to create a lexer for C++ then? Separate the file out into tokens and then place one token per line in a separate file or something like that? [edit] Or even just pulling out the reserved words by themselves? [/edit]

    That's not too difficult. You can probably create your own lexer for C++ if you have a list of keywords and are using a tool like flex. If you want to actually parse the C++ code and create an abstract syntax tree representation, that's much harder. There are entire projects devoted to this, such as Elsa. But although Elsa is very good, even it isn't perfect.
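
    To make the "list of keywords" idea concrete, here is a minimal hand-written sketch (not flex). It assumes a C++11 compiler, and the function name classify_words and the deliberately short keyword set are made up for illustration; a real lexer would need the complete keyword list.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits text into words (letters, digits, '_') and
// labels each one "keyword" or "identifier". Words that start with a digit
// are skipped as numbers.
std::vector<std::pair<std::string, std::string>>
classify_words(const std::string &text) {
    // Deliberately partial keyword list; a real lexer needs the full set.
    static const std::set<std::string> keywords = {
        "class", "int", "char", "void", "public", "new", "delete", "return"
    };

    std::vector<std::pair<std::string, std::string>> result;
    std::string word;

    // Iterate one position past the end so the last word gets flushed.
    for (std::string::size_type i = 0; i <= text.size(); ++i) {
        unsigned char c = (i < text.size()) ? text[i] : ' ';
        if (std::isalnum(c) || c == '_') {
            word += static_cast<char>(c);
        } else if (!word.empty()) {
            if (!std::isdigit(static_cast<unsigned char>(word[0]))) {
                const char *kind = keywords.count(word) ? "keyword" : "identifier";
                result.emplace_back(word, kind);
            }
            word.clear();
        }
    }
    return result;
}
```

    For example, classify_words("char *oneString;") should yield ("char", "keyword") followed by ("oneString", "identifier").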

    Maybe you should describe exactly what you're trying to do, and why, and what tools you plan to use.
    Last edited by dwks; 09-07-2010 at 03:42 PM.
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  3. #3
    Registered User
    Join Date
    Jul 2008
    Posts
    72
    It's just supposed to pull the data from the input file and create an output file. Like a simplified lexical analyzer. I can't find an example online though.

  4. #4
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    That's too vague. What do you mean? What data? What will the format of the output file be? Your example seems to indicate that you want C++ keywords as well as identifiers, is that correct?

    I'd start by reading lines from the input file. Then you can take each line and break it apart at whitespace, or alphanumeric/non-alphanumeric character divisions, or whatever takes your fancy. For example:
    Code:
    #include <iostream>
    #include <string>
    #include <cctype>  // for std::isspace(), etc.
    
    void print_tokens(const std::string &data);
    
    int main() {
        std::string line;
        
        while(std::getline(std::cin, line)) {
            print_tokens(line);
        }
        
        return 0;
    }
    
    void print_tokens(const std::string &data) {
        enum {
            MODE_OUTSIDE,
            MODE_IN_WORD,
            MODE_SYMBOL
        } mode = MODE_OUTSIDE;
        
        std::string::size_type x = 0;
        while(x < data.length()) {
            unsigned char c = data[x];  // <cctype> functions need a non-negative value
            switch(mode) {
            case MODE_OUTSIDE:
                if(std::isspace(c)) {
                    x ++;
                }
                else if(std::isalnum(c)) {
                    mode = MODE_IN_WORD;
                    std::cout << '"';
                }
                else {
                    mode = MODE_SYMBOL;
                }
                break;
            case MODE_IN_WORD:
                if(std::isalnum(c)) {
                    std::cout << c;
                    x ++;
                }
                else {
                    std::cout << '"' << std::endl;
                    mode = MODE_OUTSIDE;
                }
                break;
            case MODE_SYMBOL:
                if(!std::isalnum(c) && !std::isspace(c)) {
                    std::cout << "[" << c << "]\n";
                    x ++;
                }
                else {
                    mode = MODE_OUTSIDE;
                }
                break;
            }
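        }
        
        // A word that runs to the end of the line still needs its closing quote.
        if(mode == MODE_IN_WORD) {
            std::cout << '"' << std::endl;
        }
    }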
    I'm afraid that's not the best example, but I guess it's okay. I use a finite state machine to remember the type of the previous character: it's either OUTSIDE (like whitespace or the beginning of the line), or IN_WORD (meaning inside an identifier), or SYMBOL (meaning anything else, like '{'). The switch statement looks at the previous type, and outlines the actions that can be taken based on the current character. For example, if you're OUTSIDE and you see a letter or digit, you transition to IN_WORD. If you're already IN_WORD and you see a letter or digit, you print it and stay in IN_WORD. (If this were a more complete example I'd be building up a std::string containing the current word and then printing it upon a transition out of IN_WORD and into another state.)
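
    Here is a rough sketch of that "build up a std::string" idea. The function name tokenize_line is made up, and the explicit mode variable is replaced by a simpler test: a non-empty word buffer means we are inside a word. Treat it as one possible shape rather than a finished lexer.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical variant of print_tokens(): returns the tokens instead of
// printing them character by character.
std::vector<std::string> tokenize_line(const std::string &data) {
    std::vector<std::string> tokens;
    std::string word;  // the word currently being built, if any

    for (std::string::size_type x = 0; x < data.length(); ++x) {
        unsigned char c = data[x];
        if (std::isalnum(c)) {
            word += data[x];              // still inside a word: keep collecting
        } else {
            if (!word.empty()) {          // transition out of a word: emit it
                tokens.push_back(word);
                word.clear();
            }
            if (!std::isspace(c)) {       // every symbol is its own token
                tokens.push_back(std::string(1, data[x]));
            }
        }
    }
    if (!word.empty()) {                  // a word may run to the end of the line
        tokens.push_back(word);
    }
    return tokens;
}
```

    Returning a vector means the caller decides what to do with each token (print it, filter it, write it to a file) instead of the tokenizer printing directly.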

    The first 50 lines of output when run on itself are as follows.
    Code:
    $ ./lexcpp < lexcpp.cpp | head -n 50
    [#]
    "include"
    [<]
    "iostream"
    [>]
    [#]
    "include"
    [<]
    "string"
    [>]
    [#]
    "include"
    [<]
    "cctype"
    [>]
    [/]
    [/]
    "for"
    "std"
    [:]
    [:]
    "isspace"
    [(]
    [)]
    [,]
    "etc"
    [.]
    "void"
    "print"
    [_]
    "tokens"
    [(]
    "const"
    "std"
    [:]
    [:]
    "string"
    [&]
    "data"
    [)]
    [;]
    "int"
    "main"
    [(]
    [)]
    [{]
    "std"
    [:]
    [:]
    "string"
    [edit] As you can see, it would require more work to understand C++ better. It doesn't know that "//" starts a comment, for example, or that "::" is a single operator rather than two separate ':' symbols. I should probably have used "std::isalnum(c) || c == '_'", because it thinks "print_tokens" is "print" [_] "tokens".

    I've written lots of little programs that understand programming languages to some degree or another. The best way to tackle the problem is by using a finite state machine like I have above. The code colouring above, for example, was done with codeform (specifically the online version), which is a somewhat confusing program that I wrote to look for keywords and make them colourful. [/edit]
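
    As a sketch of how the gaps mentioned in the edit could be closed, the following treats '_' as part of a word, emits "::" as one token, and stops at "//". The name tokenize_cpp_line is made up, and the function still knows nothing about string literals or /* */ comments, so take it as a starting point only.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes one line, treating '_' as a word character,
// "::" as a single token, and "//" as the start of a comment.
// It does not know about string literals, so "//" inside a string
// is wrongly treated as a comment too.
std::vector<std::string> tokenize_cpp_line(const std::string &line) {
    std::vector<std::string> tokens;
    std::string::size_type i = 0;

    while (i < line.size()) {
        unsigned char c = line[i];
        if (std::isspace(c)) {
            ++i;
        } else if (std::isalnum(c) || c == '_') {
            // Consume the whole word in one go.
            std::string::size_type start = i;
            while (i < line.size() &&
                   (std::isalnum(static_cast<unsigned char>(line[i])) || line[i] == '_')) {
                ++i;
            }
            tokens.push_back(line.substr(start, i - start));
        } else if (line.compare(i, 2, "//") == 0) {
            break;  // rest of the line is a comment
        } else if (line.compare(i, 2, "::") == 0) {
            tokens.push_back("::");
            i += 2;
        } else {
            tokens.push_back(std::string(1, line[i]));
            ++i;
        }
    }
    return tokens;
}
```

    With this version, "print_tokens" comes out as one token and "std::string" as "std", "::", "string".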
    Last edited by dwks; 09-07-2010 at 04:22 PM.

  5. #5
    Registered User
    Join Date
    Jul 2008
    Posts
    72
    I mean the code in the input file. The code that I posted in the OP. That's the input file. The program takes the input file and then lists whatever is in there. So it needs to have a path to the input file.
    Perhaps like this.

    Code:
    #include <string>
    #include <fstream>
    #include <iostream>
     
    int main()
    {
    	// Open file for input
    	std::ifstream ifs("input.txt");
     
    	std::string line; // string to contain each line
     
    	
    	while(std::getline(ifs, line))
    	{
    		// deal with each line here...
    		std::cout << line << std::endl;
    	}
     
    	return 0;
    }
    Last edited by XodoX; 09-07-2010 at 04:27 PM.

  6. #6
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    What do you mean by "lists whatever is in there"? Be specific!

    Opening a file is the easy part.
    Code:
    $ cat wordsaslines.cpp
    #include <iostream>
    #include <fstream>
    #include <string>
    
    int main(int argc, char *argv[]) {
        if(argc != 2) {
            std::cerr << "Usage: " << argv[0] << " input-file\n";
            return 1;
        }
    
        std::ifstream input(argv[1]);
        if(!input.is_open()) {
            std::cerr << "Error opening file \"" << argv[1] << "\"\n";
            return 1;
        }
    
        std::string line;
        while(std::getline(input, line)) {
            std::cout << "Got line: [" << line << "]\n";
        }
    
        return 0;
    }
    $ g++ wordsaslines.cpp -o wordsaslines
    $ ./wordsaslines wordsaslines.cpp
    Got line: [#include <iostream>]
    Got line: [#include <fstream>]
    Got line: [#include <string>]
    Got line: []
    Got line: [int main(int argc, char *argv[]) {]
    Got line: [    if(argc != 2) {]
    Got line: [        std::cerr << "Usage: " << argv[0] << " input-file\n";]
    Got line: [        return 1;]
    Got line: [    }]
    Got line: [    ]
    Got line: [    std::ifstream input(argv[1]);]
    Got line: [    if(!input.is_open()) {]
    Got line: [        std::cerr << "Error opening file \"" << argv[1] << "\"\n";]
    Got line: [        return 1;]
    Got line: [    }]
    Got line: [    ]
    Got line: [    std::string line;]
    Got line: [    while(std::getline(input, line)) {]
    Got line: [        std::cout << "Got line: [" << line << "]\n";]
    Got line: [    }]
    Got line: [    ]
    Got line: [    return 0;]
    Got line: [}]
    $
    Writing to an output file is symmetrical except you use std::ofstream ("output file stream") instead.
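
    A minimal sketch of the output side, assuming you simply want to copy lines from one file to another. The function name copy_lines and the -1 error value are made up for this example.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: copies in_path to out_path line by line and returns
// the number of lines written, or -1 if either file could not be opened.
int copy_lines(const std::string &in_path, const std::string &out_path) {
    std::ifstream input(in_path.c_str());
    if (!input.is_open()) {
        return -1;
    }

    std::ofstream output(out_path.c_str());
    if (!output.is_open()) {
        return -1;
    }

    int count = 0;
    std::string line;
    while (std::getline(input, line)) {
        output << line << '\n';   // same operator<< as with std::cout
        ++count;
    }
    return count;
}
```

    Replace the body of the loop with whatever processing you need; anything you would send to std::cout can be sent to the std::ofstream in the same way.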

  7. #7
    Registered User
    Join Date
    Jul 2008
    Posts
    72
    Like I said in the OP: the code in the input files, such as these two.

    Code:
    class A {
    int a[11],x,y,z;
    char *oneString;
    public:
    A() { oneString= new char[100]; }
    ~A() { delete oneString; }
    void f();
    };
    void A::f() {
    int temp;
    x=y+1;
    z=x+2; z= x*x;
    }
    Code:
    int sum(int a,int b) { return a+b; }
    void main()
    {
    int i,x1,x2,r; int a[11];
    for(i=1;i<=10;i++) a[i]=0;
    x1=2; x2=10;
    a[1]= x1*x2;
    r= sum(x1,x2);
    }
    The output would be like this.

    class
    A
    int
    a
    x
    y
    z
    char
    oneString
    public
    new
    delete
    void
    f
    temp
    It just lists it like this. I'm guessing your code in #4 would just do that, if you just add the open file and create an output file.
    Last edited by XodoX; 09-07-2010 at 04:39 PM.

  8. #8
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Yes, it probably would. Although you could make it much simpler by scanning over each character, throwing away anything that isn't a letter and printing a newline at the same time (and setting a flag to make sure you only print one newline).

    Post again if you still have questions.
