Thread: Review my simple Lexer please..

  1. #1
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657

    Review my simple Lexer please..

    Hello (...)
    I wrote this a while ago, while making a calculator.
    This time I'm trying to write a simplistic interpreter.

    So, any suggestions to improve the following code (please read the comments carefully) for reuse?
    Assume that the constructor for the token class works correctly, it(the token class) obviously needs to be remade for this crazy idea !

    Code:
    #include "lexer.h"
    #include "token.h"
    #include <iterator>
    #include <string>
    //#include <iostream>  //activate if needed..
    //using std::cout;
    using std::list;
    using std::string;
    list<token> tokenize(string s)
    {
        list<token> lt;
        string::iterator sit;
    
        enum State {nil,num,ifr,sym} state(nil);
            //^nothing,number,identifier,other single char symbols
    
        int start,length; //indicating start and lenght of substr
    
        char cur(0); //current char
    
        string section; //substr'ed string
    
        bool uniflag(false);
            //If a section is ready to be pushed_back
    
        for(sit=s.begin();sit!=s.end();sit++)
        {
            uniflag = false;
            cur = *sit;
            if(cur==' ')  //for skipping whitespaces && pusing back preceding ones
            {
                if(state==nil)  //Just Skip
                    continue;
                else  //i.e if state is num, ifr or sym
                {//read this block carefully...
    
                    length=sit - s.begin() - start ;
                        //sit (an iterator ) - s.begin()
                        //gives the current position.
                        //^*that - start (an int)
                        //gives the length :D
    
                    section = s.substr(start,length);
    
                    uniflag = true;
                        //Setting the flag for section to be dumped
    
                    state=nil; //reseting state..so that the next char starts from scratch
                }
            }
            else if(cur>='0'&&cur<='9') //condition for num state
            {
                if(state==nil) //Start a new num
                {
                    state=num;
                    start = sit - s.begin();
                }
                else if(state==ifr) //end the ifr,set flag for dump,start a new num
                {
                    state=num;
    
                    length=sit - s.begin() - start ;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
                else if(state==sym) // look up the above else-if
                {
                    state=num;
                    length=sit - s.begin() - start ;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
            else if((cur>='a'&&cur<='z')||(cur>='A'&&cur<='Z')) //look above
            {
                if(state==nil)
                {
                    state=ifr;
                    start = sit - s.begin();
                }
                else if(state==num)
                {
                    state=ifr;
    
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
                else if(state==sym)
                {
                    state=ifr;
    
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
            else //look above for clarification
            {
                if(state==nil)
                {
                    state=sym;
                    start = sit - s.begin();
                }
                else if(state==num)
                {
                    state=sym;
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
                else if(state==ifr)
                {
                    state=sym;
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
                //The following(extra) condition is there to make sure that sym's
                //are always single character..
                //They can be joined(:D)at a higher level if required
                else if(state==sym)
                {
                    state=sym;
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
            if(uniflag==true) //Dumping Station
            {
    //            cout<<'\n'<<section<<'\n';
                lt.push_back(token(section));
            }
            if((sit == s.end()-1)&&(cur!=' ')) //Ending Condition (when not ' ')
            {
                length = s.end()-s.begin()-start;
                section = s.substr(start,length);
    //            cout<<'\n'<<section<<'\n';
                lt.push_back(token(section)); //another D.S.
            }
        }
        return lt;
            //If this return type needs to be changed,
            //^also modify the Dumping Stations above
    }
    The idea is that, it goes through the characters one at a time, determine if it a char, number or a symbol and dumps a substring into the output list when new types (like a number when the previous was a character or a symbol) begin to appear (the exception being symbols, assumed that they'd be single characters) .

    If you find this interesting enough to compile, use the following threadbare token class to test,
    Code:
    class token
    {
    public:
        std::string raw;
        token(std::string& input){raw=input;};
    };
    Last edited by manasij7479; 08-18-2011 at 07:10 AM.

  2. #2
    Registered User
    Join Date
    Mar 2009
    Posts
    344
    A few ideas -

    Look at using ctype macros like isspace, isdigit, etc.

    There's lots of duplicated code in if blocks. For example, this :

    Code:
            else if((cur>='a'&&cur<='z')||(cur>='A'&&cur<='Z')) //look above
            {
                if(state==nil)
                {
                    state=ifr;
                    start = sit - s.begin();
                }
                else if(state==num)
                {
                    state=ifr;
    
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
                else if(state==sym)
                {
                    state=ifr;
    
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
    can become this

    Code:
            else if((cur>='a'&&cur<='z')||(cur>='A'&&cur<='Z')) //look above
            {
                state=ifr;
                            
                if(state!=nil)
               {
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    uniflag = true;
                }
               start = sit - s.begin();
            }
    Or if there are other states, make it a switch with the num and sym cases having shared code.

    I'm also a fan of keeping vars local to the scope they're used in, so declare int length inside each block it's used in instead of at the top level. It's a hint that it's not used later on in the code when you're debugging or changing the code.

  3. #3
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Thanks, the code looks more compact after following some of your suggestions.
    Code:
    #include "lexer.h"
    #include <iterator>
    #include <string>
    #include<cctype>
    using std::list;
    using std::string;
    list<token> tokenize(string s)
    {
        list<token> lt;
        string::iterator sit;
        enum State {nil,num,ifr,sym} state(nil);   //denoting nothing,number,identifier,other single char symbols
        int start,length;    //indicating start and lenght of substr
        char cur(0);   //current char
        string section;  //substr'ed string
        bool uniflag(false);   //for noting if a section is ready to be pushed_back
    
        for(sit=s.begin();sit!=s.end();sit++)
        {
            uniflag = false;
            cur = *sit;
            if(cur==' ')  //for skipping whitespaces && pusing back preceding ones
            {
                if(state==nil)  //Just Skip
                    continue;
                else  //i.e if state is num, ifr or sym
                {//read this block carefully...
                    length=sit - s.begin() - start ;
                        //sit (an iterator ) - s.begin()
                        //gives the current position.
                        //^*that - start (an int)
                        //gives the length :D
                    section = s.substr(start,length);
                    uniflag = true;
                        //Setting the flag for section to be dumped
                    state=nil; //reseting state..so that the next char starts from scratch
                }
            }
            else if(isdigit(cur)) //condition for num state
            {
                if(state==nil) //Start a new num
                {
                    state=num;
                    start = sit - s.begin();
                }
                else if(state==ifr||state==sym) //end the ifr or sym,set flag for dump,start a new num
                {
                    state=num;
                    length=sit - s.begin() - start ;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
            else if(isalpha(cur)||(cur=='"'||cur=='\''))//ifr case(also handles literals) //look above
            {
                if(state==nil)
                {
                    state=ifr;
                    start = sit - s.begin();
                }
                else if(state==num || state == sym)
                {
                    state=ifr;
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
    
            }
            else//case sym //look above for clarification
            {
                if(state==nil)
                {
                    state=sym;
                    start = sit - s.begin();
                }
                else if(state==num||state==ifr||state==sym) //state==sym to make sure that symbols are uni-character
                {
                    state=sym;
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
            if(uniflag==true) //Dumping Station
                lt.push_back(token(section));
    
            if((sit == s.end()-1)&&(cur!=' ')) //Ending Condition (when not ' ')
            {
                length = s.end()-s.begin()-start;
                section = s.substr(start,length);
                lt.push_back(token(section)); //another D.S.
            }
        }
        return lt;//If this return type needs to be changed, also modify the Dumping Stations above.
    }
    Also, I don't think declaring length inside the individual blocks is a good idea, because in this the 'length' has to be stored across the iterations and needs to be accessed from different blocks.

    I ran into another problem though;
    The code as it is now, interprets fractional numbers as two integers separated by a '.' token. Should I make an explicit case to stop that or should it be resolved at a higher state, by making the '.' operator between integers behave to return a double ?

  4. #4
    Registered User
    Join Date
    Mar 2009
    Posts
    344
    Quote Originally Posted by manasij7479 View Post
    Also, I don't think declaring length inside the individual blocks is a good idea, because in this the 'length' has to be stored across the iterations and needs to be accessed from different blocks.
    Are you sure? It seems that length is just a temporary inside each block. It doesn't matter that much, though since the compiler will treat them as separate instances anyway.

    I ran into another problem though;
    The code as it is now, interprets fractional numbers as two integers separated by a '.' token. Should I make an explicit case to stop that or should it be resolved at a higher state, by making the '.' operator between integers behave to return a double ?
    What I've seen in that the lexer will decide if a number is an int or a float and higher levels work with the complete number. Makes sense since the parser typically works on complete tokens rather than having to handle the various x.y, x.ye-z and so on formats the floating point number could be. It makes the parser code easier.

  5. #5
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Quote Originally Posted by KCfromNC
    What I've seen in that the lexer will decide if a number is an int or a float and higher levels work with the complete number. Makes sense since the parser typically works on complete tokens rather than having to handle the various x.y, x.ye-z and so on formats the floating point number could be. It makes the parser code easier.
    Then, would making the tokens go through another loop, where they(and some multi-character operators) are combined back, be a good idea ?

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Code review
    By bennywhere in forum C Programming
    Replies: 16
    Last Post: 10-20-2009, 09:00 PM
  2. Simple calculator -- code review
    By jafet in forum C++ Programming
    Replies: 7
    Last Post: 09-22-2006, 11:49 PM
  3. c++ review
    By san in forum C++ Programming
    Replies: 3
    Last Post: 07-10-2002, 06:56 PM
  4. Can someone review?
    By ob1cnobe in forum C++ Programming
    Replies: 2
    Last Post: 06-03-2002, 09:48 PM