Review my simple Lexer please..

**manasij7479** · 08-18-2011

Hello (...)
I wrote this a while ago, while making a calculator.
This time I'm trying to write a simplistic interpreter.

So, any suggestions to improve the following code (please read the comments carefully) for reuse?
Assume that the constructor for the token class works correctly, it(the token class) obviously needs to be remade for this crazy idea !

Code:

#include "lexer.h"
#include "token.h"
#include <iterator>
#include <string>
//#include <iostream>  //activate if needed..
//using std::cout;
using std::list;
using std::string;
list<token> tokenize(string s)
{
    list<token> lt;
    string::iterator sit;

    enum State {nil,num,ifr,sym} state(nil);
        //^nothing,number,identifier,other single char symbols

    int start,length; //indicating start and lenght of substr

    char cur(0); //current char

    string section; //substr'ed string

    bool uniflag(false);
        //If a section is ready to be pushed_back

    for(sit=s.begin();sit!=s.end();sit++)
    {
        uniflag = false;
        cur = *sit;
        if(cur==' ')  //for skipping whitespaces && pusing back preceding ones
        {
            if(state==nil)  //Just Skip
                continue;
            else  //i.e if state is num, ifr or sym
            {//read this block carefully...

                length=sit - s.begin() - start ;
                    //sit (an iterator ) - s.begin()
                    //gives the current position.
                    //^*that - start (an int)
                    //gives the length :D

                section = s.substr(start,length);

                uniflag = true;
                    //Setting the flag for section to be dumped

                state=nil; //reseting state..so that the next char starts from scratch
            }
        }
        else if(cur>='0'&&cur<='9') //condition for num state
        {
            if(state==nil) //Start a new num
            {
                state=num;
                start = sit - s.begin();
            }
            else if(state==ifr) //end the ifr,set flag for dump,start a new num
            {
                state=num;

                length=sit - s.begin() - start ;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
            else if(state==sym) // look up the above else-if
            {
                state=num;
                length=sit - s.begin() - start ;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }
        else if((cur>='a'&&cur<='z')||(cur>='A'&&cur<='Z')) //look above
        {
            if(state==nil)
            {
                state=ifr;
                start = sit - s.begin();
            }
            else if(state==num)
            {
                state=ifr;

                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
            else if(state==sym)
            {
                state=ifr;

                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }
        else //look above for clarification
        {
            if(state==nil)
            {
                state=sym;
                start = sit - s.begin();
            }
            else if(state==num)
            {
                state=sym;
                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
            else if(state==ifr)
            {
                state=sym;
                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
            //The following(extra) condition is there to make sure that sym's
            //are always single character..
            //They can be joined(:D)at a higher level if required
            else if(state==sym)
            {
                state=sym;
                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }
        if(uniflag==true) //Dumping Station
        {
//            cout<<'\n'<<section<<'\n';
            lt.push_back(token(section));
        }
        if((sit == s.end()-1)&&(cur!=' ')) //Ending Condition (when not ' ')
        {
            length = s.end()-s.begin()-start;
            section = s.substr(start,length);
//            cout<<'\n'<<section<<'\n';
            lt.push_back(token(section)); //another D.S.
        }
    }
    return lt;
        //If this return type needs to be changed,
        //^also modify the Dumping Stations above
}

The idea is that, it goes through the characters one at a time, determine if it a char, number or a symbol and dumps a substring into the output list when new types (like a number when the previous was a character or a symbol) begin to appear (the exception being symbols, assumed that they'd be single characters) .

If you find this interesting enough to compile, use the following threadbare token class to test,

Code:

class token
{
public:
    std::string raw;
    token(std::string& input){raw=input;};
};

**KCfromNC** · 08-18-2011

A few ideas -

Look at using ctype macros like isspace, isdigit, etc.

There's lots of duplicated code in if blocks. For example, this :

Code:

        else if((cur>='a'&&cur<='z')||(cur>='A'&&cur<='Z')) //look above
        {
            if(state==nil)
            {
                state=ifr;
                start = sit - s.begin();
            }
            else if(state==num)
            {
                state=ifr;

                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
            else if(state==sym)
            {
                state=ifr;

                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }

can become this

Code:

        else if((cur>='a'&&cur<='z')||(cur>='A'&&cur<='Z')) //look above
        {
            state=ifr;
                        
            if(state!=nil)
           {
                length=sit - s.begin() - start;
                section = s.substr(start,length);
                uniflag = true;
            }
           start = sit - s.begin();
        }

Or if there are other states, make it a switch with the num and sym cases having shared code.

I'm also a fan of keeping vars local to the scope they're used in, so declare int length inside each block it's used in instead of at the top level. It's a hint that it's not used later on in the code when you're debugging or changing the code.

**manasij7479** · 08-18-2011

Thanks, the code looks more compact after following some of your suggestions.

Code:

#include "lexer.h"
#include <iterator>
#include <string>
#include<cctype>
using std::list;
using std::string;
list<token> tokenize(string s)
{
    list<token> lt;
    string::iterator sit;
    enum State {nil,num,ifr,sym} state(nil);   //denoting nothing,number,identifier,other single char symbols
    int start,length;    //indicating start and lenght of substr
    char cur(0);   //current char
    string section;  //substr'ed string
    bool uniflag(false);   //for noting if a section is ready to be pushed_back

    for(sit=s.begin();sit!=s.end();sit++)
    {
        uniflag = false;
        cur = *sit;
        if(cur==' ')  //for skipping whitespaces && pusing back preceding ones
        {
            if(state==nil)  //Just Skip
                continue;
            else  //i.e if state is num, ifr or sym
            {//read this block carefully...
                length=sit - s.begin() - start ;
                    //sit (an iterator ) - s.begin()
                    //gives the current position.
                    //^*that - start (an int)
                    //gives the length :D
                section = s.substr(start,length);
                uniflag = true;
                    //Setting the flag for section to be dumped
                state=nil; //reseting state..so that the next char starts from scratch
            }
        }
        else if(isdigit(cur)) //condition for num state
        {
            if(state==nil) //Start a new num
            {
                state=num;
                start = sit - s.begin();
            }
            else if(state==ifr||state==sym) //end the ifr or sym,set flag for dump,start a new num
            {
                state=num;
                length=sit - s.begin() - start ;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }
        else if(isalpha(cur)||(cur=='"'||cur=='\''))//ifr case(also handles literals) //look above
        {
            if(state==nil)
            {
                state=ifr;
                start = sit - s.begin();
            }
            else if(state==num || state == sym)
            {
                state=ifr;
                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }


        }
        else//case sym //look above for clarification
        {
            if(state==nil)
            {
                state=sym;
                start = sit - s.begin();
            }
            else if(state==num||state==ifr||state==sym) //state==sym to make sure that symbols are uni-character
            {
                state=sym;
                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }
        if(uniflag==true) //Dumping Station
            lt.push_back(token(section));

        if((sit == s.end()-1)&&(cur!=' ')) //Ending Condition (when not ' ')
        {
            length = s.end()-s.begin()-start;
            section = s.substr(start,length);
            lt.push_back(token(section)); //another D.S.
        }
    }
    return lt;//If this return type needs to be changed, also modify the Dumping Stations above.
}

Also, I don't think declaring length inside the individual blocks is a good idea, because in this the 'length' has to be stored across the iterations and needs to be accessed from different blocks.

I ran into another problem though;
The code as it is now, interprets fractional numbers as two integers separated by a '.' token. Should I make an explicit case to stop that or should it be resolved at a higher state, by making the '.' operator between integers behave to return a double ?

**KCfromNC** · 08-19-2011

Originally Posted by manasij7479

Also, I don't think declaring length inside the individual blocks is a good idea, because in this the 'length' has to be stored across the iterations and needs to be accessed from different blocks.

Are you sure? It seems that length is just a temporary inside each block. It doesn't matter that much, though since the compiler will treat them as separate instances anyway.

I ran into another problem though;
The code as it is now, interprets fractional numbers as two integers separated by a '.' token. Should I make an explicit case to stop that or should it be resolved at a higher state, by making the '.' operator between integers behave to return a double ?

What I've seen in that the lexer will decide if a number is an int or a float and higher levels work with the complete number. Makes sense since the parser typically works on complete tokens rather than having to handle the various x.y, x.ye-z and so on formats the floating point number could be. It makes the parser code easier.

**manasij7479** · 08-19-2011

Originally Posted by KCfromNC

What I've seen in that the lexer will decide if a number is an int or a float and higher levels work with the complete number. Makes sense since the parser typically works on complete tokens rather than having to handle the various x.y, x.ye-z and so on formats the floating point number could be. It makes the parser code easier.

Then, would making the tokens go through another loop, where they(and some multi-character operators) are combined back, be a good idea ?

Thread: Review my simple Lexer please..

Thread Tools

Search Thread

Display

Review my simple Lexer please..

Similar Threads

Code review

Simple calculator -- code review

c++ review

Can someone review?