Yet another problem with substr

**manasij7479** · 05-25-2011

I'm writing a lexer to generate a list of tokens from a std::string.
The string called section is not getting the values it should (if there is nothing else wrong).
The code is as follows. The problem is (almost surely) at the bold regions.

Code:

#include "lexer.h"
#include "token.h"
#include <iterator>
#include <list>
#include <string>
//#include <iostream>
//using std::cout;
using std::list;
using std::string;
list<token> tokenize(string s)
{
    list<token> lt;
    string::iterator sit;
    enum State {nil,num,ifr,sym} state(nil);
        //^nothing,number,identifier,other single char symbols
    for(sit=s.begin();sit!=s.end();sit++)
    {
        static int start,length;
        static char cur; //current char
        static string section; //substr'ed string
        static bool uniflag(false);
            //If a section is ready to be pushed_back
        cur = *sit;
        if(cur>='0'&&cur<='9')
        {
            if(state==nil)
            {
                state=num;
                start = sit - s.begin();
            }
            else if(state==ifr)
            {
                state=num;
                
                length=sit - s.begin() - start ;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }
        else if(cur>='a'&&cur<='z')
        {
            if(state==nil)
            {
                state=ifr;
                start = sit - s.begin();
            }
            else if(state==num)
            {
                state=ifr;
                
                length=sit - s.begin() - start;
                section = s.substr(start,length);       
                start = sit - s.begin();
                uniflag = true;
            }


        }
        else
        {
            state = sym;
            section = cur;
            uniflag = true;
        }

        if(uniflag==true)
        {
//            cout<<'\n'<<section<<'\n';
            lt.push_back(token(section));
        }

    }

    return lt;
}

**tabstop** · 05-25-2011

So if length = start - start, I'm guessing that is always going to be 0. You probably want to compute the length using the old value of start, meaning you should compute length first then reset start to the start of the new token.
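A minimal sketch of the ordering described above, assuming a hypothetical helper named cutSection that is not part of the posted lexer: the length is taken with the old value of start, and only then is start moved to the current position.

```cpp
#include <string>

// Hypothetical helper (not from the thread): cut out the token that ends
// just before `sit`, then move `start` to the beginning of the next token.
std::string cutSection(const std::string& s,
                       std::string::const_iterator sit,
                       int& start)
{
    const int here = static_cast<int>(sit - s.begin()); // offset of the current char
    const int length = here - start;                    // uses the OLD start
    const std::string section = s.substr(start, length);
    start = here;                                       // only now move start forward
    return section;
}
```

If start were updated before length is computed, length would be here - here, which is 0, and substr would return an empty string.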

**manasij7479** · 05-25-2011

Did so and edited the original, but the same problem remains:
'section' still isn't getting anything but garbage.

**tabstop** · 05-25-2011

I kind of hoped you would fix length in the obvious way, but I should have said: WTF are you using sit.begin()? Get it out, get it out! Where you are minus where the token started, not where the string started.

**manasij7479** · 05-25-2011

I'm somewhat lost again.
Please explain that again.
I never used sit.begin().
And why would

Code:

sit - s.begin() - start ;

//..contain the address of (minus where the token started) ?

**tabstop** · 05-25-2011

EDIT: Previous stuff gone.

start is an int, not an iterator. That's what I was missing.

Your sym case is broken, since it doesn't reset things (or, if it's the first case, set things). Hard to tell what's up with the rest but will keep looking. Sorry about the bogus stuff that was here earlier.

**tabstop** · 05-25-2011

Also: why all the static variables?

**manasij7479** · 05-25-2011

I was not sure if the repeated running of the for loop would try to declare the variables again and again. Would it?

*fixed the sym case*
I was also missing the condition for what happens when it reaches the end of input.
Debugging it now (there still seems to be a problem with the 'ifr' case): the output for "a+10b=99" comes out as [+,+,+,+,10,=,=,=,99]

**whiteflags** · 05-25-2011

> ...if the repeated running of the for loop would try to declare the variables again and again... Would it?

Yes, and I'm not sure why this is a problem. If you don't want a variable to be redefined with every iteration the answer is to declare such a variable before the loop, extending its scope. Changing the variable's lifetime semantics with static is unnecessary and may actually introduce weird side effects.
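To illustrate the kind of side effect meant here, a small sketch with made-up function names (hasDigitStatic and hasDigit are not from the posted lexer). A static local is initialised once and keeps its value afterwards, so a flag that is never reset stays set. That is probably related to the repeated tokens in the output quoted above, where the static uniflag is never set back to false.

```cpp
#include <string>

// Hypothetical example, not the thread's lexer: the flag is a static local,
// so it is initialised once and keeps its value between calls.
bool hasDigitStatic(const std::string& s)
{
    static bool found(false);
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (s[i] >= '0' && s[i] <= '9')
            found = true;
    return found;
}

// Same logic with an ordinary local declared before the loop:
// it starts out false on every call.
bool hasDigit(const std::string& s)
{
    bool found = false;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (s[i] >= '0' && s[i] <= '9')
            found = true;
    return found;
}
```

Calling hasDigitStatic("a1") and then hasDigitStatic("abc") returns true both times, because the second call still sees the flag left behind by the first. hasDigit("abc") returns false regardless of earlier calls.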

**manasij7479** · 05-26-2011

Corrected it. Everything works as expected now. Any other suggestions?

Code:

#include "lexer.h"
#include "token.h"
#include <iterator>
#include <list>
#include <string>
//#include <iostream>
//using std::cout;
using std::list;
using std::string;
list<token> tokenize(string s)
{
    list<token> lt;
    string::iterator sit;
    enum State {nil,num,ifr,sym} state(nil);
        //^nothing,number,identifier,other single char symbols
    int start,length;
    char cur; //current char
    string section; //substr'ed string
    bool uniflag(false);
        //If a section is ready to be pushed_back
    for(sit=s.begin();sit!=s.end();sit++)
    {
        uniflag = false;
        cur = *sit;
        if(cur>='0'&&cur<='9')
        {
            if(state==nil)
            {
                state=num;
                start = sit - s.begin();
            }
            else if(state==ifr)
            {
                state=num;

                length=sit - s.begin() - start ;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
            else if(state==sym)
            {
                state=num;

                length=sit - s.begin() - start ;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }
        else if(cur>='a'&&cur<='z')
        {
            if(state==nil)
            {
                state=ifr;
                start = sit - s.begin();
            }
            else if(state==num)
            {
                state=ifr;

                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
            else if(state==sym)
            {
                state=ifr;

                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }

        }
        else //if(state==sym) ..
        {
            if(state==nil)
            {
                state=sym;
                start = sit - s.begin();
            }
            else if(state==num)
            {
                state=sym;
                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
            else if(state==ifr)
            {
                state=sym;
                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;
            }
        }
        if(uniflag==true)
        {
//            cout<<'\n'<<section<<'\n';
            lt.push_back(token(section));
        }
        if((sit == s.end()-1))
        {
            length = s.end()-s.begin()-start;
            section = s.substr(start,length);
//            cout<<'\n'<<section<<'\n';
            lt.push_back(token(section));
        }
    }
    return lt;
}

The main part of the code is almost repeated 3 times. Is there a way to make it common for the 3 cases? (A function would require so many arguments that it'd become tedious.)

**whiteflags** · 05-26-2011

Since this is in almost every branch of your if-else logic,

Code:

                length=sit - s.begin() - start;
                section = s.substr(start,length);
                start = sit - s.begin();
                uniflag = true;

it should have the same effect if you move it to a lower tab level. The only time you do anything different is when state == nil.
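One way this could look, as a sketch only: classify each character first, then run the common code once whenever the class changes. The names classify and tokenizeSketch are invented here, and std::string stands in for the thread's token class, whose definition is not shown.

```cpp
#include <list>
#include <string>

enum State { nil, num, ifr, sym };

// Classify one character the same way the posted if/else chain does.
State classify(char c)
{
    if (c >= '0' && c <= '9') return num;
    if (c >= 'a' && c <= 'z') return ifr;
    return sym;
}

// Sketch: std::string is an assumed stand-in for the thread's token class.
std::list<std::string> tokenizeSketch(const std::string& s)
{
    std::list<std::string> lt;
    State state = nil;
    std::string::size_type start = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        const State next = classify(s[i]);
        if (next == state)
            continue;                                  // still inside the same token
        if (state != nil)                              // a token just ended: common code, once
            lt.push_back(s.substr(start, i - start));
        state = next;
        start = i;
    }
    if (state != nil)                                  // flush the last token at end of input
        lt.push_back(s.substr(start));
    return lt;
}
```

With this shape the only special case left is state == nil, and a run of symbol characters such as "++" still comes out as one token, as in the posted version.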

**manasij7479** · 05-26-2011

Don't know why, but doing so brought up a lot of problems, so I'm leaving it alone for now.

There's another big logical problem: what should the lexer do in the face of tokens like '++'?
How to decide whether it is a single '++' increment, or two '+'s in RPN or prefix notation?

**manasij7479** · 05-26-2011

For the last problem ('++'), I thought about keeping them as a single token for now and then breaking them up later depending on context. Is that the right approach?
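For what it's worth, a common alternative is to split a run of symbol characters by longest match against a table of known operators (often called "maximal munch"). The sketch below assumes a made-up operator table and a hypothetical function splitSymbols; neither comes from this thread, and it does not settle the context question (RPN versus increment), which a later stage would still have to decide.

```cpp
#include <cstddef>
#include <list>
#include <string>

// Hypothetical operator table for illustration only. Two-character operators
// are listed first so that they win over their one-character prefixes.
static const char* const kOperators[] = { "++", "--", "+=", "==", "+", "-", "=", "*", "/" };

// Split a run of symbol characters into operators, longest match first.
std::list<std::string> splitSymbols(const std::string& run)
{
    std::list<std::string> out;
    const std::size_t count = sizeof(kOperators) / sizeof(kOperators[0]);
    std::string::size_type pos = 0;
    while (pos < run.size())
    {
        std::string match(1, run[pos]);            // fallback: a single character
        for (std::size_t i = 0; i < count; ++i)
        {
            const std::string op(kOperators[i]);
            if (run.compare(pos, op.size(), op) == 0)
            {
                match = op;
                break;                             // table order makes this the longest match
            }
        }
        out.push_back(match);
        pos += match.size();
    }
    return out;
}
```

Under these assumptions "++" stays one token, while "+++" is split into "++" followed by "+".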
