Thread: Yet another problem with substr

  1. #1
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657

    I'm writing a lexer to generate a list of tokens from a std::string.
    The string called section is not getting the values it should (assuming nothing else is wrong).
    The code is as follows; the problem is almost surely in the bold regions.
    Code:
    #include "lexer.h"
    #include "token.h"
    #include <iterator>
    #include <string>
    //#include <iostream>
    //using std::cout;
    using std::list;
    using std::string;
    list<token> tokenize(string s)
    {
        list<token> lt;
        string::iterator sit;
        enum State {nil,num,ifr,sym} state(nil);
            //^nothing,number,identifier,other single char symbols
        for(sit=s.begin();sit!=s.end();sit++)
        {
            static int start,length;
            static char cur; //current char
            static string section; //substr'ed string
            static bool uniflag(false);
                //If a section is ready to be pushed_back
            cur = *sit;
            if(cur>='0'&&cur<='9')
            {
                if(state==nil)
                {
                    state=num;
                    start = sit - s.begin();
                }
                else if(state==ifr)
                {
                    state=num;
                    
                    length=sit - s.begin() - start ;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
            else if(cur>='a'&&cur<='z')
            {
                if(state==nil)
                {
                    state=ifr;
                    start = sit - s.begin();
                }
                else if(state==num)
                {
                    state=ifr;
                    
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);       
                    start = sit - s.begin();
                    uniflag = true;
                }
    
    
            }
            else
            {
                state = sym;
                section = cur;
                uniflag = true;
            }
    
            if(uniflag==true)
            {
    //            cout<<'\n'<<section<<'\n';
                lt.push_back(token(section));
            }
    
        }
    
        return lt;
    }
    Last edited by manasij7479; 05-25-2011 at 09:43 PM.

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    So if length = start - start, I'm guessing that is always going to be 0. You probably want to compute the length using the old value of start, meaning you should compute length first then reset start to the start of the new token.
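
    A minimal sketch of the ordering described above, assuming the token boundaries are tracked as indices into the string. The helper name cut_token and its parameters are hypothetical and do not appear in the original code; the point is only that length is computed from the old value of start before start is moved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the thread's code): cut out the token that
// began at `start` and ends just before index `pos`, then advance `start`.
std::string cut_token(const std::string& s, std::size_t& start, std::size_t pos)
{
    assert(start <= pos && pos <= s.size());
    const std::size_t length = pos - start;         // uses the OLD start
    std::string section = s.substr(start, length);  // the finished token
    start = pos;                                    // only now move start forward
    return section;
}
```

    If start were reset first, length would always come out as 0 and substr would return an empty string.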

  3. #3
    [](){}(); manasij7479's Avatar
    Did so and edited the original, but the same problem remains:
    'section' still isn't getting anything but garbage.
    Last edited by manasij7479; 05-25-2011 at 09:49 PM.

  4. #4
    and the Hat of Guessing tabstop's Avatar
    I kind of hoped you would fix length in the obvious way, but I should have said: WTF are you using sit.begin()? Get it out, get it out! Where you are minus where the token started, not where the string started.

  5. #5
    [](){}(); manasij7479's Avatar
    I'm somewhat lost again.
    Please explain that again.
    I never used sit.begin().
    And why would
    Code:
    sit - s.begin() - start ;
    //..contain the address of (minus where the token started) ?

  6. #6
    and the Hat of Guessing tabstop's Avatar
    EDIT: Previous stuff gone.

    start is an int, not an iterator. That's what I was missing.

    Your sym case is broken, since it doesn't reset things (or, if it's the first case, set things). Hard to tell what's up with the rest but will keep looking. Sorry about the bogus stuff that was here earlier.
    Last edited by tabstop; 05-25-2011 at 10:19 PM.

  7. #7
    and the Hat of Guessing tabstop's Avatar
    Also: why all the static variables?

  8. #8
    [](){}(); manasij7479's Avatar
    I was not sure if the repeated running of the for loop would try to declare the variables again and again. Would it?

    *fixed the sym case*
    I was also missing the condition for what happens when it reaches the end of input.
    Debugging it now (there still seems to be a problem with the 'ifr' case): the output for "a+10b=99" comes out as [+,+,+,+,10,=,=,=,99]

  9. #9
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    ..if the repeated running of the for loop would try to declare the variables again and again....Would it?
    Yes, and I'm not sure why this is a problem. If you don't want a variable to be redefined with every iteration the answer is to declare such a variable before the loop, extending its scope. Changing the variable's lifetime semantics with static is unnecessary and may actually introduce weird side effects.
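
    To illustrate the kind of side effect meant here, a small sketch follows. Both function names are made up for this example. The static version is initialised only once, on the first call, so its value carries over into later calls, while the ordinary local declared before the loop starts fresh every time the function runs.

```cpp
#include <string>

// Hypothetical demo: the counter is static, so it survives between calls.
int count_digits_static(const std::string& s)
{
    static int count = 0;   // initialised once, on the first call only
    for (char c : s)
        if (c >= '0' && c <= '9')
            ++count;
    return count;
}

// Same logic with an ordinary local declared before the loop.
int count_digits_local(const std::string& s)
{
    int count = 0;          // fresh for every call, shared by all iterations
    for (char c : s)
        if (c >= '0' && c <= '9')
            ++count;
    return count;
}
```

    Calling count_digits_static twice on the same input gives a different answer the second time, which is rarely what a tokenizer wants if it is called on more than one string.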

  10. #10
    [](){}(); manasij7479's Avatar
    Corrected it. Everything works as expected now. Any other suggestions?
    Code:
    #include "lexer.h"
    #include "token.h"
    #include <iterator>
    #include <list>
    #include <string>
    //#include <iostream>
    //using std::cout;
    using std::list;
    using std::string;
    list<token> tokenize(string s)
    {
        list<token> lt;
        string::iterator sit;
        enum State {nil,num,ifr,sym} state(nil);
            //^nothing,number,identifier,other single char symbols
        int start,length;
        char cur; //current char
        string section; //substr'ed string
        bool uniflag(false);
            //If a section is ready to be pushed_back
        for(sit=s.begin();sit!=s.end();sit++)
        {
            uniflag = false;
            cur = *sit;
            if(cur>='0'&&cur<='9')
            {
                if(state==nil)
                {
                    state=num;
                    start = sit - s.begin();
                }
                else if(state==ifr)
                {
                    state=num;
    
                    length=sit - s.begin() - start ;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
                else if(state==sym)
                {
                    state=num;
    
                    length=sit - s.begin() - start ;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
            else if(cur>='a'&&cur<='z')
            {
                if(state==nil)
                {
                    state=ifr;
                    start = sit - s.begin();
                }
                else if(state==num)
                {
                    state=ifr;
    
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
                else if(state==sym)
                {
                    state=ifr;
    
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
    
            }
            else //if(state==sym) ..
            {
                if(state==nil)
                {
                    state=sym;
                    start = sit - s.begin();
                }
                else if(state==num)
                {
                    state=sym;
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
                else if(state==ifr)
                {
                    state=sym;
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
                }
            }
            if(uniflag==true)
            {
    //            cout<<'\n'<<section<<'\n';
                lt.push_back(token(section));
            }
            if((sit == s.end()-1))
            {
                length = s.end()-s.begin()-start;
                section = s.substr(start,length);
    //            cout<<'\n'<<section<<'\n';
                lt.push_back(token(section));
            }
        }
        return lt;
    }
    The main part of the code is repeated almost three times. Is there a way to share it across the three cases? (A function would need so many arguments that it would become tedious.)
    Last edited by manasij7479; 05-26-2011 at 12:16 AM.

  11. #11
    Lurking whiteflags's Avatar
    Since this is in almost every branch of your if-else logic,
    Code:
                    length=sit - s.begin() - start;
                    section = s.substr(start,length);
                    start = sit - s.begin();
                    uniflag = true;
    it should have the same effect if you move it to a lower tab level. The only time you do anything different is when state == nil.
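
    One way this could look, as a rough sketch and not the poster's actual code: classify each character first, then handle every state change in a single place. The token class is not shown in the thread, so this version assumes plain std::string tokens, and it uses an index where the original uses an iterator. The helper classify is a name invented for this example.

```cpp
#include <cstddef>
#include <list>
#include <string>

enum State { nil, num, ifr, sym };   // nothing, number, identifier, symbol

// Hypothetical helper: the same character tests as the if/else chain above.
State classify(char c)
{
    if (c >= '0' && c <= '9') return num;
    if (c >= 'a' && c <= 'z') return ifr;
    return sym;
}

// Assumption: tokens are plain strings, since `token` is not shown.
std::list<std::string> tokenize(const std::string& s)
{
    std::list<std::string> lt;
    State state = nil;
    std::size_t start = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const State next = classify(s[i]);
        if (next == state)
            continue;                               // still inside the same token
        if (state != nil)                           // a token just ended: cut it out
            lt.push_back(s.substr(start, i - start));
        state = next;
        start = i;
    }
    if (state != nil)                               // flush the token that runs to the end
        lt.push_back(s.substr(start));
    return lt;
}
```

    With this layout, "a+10b=99" yields a, +, 10, b, =, 99. As in the listing in post #10, a run of consecutive symbols such as "++" stays together as one token.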

  12. #12
    [](){}(); manasij7479's Avatar
    Don't know why, but doing so brought up a lot of problems, so I'm leaving it alone for now.

    There's another big logical problem: what should the lexer do with tokens like '++'?
    How does it decide whether that is a single '++' increment or two '+'s in RPN or prefix notation?

  13. #13
    [](){}(); manasij7479's Avatar
    For the last problem ('++'), I thought about keeping them as a single token for now and then breaking them up later depending on context. Is that the right approach?
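
    The thread does not settle this question. One common option, offered here only as a sketch, is to split a run of symbol characters by longest match against a table of known operators, which is how C and C++ lexers treat ++. The operator table and the function name split_symbols are assumptions made for this example; telling an increment apart from two separate + operators in RPN would still be left to a later, context-aware stage.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical operator table; a real lexer would list its own operators.
// Longer operators come first so that the longest match wins.
const std::vector<std::string> kOperators = {"++", "+=", "==", "+", "="};

// Split a run of symbol characters into operators, longest match first.
// A character that matches no known operator becomes a one-character token.
std::vector<std::string> split_symbols(const std::string& run)
{
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < run.size())
    {
        std::string match(1, run[pos]);             // fallback: single character
        for (const std::string& op : kOperators)
        {
            if (run.compare(pos, op.size(), op) == 0)
            {
                match = op;
                break;                              // table is ordered longest-first
            }
        }
        out.push_back(match);
        pos += match.size();
    }
    return out;
}
```

    Under this rule "+++" splits into ++ followed by +, never + followed by ++.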
