Thread: Lexing - looking at the "big picture"

  1. #1
    Registered User
    Join Date
    Dec 2012
    Posts
    34

    Lexing - looking at the "big picture"

    Hi all -

    I'm trying to get a much better idea of the "big picture" of the lexing process.

    In particular, I have noticed that in the lexers I've seen (not that many...), most of the "sub-lexer" functions (e.g. for keywords or identifiers) don't take any arguments, yet they somehow collect characters and return a token.

    It is simple enough to write sub-lexers that take a string, process a few characters and then hand the result back to the main lex function. What I have difficulty with is the "mismatch" between iterating along a string (or through a file) and producing a series (stream) of tokens.

    What I'm getting at is - how can a function that takes no parameters get characters from a string and build a token from those characters?

    What I would find really useful would be a description of a lexing/parsing "skeleton" (even in pseudo-code) that shows the "big picture".
    In *particular*, the *flow* from part to part - that's what I have problems picturing.

    For example (feel free to correct / "beef out" the pseudo-code):

    [code]

    Pseudo-code

    function - lex_kw_or_identifier
    Arguments: none
    Gets characters from ??? by using the ??? function.
    Collects those characters and returns a token.

    function - main_lexer
    Arguments: none
    Gets characters from the string/file using the ??? function.
    Depending on the character type, calls one of the sub-lexing functions.
    Returns the token received from those sub-lexing functions.

    function - parser
    Arguments: one ( token ) (or none?)
    Gets a token (from main_lexer) and uses an "expect"
    function to check validity.

    function - main()
    The main C function.
    Arguments: filename (or none if parsing a string).
    If parsing a string - iterate along it ( ??? should this be done here or elsewhere ??? )
    Call the parse function.

    *** From iterating along the string, how do I get to a sequence of tokens? ***

    [/code]

    I have difficulty "matching" the iterating-along-a-string (or through-a-file) part with the "get a sequence of tokens" part. I hope you can see what I mean there...
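    To make the question more concrete, here is roughly how I *imagine* the pieces might fit together in C - a shared "cursor" pointer that the argument-less sub-lexers all read from. All the names here (peek_char, get_next_token, etc.) are just ones I made up, and this is only my guess at a minimal skeleton:

    [code]
    #include <stdio.h>
    #include <ctype.h>

    enum tok_type { TOK_IDENT, TOK_NUMBER, TOK_EOF };

    struct token
    {
        enum tok_type type;
        const char   *start;    /* where the lexeme begins in the source text */
        int           length;
    };

    /* The shared "cursor" - my guess at how argument-less sub-lexers get characters */
    static const char *cursor;

    static int  peek_char(void) { return (unsigned char)*cursor; }
    static void skip_char(void) { ++cursor; }

    /* Sub-lexer: takes no arguments, reads through the shared cursor */
    static struct token lex_kw_or_identifier(void)
    {
        const char *start = cursor;
        while (isalnum(peek_char()) || peek_char() == '_')
            skip_char();
        /* a real lexer would compare start..cursor against a keyword table here */
        return (struct token){ TOK_IDENT, start, (int)(cursor - start) };
    }

    static struct token lex_number(void)
    {
        const char *start = cursor;
        while (isdigit(peek_char()))
            skip_char();
        return (struct token){ TOK_NUMBER, start, (int)(cursor - start) };
    }

    /* Main lexer: looks at the current character and dispatches to a sub-lexer */
    static struct token get_next_token(void)
    {
        while (isspace(peek_char()))
            skip_char();
        if (isalpha(peek_char()) || peek_char() == '_')
            return lex_kw_or_identifier();
        if (isdigit(peek_char()))
            return lex_number();
        return (struct token){ TOK_EOF, cursor, 0 };   /* '\0' (or anything unknown) ends it */
    }

    /* "Parser": here it just pulls tokens one at a time until EOF and prints them */
    static void parse(const char *source)
    {
        cursor = source;                /* initialise the shared cursor once */
        for (;;) {
            struct token t = get_next_token();
            if (t.type == TOK_EOF)
                break;
            printf("token: %.*s\n", t.length, t.start);
        }
    }

    int main(void)
    {
        parse("while count42 do stuff 99");
        return 0;
    }
    [/code]

    Is that roughly the idea - the characters come from some shared state (a global/static pointer, or a struct passed around), so the sub-lexers don't need any arguments of their own?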

    Any comments are appreciated!
    Cheers -
    Andy

  2. #2
    Registered User
    Join Date
    Jun 2017
    Posts
    157
    This article might give you some ideas.
    Syntax analysis: an example - CodeProject

  3. #3
    Registered User
    Join Date
    Dec 2012
    Posts
    34
    Quote Originally Posted by OldGuy2 View Post
    This article might give you some ideas.
    Syntax analysis: an example - CodeProject
    Thanks OldGuy2 - that article is great! Exactly what I was after!

    I apologise to all for popping in here so much in the last few days. Anyway, thanks to the help given here (and some experimenting) I now have all that I need!

    Cheers, bye for now -
    - Andy ( latte123 )

  4. #4
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Not sure if this helps, but just another way of approaching the whole lexing/tokenizing process:

    Code:
    #include <stdio.h> 
    #include <stdlib.h> 
    #include <string.h> 
     
    /* 
        Pattern matching functions; only length of matches returned 
    */ 
     
    int match_range(const char* current, char lower, char upper) 
    { 
        const char* next = current; 
        while(*next >= lower && *next <= upper) 
            ++next; 
        return next - current; 
    } 
     
    int match_group(const char* current, const char* group) 
    { 
        const char* next = current; 
        while(*next != 0 && strchr(group, *next) != 0) 
            ++next; 
        return next - current; 
    } 
     
    int match_character(const char* current, char character) 
    { 
        return *current == character ? 1 : 0; 
    } 
     
    int match_integer(const char* current) 
    { 
        int prefixed = 0; 
        if(match_character(current, '+') || match_character(current, '-')) 
            prefixed = 1; 
        int digits = match_range(current + prefixed, '0', '9'); 
        if(digits == 0)                       /* a bare '+' or '-' is not an integer */ 
            return 0; 
        int count = prefixed + digits; 
    /* 
        Assume the convention that an integer cannot adjoin an identifier... 
    */     
        int match_identifier(const char*);    // Forward declaration 
        if(match_identifier(current + count)) 
            return 0; 
        return count; 
    } 
     
    int match_identifier(const char* current) 
    { 
    /* 
        Assume the convention that an identifier cannot begin with an integer... 
    */     
        const char* next = current; 
        int count = match_range(next, 'a', 'z') + match_range(next, 'A', 'Z') + match_character(next, '_'); 
        if(count == 0) 
            return 0; 
        next += count; 
        for(;;) 
        { 
            count = match_range(next, '0', '9') + match_range(next, 'a', 'z') + match_range(next, 'A', 'Z') + match_character(next, '_'); 
            if(count == 0) 
                break; 
            next += count; 
        } 
        return next - current; 
    } 
     
    int match_space(const char* current) 
    { 
        return match_group(current, "\x20\t\r\n"); 
    } 
     
    /* 
        Token lexer stuff 
    */ 
     
    enum 
    { 
        type_integer,  
        type_identifier,  
        type_space 
    }; 
     
    const char* token_type_to_text(int type) 
    { 
        if(type == type_integer) 
            return "integer";  
        if(type == type_identifier) 
            return "identifier"; 
        if(type == type_space) 
            return "space"; 
        return "???"; 
    } 
     
    /* 
        This would normally be the function that actually moves the token  
        information into a data structure. We'll just print them for now... 
    */ 
    void process_token(int type, const char* current, int length) 
    { 
        printf("token: '"); 
        const char* next = current, * end = next + length; 
        while(next != end) 
            putchar(*next++); 
        printf("' [type: %s, length: %d]\n", token_type_to_text(type), length); 
    } 
     
    struct token_matcher 
    { 
        int type; 
        int (*match)(const char*);     
    }; 
     
    /* 
        Note: these should be ordered by decreasing "greed" 
        (eg: a matcher for '++' should come before one for '+')
    */
    struct token_matcher token_matchers[] =  
    { 
        { 
            type_integer,  
            match_integer 
        },  
        { 
            type_identifier,  
            match_identifier 
        },      
        { 
            type_space,  
            match_space 
        },  
    }; 
     
    void tokenize(const char* input) 
    { 
        const char* current = input; 
        for(;;) 
        { 
            int found = 0; 
            for(int index = 0; index < sizeof(token_matchers)/sizeof(token_matchers[0]); ++index) 
            { 
                int length = token_matchers[index].match(current); 
                if(length != 0) 
                { 
                    process_token(token_matchers[index].type, current, length); 
                    current += length; 
                    found = 1; 
                } 
            } 
            if(found == 0) 
                break; 
        } 
        if(*current != 0) 
            fprintf(stderr, "Warning: end of input not reached!\nData remaining: %s\n", current);  
    } 
     
    int main(void) 
    { 
        char buffer[1024] = "select mycol from mytable";  
        const int size = sizeof(buffer); 
        puts(buffer); 
        for(;;) 
        { 
            tokenize(buffer);  
            puts("Enter some more text to test:"); 
            if(fgets(buffer, size, stdin) == NULL)    /* stop on end-of-input */ 
                break; 
            if(buffer[0] == '\n') 
                break; 
        } 
        return 0;  
    }
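
    By the way, if you wanted to actually collect the tokens rather than just print them, a drop-in replacement for process_token() might look something like this (untested sketch; token_store and store_token are just placeholder names):

    Code:
    /* A simple token record; 'start' points back into the original input string */ 
    struct token 
    { 
        int type; 
        const char* start; 
        int length; 
    }; 
     
    /* Fixed-size store, just to keep the sketch simple */ 
    struct token token_store[256]; 
    int token_count = 0; 
     
    /* Same signature as process_token(), but saves the token instead of printing it */ 
    void store_token(int type, const char* current, int length) 
    { 
        if(token_count < (int)(sizeof(token_store)/sizeof(token_store[0]))) 
        { 
            token_store[token_count].type = type; 
            token_store[token_count].start = current; 
            token_store[token_count].length = length; 
            ++token_count; 
        } 
    }
    The parser would then just walk token_store[] instead of calling the lexer directly.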

  5. #5
    Registered User
    Join Date
    Dec 2012
    Posts
    34
    Hi Sir Galahad -
    A very big thank-you for that code - it's very impressive! I'm sure it'll be very useful! That code has more of a "functional programming" flavour to it (and I'm a big fan of FP).
    This has given me lots to think about...
    Thanks again - bye for now -
    - Andy (latte123)

  6. #6
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by latte123 View Post
    Hi Sir Galahad -
    A very big thank-you for that code - it's very impressive! I'm sure it'll be very useful! That code has more of a "functional programming" flavour to it (and I'm a big fan of FP).
    This has given me lots to think about...
    Thanks again - bye for now -
    - Andy (latte123)
    You're quite welcome. Good luck!
