Thread: Lexing - looking at the "big picture"

  1. #1
    Registered User
    Join Date
    Dec 2012
    Posts
    34

    Lexing - looking at the "big picture"

    Hi all -

    I'm trying to get a much better idea of the "big picture" of the lexing process.

    In particular, I have noticed that in the lexers I've seen (not that many...), most of the "sub-lexer" functions (e.g. for keywords or identifiers) don't take any arguments, yet they somehow collect characters and return a token.

    It is simple enough to write sub-lexers that take a string, process a few characters and then hand the result back to the main lex function. What I have difficulty with is the "mismatch" between iterating along a string (or through a file) and producing a series (stream) of tokens.

    What I'm getting at is - how can a function that takes no parameters get characters from a string and build a token from those characters?

    What I would find really useful would be a description of a lexing/parsing "skeleton" (even in pseudo-code) that shows the "big picture".
    In *particular*, the *flow* from part to part - that's what I have problems picturing.

    For example (feel free to correct / "beef out" the pseudo-code):

    [code]

    Pseudo-code

    function - lex_kw_or_identifier
    Arguments: none
    Gets characters from ??? by using the ??? function.
    Collects those characters and returns a token.

    function - main_lexer
    Arguments: none
    Gets characters from the string/file using the ??? function.
    Depending on the character type, calls one of the sub-lexing functions.
    Returns the token received from those sub-lexing functions.

    function - parser
    Arguments: one ( token ) (or none?)
    Gets a token (from main_lexer) and uses an "expect"
    function to check validity.

    function - main()
    The main C function.
    Arguments: filename (or none if parsing a string).
    If parsing a string - iterate along it ( ??? should this be done here or elsewhere ??? )
    Call the parse function.

    *** From iterating along the string, how do I get to a sequence of tokens? ***

    [/code]

    I have difficulty "matching" the iterating-along-a-string (or through-a-file) part with the "get a sequence of tokens" part. I hope you can see what I mean there...
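    To make the question more concrete, here is roughly how I *imagine* the pieces might fit together in C - a shared "cursor" pointer that the argument-less sub-lexers all read from. All the names here (peek_char, get_next_token, etc.) are just ones I made up, and this is only my guess at a minimal skeleton:

    [code]
    #include <stdio.h>
    #include <ctype.h>

    enum tok_type { TOK_IDENT, TOK_NUMBER, TOK_EOF };

    struct token
    {
        enum tok_type type;
        const char   *start;    /* where the lexeme begins in the source text */
        int           length;
    };

    /* The shared "cursor" - my guess at how argument-less sub-lexers get characters */
    static const char *cursor;

    static int  peek_char(void) { return (unsigned char)*cursor; }
    static void skip_char(void) { ++cursor; }

    /* Sub-lexer: takes no arguments, reads through the shared cursor */
    static struct token lex_kw_or_identifier(void)
    {
        const char *start = cursor;
        while (isalnum(peek_char()) || peek_char() == '_')
            skip_char();
        /* a real lexer would compare start..cursor against a keyword table here */
        return (struct token){ TOK_IDENT, start, (int)(cursor - start) };
    }

    static struct token lex_number(void)
    {
        const char *start = cursor;
        while (isdigit(peek_char()))
            skip_char();
        return (struct token){ TOK_NUMBER, start, (int)(cursor - start) };
    }

    /* Main lexer: looks at the current character and dispatches to a sub-lexer */
    static struct token get_next_token(void)
    {
        while (isspace(peek_char()))
            skip_char();
        if (isalpha(peek_char()) || peek_char() == '_')
            return lex_kw_or_identifier();
        if (isdigit(peek_char()))
            return lex_number();
        return (struct token){ TOK_EOF, cursor, 0 };   /* '\0' (or anything unknown) ends it */
    }

    /* "Parser": here it just pulls tokens one at a time until EOF and prints them */
    static void parse(const char *source)
    {
        cursor = source;                /* initialise the shared cursor once */
        for (;;) {
            struct token t = get_next_token();
            if (t.type == TOK_EOF)
                break;
            printf("token: %.*s\n", t.length, t.start);
        }
    }

    int main(void)
    {
        parse("while count42 do stuff 99");
        return 0;
    }
    [/code]

    Is that roughly the idea - the characters come from some shared state (a global/static pointer, or a struct passed around), so the sub-lexers don't need any arguments of their own?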

    Any comments are appreciated!
    Cheers -
    Andy

  2. #2
    Registered User
    Join Date
    Jun 2017
    Posts
    157
    This article might give you some ideas.
    Syntax analysis: an example - CodeProject

  3. #3
    Registered User
    Join Date
    Dec 2012
    Posts
    34
    Quote Originally Posted by OldGuy2 View Post
    This article might give you some ideas.
    Syntax analysis: an example - CodeProject
    Thanks OldGuy2 - that article is great! Exactly what I was after!

    I apologise to all for popping in here so much in the last few days. Anyway, thanks to the help given here (and some experimenting) I now have all that I need!

    Cheers, bye for now -
    - Andy ( latte123 )

  4. #4
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Not sure if this helps, but just another way of approaching the whole lexing/tokenizing process:

    Code:
    #include <stdio.h> 
    #include <stdlib.h> 
    #include <string.h> 
     
    /* 
        Pattern matching functions; only length of matches returned 
    */ 
     
    int match_range(const char* current, char lower, char upper) 
    { 
        const char* next = current; 
        while(*next >= lower && *next <= upper) 
            ++next; 
        return next - current; 
    } 
     
    int match_group(const char* current, const char* group) 
    { 
        const char* next = current; 
        while(*next != 0 && strchr(group, *next) != 0) 
            ++next; 
        return next - current; 
    } 
     
    int match_character(const char* current, char character) 
    { 
        return *current == character ? 1 : 0; 
    } 
     
    int match_integer(const char* current) 
    { 
        int prefixed = 0; 
        if(match_character(current, '+') || match_character(current, '-')) 
            prefixed = 1; 
        int digits = match_range(current + prefixed, '0', '9'); 
        if(digits == 0)                       /* a bare '+' or '-' is not an integer */ 
            return 0; 
        int count = prefixed + digits; 
    /* 
        Assume the convention that an integer cannot adjoin an identifier... 
    */     
        int match_identifier(const char*);    // Forward declaration 
        if(match_identifier(current + count)) 
            return 0; 
        return count; 
    } 
     
    int match_identifier(const char* current) 
    { 
    /* 
        Assume the convention that an identifier cannot begin with an integer... 
    */     
        const char* next = current; 
        int count = match_range(next, 'a', 'z') + match_range(next, 'A', 'Z') + match_character(next, '_'); 
        if(count == 0) 
            return 0; 
        next += count; 
        for(;;) 
        { 
            count = match_range(next, '0', '9') + match_range(next, 'a', 'z') + match_range(next, 'A', 'Z') + match_character(next, '_'); 
            if(count == 0) 
                break; 
            next += count; 
        } 
        return next - current; 
    } 
     
    int match_space(const char* current) 
    { 
        return match_group(current, "\x20\t\r\n"); 
    } 
     
    /* 
        Token lexer stuff 
    */ 
     
    enum 
    { 
        type_integer,  
        type_identifier,  
        type_space 
    }; 
     
    const char* token_type_to_text(int type) 
    { 
        if(type == type_integer) 
            return "integer";  
        if(type == type_identifier) 
            return "identifier"; 
        if(type == type_space) 
            return "space"; 
        return "???"; 
    } 
     
    /* 
        This would normally be the function that actually moves the token  
        information into a data structure. We'll just print them for now... 
    */ 
    void process_token(int type, const char* current, int length) 
    { 
        printf("token: '"); 
        const char* next = current, * end = next + length; 
        while(next != end) 
            putchar(*next++); 
        printf("' [type: %s, length: %d]\n", token_type_to_text(type), length); 
    } 
     
    struct token_matcher 
    { 
        int type; 
        int (*match)(const char*);     
    }; 
     
    /* 
        Note: these should be ordered by decreasing "greed" 
        (eg: a matcher for '++' should come before one for '+')
    */
    struct token_matcher token_matchers[] =  
    { 
        { 
            type_integer,  
            match_integer 
        },  
        { 
            type_identifier,  
            match_identifier 
        },      
        { 
            type_space,  
            match_space 
        },  
    }; 
     
    void tokenize(const char* input) 
    { 
        const char* current = input; 
        for(;;) 
        { 
            int found = 0; 
            for(int index = 0; index < sizeof(token_matchers)/sizeof(token_matchers[0]); ++index) 
            { 
                int length = token_matchers[index].match(current); 
                if(length != 0) 
                { 
                    process_token(token_matchers[index].type, current, length); 
                    current += length; 
                    found = 1; 
                } 
            } 
            if(found == 0) 
                break; 
        } 
        if(*current != 0) 
            fprintf(stderr, "Warning: end of input not reached!\nData remaining: %s\n", current);  
    } 
     
    int main(void) 
    { 
        char buffer[1024] = "select mycol from mytable";  
        const int size = sizeof(buffer); 
        puts(buffer); 
        for(;;) 
        { 
            tokenize(buffer);  
            puts("Enter some more text to test:"); 
            if(fgets(buffer, size, stdin) == NULL)    /* stop on end-of-input */ 
                break; 
            if(buffer[0] == '\n') 
                break; 
        } 
        return 0;  
    }
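
    By the way, if you wanted to actually collect the tokens rather than just print them, a drop-in replacement for process_token() might look something like this (untested sketch; token_store and store_token are just placeholder names):

    Code:
    /* A simple token record; 'start' points back into the original input string */ 
    struct token 
    { 
        int type; 
        const char* start; 
        int length; 
    }; 
     
    /* Fixed-size store, just to keep the sketch simple */ 
    struct token token_store[256]; 
    int token_count = 0; 
     
    /* Same signature as process_token(), but saves the token instead of printing it */ 
    void store_token(int type, const char* current, int length) 
    { 
        if(token_count < (int)(sizeof(token_store)/sizeof(token_store[0]))) 
        { 
            token_store[token_count].type = type; 
            token_store[token_count].start = current; 
            token_store[token_count].length = length; 
            ++token_count; 
        } 
    }
    The parser would then just walk token_store[] instead of calling the lexer directly.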

  5. #5
    Registered User
    Join Date
    Dec 2012
    Posts
    34
    Hi Sir Galahad -
    A very big thank-you for that code - it's very impressive! I'm sure it'll be very useful! That code has more of a "functional programming" flavour to it (and I'm a big fan of FP).
    This has given me lots to think about...
    Thanks again - bye for now -
    - Andy (latte123)

  6. #6
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by latte123 View Post
    Hi Sir Galahad -
    A very big thank-you for that code - it's very impressive! I'm sure it'll be very useful! That code has more of a "functional programming" flavour to it (and I'm a big fan of FP).
    This has given me lots to think about...
    Thanks again - bye for now -
    - Andy (latte123)
    You're quite welcome. Good luck!
