Thread: Lexer token error conditions question

  1. #1
    Registered User
    Join Date
    Mar 2019
    Posts
    2

    I'm implementing a lexer for a simple programming language (ultimately an interpreter and code generator) as a project to learn C.

    My lexer is in pretty good shape and can tokenize identifiers, numbers, and operators, but one issue I am trying to figure out is how to handle error conditions while lexing the input. For example, suppose the input contains something that starts out like a number:

    237ab

    This is not a valid number because it ends with the alphabetic characters 'a' and 'b'. In this case, I have a few options:

    1. Immediately error out and let the user know of the line and position of the invalid token.
    2. Store the token anyway and let later stages (such as the parser) do the validation.
    3. Store the error, and return all errors at once at the end of lexing.

    Any ideas on which method is generally used or how error handling should be done with lexical analysis?
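
For reference, options 1 and 3 from the list above can be sketched with one routine. This is only an illustration under assumptions: `LexError`, `ErrorList`, `collect_number_errors` and the `MAX_ERRORS` cap are hypothetical names, not part of the lexer described in the post, and the only error detected is a run of digits followed directly by identifier characters.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_ERRORS 16  /* arbitrary cap chosen for this sketch */

/* Hypothetical error record: where the bad token starts and how long it is. */
typedef struct {
    int line;
    int column;
    size_t length;
} LexError;

typedef struct {
    LexError errors[MAX_ERRORS];
    int count;
} ErrorList;

static int is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Scan src and record every number that runs straight into letters,
   such as "237ab". Returns the number of errors recorded.
   stop_at_first != 0 behaves like option 1 (give up at the first error);
   stop_at_first == 0 behaves like option 3 (collect them all). */
int collect_number_errors(const char *src, ErrorList *out, int stop_at_first)
{
    int line = 1, column = 1;
    size_t i = 0;
    out->count = 0;

    while (src[i] != '\0') {
        if (src[i] == '\n') {
            line++;
            column = 1;
            i++;
        } else if (isdigit((unsigned char)src[i])) {
            size_t start = i;
            while (isdigit((unsigned char)src[i])) {
                i++;
            }
            if (is_ident_char(src[i])) {
                /* Swallow the trailing letters so they are reported once. */
                while (is_ident_char(src[i])) {
                    i++;
                }
                if (out->count < MAX_ERRORS) {
                    out->errors[out->count++] =
                        (LexError){ line, column, i - start };
                }
                if (stop_at_first) {
                    return out->count;
                }
            }
            column += (int)(i - start);
        } else if (is_ident_char(src[i])) {
            /* Skip identifiers whole so "x1" is not mistaken for a number. */
            while (is_ident_char(src[i])) {
                i++;
                column++;
            }
        } else {
            i++;
            column++;
        }
    }
    return out->count;
}
```

Called on "x = 237ab\ny = 12\nz = 9q" with stop_at_first set to 0, this records two errors, at line 1 column 5 and line 3 column 5. With stop_at_first set to 1 it returns after the first one.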

  2. #2
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Stop and error.

    To continue, you need a mechanism to 'repair' the input in some way, either by adding or removing tokens.

    Say, for example, you turn 237ab into 237 ab: parse 237 as a number and leave ab for the next token.
    But perhaps the better fix would have been to delete 'ab' from the stream and carry on.

    If you get it wrong, you risk pages of phantom errors because you chose the wrong fix-up.
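
The two repairs mentioned above can be put side by side in a small sketch. The names `RepairMode`, `REPAIR_SPLIT`, `REPAIR_DROP` and `repair_number` are made up for this illustration. The point is only that the two choices leave the lexer at different positions, so everything after the bad token is tokenized differently.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical repair strategies for a number that runs into letters. */
typedef enum {
    REPAIR_SPLIT, /* "237ab" -> NUMBER 237, then "ab" is lexed as the next token */
    REPAIR_DROP   /* "237ab" -> NUMBER 237, and "ab" is deleted from the stream */
} RepairMode;

/* Returns how many characters the lexer should consume for the token
   starting at s. *digits receives the length of the numeric prefix,
   which is the part that becomes the NUMBER token in both modes. */
size_t repair_number(const char *s, RepairMode mode, size_t *digits)
{
    size_t n = 0;

    while (isdigit((unsigned char)s[n])) {
        n++;
    }
    *digits = n;

    if (mode == REPAIR_DROP) {
        /* Also consume the trailing identifier characters. */
        while (isalnum((unsigned char)s[n]) || s[n] == '_') {
            n++;
        }
    }
    return n;
}
```

For the input "237ab + 1", REPAIR_SPLIT consumes 3 characters and the parser later sees NUMBER IDENTIFIER + NUMBER, while REPAIR_DROP consumes 5 and the parser sees NUMBER + NUMBER. Only one of those can match what the user meant, and a wrong guess is where the phantom errors come from.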

  3. #3
    Registered User
    Join Date
    Mar 2019
    Posts
    2
    The way I've written the state machines, scanning stops once an indeterminate state is reached. For numbers, if the next character is not a digit, the machine stops and the lexer looks for the next token. In this case '237' would be the number token. The lexer then looks at the next character, in this case 'a', which is alphabetic, so it starts the identifier FSM, which recognizes 'ab' as an identifier.

    The parser would then receive two tokens, NUMBER (237) and IDENTIFIER (ab). Once the parser is written, it will know the grammar of the language, and in my language a number followed by an identifier is invalid. So in that case I could use the parser to report a grammatical error.
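
A minimal sketch of that behaviour, assuming a pointer-based cursor and made-up names (`TokenKind`, `Token`, `next_token`, `has_number_then_identifier`); the real lexer in question is not shown in the thread, so this is a guess at its shape rather than a copy of it.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_NUMBER, TOK_IDENTIFIER, TOK_OTHER, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;
    size_t length;
} Token;

/* Read one token starting at *cursor and advance *cursor past it.
   Each branch is a tiny state machine that stops at the first
   character it cannot accept. */
Token next_token(const char **cursor)
{
    const char *p = *cursor;
    Token tok;

    while (*p == ' ' || *p == '\t' || *p == '\n') {
        p++;
    }
    tok.kind = TOK_EOF;
    tok.start = p;
    tok.length = 0;

    if (*p == '\0') {
        *cursor = p;
        return tok;
    }
    if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p)) {
            p++;
        }
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_IDENTIFIER;
        while (isalnum((unsigned char)*p) || *p == '_') {
            p++;
        }
    } else {
        tok.kind = TOK_OTHER; /* single-character operator or punctuation */
        p++;
    }
    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}

/* Stand-in for the parser rule: returns 1 if a NUMBER token is
   immediately followed by an IDENTIFIER token, 0 otherwise. */
int has_number_then_identifier(const char *src)
{
    TokenKind prev = TOK_EOF;

    for (;;) {
        Token t = next_token(&src);
        if (t.kind == TOK_EOF) {
            return 0;
        }
        if (prev == TOK_NUMBER && t.kind == TOK_IDENTIFIER) {
            return 1;
        }
        prev = t.kind;
    }
}
```

One consequence of this design is that the parser cannot tell "237ab" from "237 ab": both arrive as NUMBER followed by IDENTIFIER, so the error message can only say that an identifier was unexpected, not that the number itself was malformed.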

    Perhaps it's still better to make the lexer a bit more complex so that it identifies errors earlier, as you mentioned.
