Lexer token error conditions question

**holyghost** · 03-12-2019

I'm implementing a lexer for a simple programming language (ultimately an interpreter and code generator) as a project to learn C.

I've got my lexer to a pretty good state and it can recognize identifiers, numbers, and operators, but one issue I am trying to figure out is how to handle error conditions while 'lexing' the input. For example, if I have a token for a number:

237ab

This is not a valid number because it ends with the alphabetic characters 'a' and 'b'. In this case, I have a few different options:

1. Immediately error out and let the user know of the line and position of the invalid token.
2. Store the token anyway and allow later steps (such as the parser) to actually do the validation.
3. Store the error, and return all errors at once at the end of lexing.

Any ideas on which method is generally used or how error handling should be done with lexical analysis?

**Salem** · 03-12-2019

Stop and error.

To continue means you need a mechanism to 'repair' the input in some way, either by adding or removing tokens.

Say, for example, you turn 237ab into 237 ab, parse 237 as a number, and then leave ab to the next step.
But perhaps the better fix would have been to just delete 'ab' from the stream and then carry on.

If you get it wrong, you risk pages of phantom errors because you chose the wrong fix-up.
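
As a rough illustration of the "stop and error" approach, here is a minimal sketch in C. The `LexError` struct and the `lex_number` function are hypothetical names, not anything from the original poster's code, and the sketch assumes `src` holds a single line so that the offset doubles as the column.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical error record; the names are illustrative only. */
typedef struct {
    int line;
    int column;          /* 1-based column of the offending character */
    char message[64];
} LexError;

/* Scans a number starting at src[*pos].
 * Returns 1 on success and advances *pos past the digits.
 * Returns 0 and fills *err (leaving *pos untouched) if there are no digits,
 * or if the digits are immediately followed by a letter or underscore,
 * as in "237ab". Assumes src is one line of input. */
static int lex_number(const char *src, size_t *pos, int line, LexError *err)
{
    size_t start = *pos;
    size_t i = start;

    while (isdigit((unsigned char)src[i]))
        i++;

    if (i == start) {
        err->line = line;
        err->column = (int)start + 1;
        snprintf(err->message, sizeof err->message, "expected a digit");
        return 0;
    }

    if (isalpha((unsigned char)src[i]) || src[i] == '_') {
        err->line = line;
        err->column = (int)i + 1;
        snprintf(err->message, sizeof err->message,
                 "invalid character '%c' in number", src[i]);
        return 0;
    }

    *pos = i;
    return 1;
}
```

The caller would print `err.line`, `err.column`, and `err.message` and stop lexing on the first failure, so no repair strategy is needed.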

**holyghost** · 03-13-2019

The way I've written the state machines, scanning a token stops as soon as an indeterminate state is found. For numbers, if the next character isn't a digit, the number token ends and the lexer looks for the next token. In this case '237' would be the number token. The lexer then reads the next character, in this case 'a', which is alphabetic, so it starts the identifier FSM and recognizes 'ab' as an identifier.

The parser would then receive two tokens, NUMBER (237) and IDENTIFIER (ab). Once the parser is written, it will know the correct grammar for the language, and in my language a number followed by an identifier is invalid. So in that case I could rely on the parser to report a grammatical error.
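
The behaviour described here can be sketched roughly as follows. This is not the poster's actual code; `TokenKind`, `Token`, and `next_token` are assumed names, and the character classes (digits for numbers, letters, digits, and underscores for identifiers) are guesses at the language's rules.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types for illustration. */
typedef enum { TOK_NUMBER, TOK_IDENTIFIER, TOK_UNKNOWN, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;   /* points into the source buffer */
    size_t length;
} Token;

/* Returns the next token and advances *cursor past it.
 * Each token ends at the first character that does not fit its class,
 * so "237ab" comes out as NUMBER("237") followed by IDENTIFIER("ab"). */
static Token next_token(const char **cursor)
{
    const char *p = *cursor;
    Token tok;

    while (isspace((unsigned char)*p))
        p++;

    tok.start = p;
    if (*p == '\0') {
        tok.kind = TOK_EOF;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p))
            p++;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_IDENTIFIER;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    } else {
        tok.kind = TOK_UNKNOWN;   /* single unrecognized character */
        p++;
    }

    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}
```

With this design the lexer never reports an error for `237ab`; it is left to the grammar to reject a NUMBER immediately followed by an IDENTIFIER.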

Perhaps it's still better to make the lexer a bit more complex so it identifies errors earlier, as you mentioned.
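
One way the extra check might look is sketched below. `ScanResult` and `scan_number` are hypothetical names, and treating letters and underscores as the characters that invalidate a number is an assumption about the language. The scanner swallows the whole malformed run so that the caller can report `237ab` as a single bad lexeme.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result codes for illustration. */
typedef enum { SCAN_NUMBER, SCAN_INVALID_NUMBER } ScanResult;

/* Scans a run that starts with a digit at src and stores its length
 * in *length. If the digits are directly followed by identifier
 * characters, the whole run (e.g. "237ab") is consumed and flagged
 * as invalid instead of being split into two tokens. */
static ScanResult scan_number(const char *src, size_t *length)
{
    const char *p = src;
    ScanResult result = SCAN_NUMBER;

    while (isdigit((unsigned char)*p))
        p++;

    if (isalpha((unsigned char)*p) || *p == '_') {
        result = SCAN_INVALID_NUMBER;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    }

    *length = (size_t)(p - src);
    return result;
}
```

On `SCAN_INVALID_NUMBER` the lexer can stop immediately and report the position, which keeps the parser from ever seeing the misleading NUMBER and IDENTIFIER pair.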
