logos: Allow lexers to provide more user-friendly errors

At the moment errors are a single, opaque token. Is there a way to have more user-friendly, structured error values, with error recovery, instead of just returning Token::Error? It would be really nice to have an example of this too!

About this issue

State: closed
Created 4 years ago
Reactions: 9
Comments: 19 (11 by maintainers)

Most upvoted comments

I’d also love to see this. Just letting the error variant be able to store something like an enum would be great.

We could use the Err variant of a Result from a parsing function and just store that in the error.

Example:

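use logos::{Lexer, Logos};

// Proposed syntax: `#[error]` on a variant that carries a payload.
#[derive(Logos)]
enum Parts<'a> {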
    #[regex(".+", handle_text)]
    Text(&'a str),

    #[error]
    Error(ParsingError<'a>),
}

enum ParsingError<'a> {
    InvalidCharacter(&'a str)
}

fn handle_text<'a>(lex: &mut Lexer<Parts<'a>>) -> Result<&'a str, ParsingError<'a>> {
    let text = lex.slice();
    if text.contains("§") {
        return Err(ParsingError::InvalidCharacter("The character § can't be in the text!"));
    }

    Ok(text)
}

The lifetimes may be a bit messed up, but I think you can understand what it is supposed to do.

nnt0 on Apr 20, 2020
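The idea in the comment above can be sketched without Logos at all. The following is a minimal, hand-written stand-in, not the Logos API: `lex` is a hypothetical driver that treats each line as one match (roughly what `.+` would do), and `handle_text` takes the matched slice directly instead of a `Lexer`. Unlike the original example, the error here carries the offending slice of the input rather than a fixed message, which is an assumption made for illustration.

```rust
/// Hypothetical structured error, mirroring `ParsingError` from the comment above.
#[derive(Debug, PartialEq)]
enum ParsingError<'a> {
    /// Carries the offending slice of the input.
    InvalidCharacter(&'a str),
}

#[derive(Debug, PartialEq)]
enum Parts<'a> {
    Text(&'a str),
    Error(ParsingError<'a>),
}

/// Stand-in for a lexer callback: validates one matched slice.
fn handle_text(slice: &str) -> Result<&str, ParsingError<'_>> {
    match slice.find('§') {
        // Borrow exactly the bytes of the offending character.
        Some(i) => Err(ParsingError::InvalidCharacter(&slice[i..i + '§'.len_utf8()])),
        None => Ok(slice),
    }
}

/// Hand-written driver (not Logos): one token per line.
/// An `Err` from the callback is stored in the error variant instead of being dropped.
fn lex(input: &str) -> Vec<Parts<'_>> {
    input
        .lines()
        .map(|line| match handle_text(line) {
            Ok(text) => Parts::Text(text),
            Err(e) => Parts::Error(e),
        })
        .collect()
}
```

Because the error is just another variant, lexing continues past a bad line, and the consumer can decide later how to report it.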

One question this raises: is there any thought of expanding #[error] to accept a callback like #[error(|lex|...)], and could that callback be accessed from lex within handle_text above?

That would at least seem somewhat consistent with how regex Tokens manage construction with parameters.

ratmice on Apr 21, 2020

Ah, gotcha, so what you are missing is which token Logos was expecting to produce when it failed.

I’ll have to think how to do that best, especially now that stateful enums have landed.

Speaking of, with 0.11 you can implement Logos directly on your Token<'a>: https://github.com/maciejhirsz/logos/blob/4005d707e4b79dbf73b29bf572008bd81551a6dd/tests/tests/callbacks.rs#L8-L14

And there is also a spanned iterator now, which should also play nicer with what you are doing there: https://github.com/maciejhirsz/logos/blob/4005d707e4b79dbf73b29bf572008bd81551a6dd/tests/tests/simple.rs#L228-L235

maciejhirsz on Apr 7, 2020

Following up on the discussion from #135 and expanding on the suggestion above, it might be useful to have an error callback for regexes that returns an error type:

#[derive(Logos)]
enum Token {
    #[regex("some_complex_regex", error = |_| MyErrorType::InvalidComplexMatch)]
    ComplexMatch,

    #[error(|_| MyErrorType::default())]
    Error(MyErrorType),
}

maciejhirsz on Apr 26, 2020

> Lexers, or parsers, especially for PLs, tend to be exempt from the “fail fast” philosophy since you usually want to accumulate errors and report them in batches to the user.

Yeah, this is why I mentioned ‘with error recovery’ - I didn’t see this mentioned in the docs. If Logos supports this then that’s great!

I more want better information as to why an error occurred, so that I can produce nice, user-friendly errors using my library, codespan-reporting.

At the moment Pikelet has a bunch of lexer errors on the master branch: https://github.com/pikelet-lang/pikelet/blob/782a8853cbaf0c50ec668ef55df798e30cea0ef6/crates/pikelet-concrete/src/parse/lexer.rs#L43-L69

In contrast, at the moment this is what I am forced to output on the next branch (which uses Logos): https://github.com/pikelet-lang/pikelet/blob/0568c6ada6cc1937774cac15372f3acde40793c9/pikelet/src/surface/lexer.rs#L174

brendanzab on Apr 7, 2020
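To make the request above concrete, here is a minimal sketch of what a structured error plus a byte span allows a consumer to do. Everything in it is hypothetical: `LexError`, its variants, and `render` are illustrative names, not taken from Pikelet, Logos, or codespan-reporting, and the output format is far simpler than what a real diagnostics library produces. It assumes `span.start` falls on a character boundary.

```rust
use std::ops::Range;

/// Hypothetical error kinds; not taken from Pikelet or Logos.
#[derive(Debug, Clone, PartialEq)]
enum LexError {
    UnexpectedCharacter(char),
    UnterminatedString,
}

impl LexError {
    /// Human-readable description of why lexing failed.
    fn message(&self) -> String {
        match self {
            LexError::UnexpectedCharacter(c) => format!("unexpected character `{}`", c),
            LexError::UnterminatedString => "unterminated string literal".to_string(),
        }
    }
}

/// Renders a one-line diagnostic with a 1-based line and column,
/// computed from the start of the byte span.
fn render(source: &str, span: Range<usize>, error: &LexError) -> String {
    let before = &source[..span.start];
    let line = before.matches('\n').count() + 1;
    // Byte offset of the first character on the error's line.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = source[line_start..span.start].chars().count() + 1;
    format!("error at {}:{}: {}", line, column, error.message())
}
```

With an opaque error token, only the span is available, so the message part of such a diagnostic has to be generic; a structured error value is what lets the message say why lexing failed.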

What sort of interface would you like to see?

You can use range (maybe we should rename it to span?) to get the position of the error, and slice to get the exact subslice that produced the error, as with every other token. Errors are guaranteed to be at least 1 byte long so that the lexer never produces an infinite loop (advance always, well, advances).

Lexers, or parsers, especially for PLs, tend to be exempt from the “fail fast” philosophy since you usually want to accumulate errors and report them in batches to the user. Having the error be just another token has the nice benefit that you can treat errors like any other unexpected token when writing a parser that produces an AST, which makes it much easier to write, and it tends to have a nice performance profile since you don’t have to do special-case branching for error handling.

maciejhirsz on Apr 4, 2020
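The two properties described in the comment above (errors are ordinary tokens, and every error consumes at least one byte so the lexer always advances) can be illustrated with a small hand-written lexer. This is a sketch under stated assumptions, not how Logos is implemented: the `Token` set, `lex`, and `error_slices` are hypothetical names chosen for the example.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Number,
    Plus,
    /// Errors are just another token kind.
    Error,
}

/// Hand-written lexer (not Logos) returning tokens with their byte spans.
fn lex(source: &str) -> Vec<(Token, Range<usize>)> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < source.len() {
        let rest = &source[pos..];
        // `rest` is non-empty here, so there is always a next character.
        let c = rest.chars().next().unwrap();
        if c.is_whitespace() {
            pos += c.len_utf8();
            continue;
        }
        let (token, len) = if c.is_ascii_digit() {
            // ASCII digits are one byte each, so the count is a byte length.
            let len = rest.chars().take_while(|c| c.is_ascii_digit()).count();
            (Token::Number, len)
        } else if c == '+' {
            (Token::Plus, 1)
        } else {
            // An error always consumes one whole character (at least 1 byte),
            // so the loop is guaranteed to make progress.
            (Token::Error, c.len_utf8())
        };
        tokens.push((token, pos..pos + len));
        pos += len;
    }
    tokens
}

/// Accumulates every erroneous slice in one pass instead of failing fast.
fn error_slices(source: &str) -> Vec<&str> {
    lex(source)
        .into_iter()
        .filter(|(token, _)| *token == Token::Error)
        .map(|(_, span)| &source[span])
        .collect()
}
```

A parser built on top of this can match on `Token::Error` like any other unexpected token, and the span and slice give it the position and text to report.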