logos: Allow lexers to provide more user-friendly errors
At the moment errors are a single, opaque token. Is there a way to have more user-friendly, structured error values, with error recovery, instead of just returning `Token::Error`? It would be really nice to have an example of this too!
About this issue
- State: closed
- Created 4 years ago
- Reactions: 9
- Comments: 19 (11 by maintainers)
I’d also love to see this. Just letting the error variant be able to store something like an enum would be great.
We could use the `Err` variant of a `Result` from a parsing function and just store that in the error. Example:
The lifetimes may be messed up a bit, but I think you can understand what it is supposed to do.
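To make the idea concrete, here is a minimal sketch of an error variant that stores the `Err` value of a parsing function. It is a hand-written lexer using only the standard library, not Logos' API: `LexError`, `parse_integer`, and `tokenize` are hypothetical names chosen for illustration, and how Logos itself would expose this is exactly the open question in this thread.

```rust
use std::num::ParseIntError;

/// Hypothetical structured error type (an assumption, not part of Logos).
#[derive(Debug, Clone, PartialEq)]
enum LexError {
    /// The digits matched, but the value could not be parsed (e.g. overflow).
    InvalidInteger(ParseIntError),
    /// No rule matched this character.
    UnexpectedChar(char),
}

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Integer(u8),
    Plus,
    /// The error variant carries a structured value instead of being opaque.
    Error(LexError),
}

/// Parsing function whose `Err` variant ends up inside `Token::Error`.
fn parse_integer(text: &str) -> Result<u8, LexError> {
    text.parse::<u8>().map_err(LexError::InvalidInteger)
}

/// Hand-rolled tokenizer: it keeps going after an error (simple recovery).
fn tokenize(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = source.char_indices().peekable();
    while let Some(&(start, ch)) = chars.peek() {
        if ch.is_whitespace() {
            chars.next();
        } else if ch.is_ascii_digit() {
            // Consume the whole run of digits, then try to parse it.
            let mut end = start;
            while let Some(&(i, c)) = chars.peek() {
                if !c.is_ascii_digit() {
                    break;
                }
                end = i + c.len_utf8();
                chars.next();
            }
            tokens.push(match parse_integer(&source[start..end]) {
                Ok(value) => Token::Integer(value),
                Err(error) => Token::Error(error),
            });
        } else {
            chars.next();
            tokens.push(if ch == '+' {
                Token::Plus
            } else {
                Token::Error(LexError::UnexpectedChar(ch))
            });
        }
    }
    tokens
}
```

With this sketch, `tokenize("1 + 999 ?")` yields an integer, a plus, an `InvalidInteger` error (999 does not fit in a `u8`), and an `UnexpectedChar('?')` error, so a caller can tell the two failure modes apart.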
One question which this brings up is whether there is a thought to expand `#[error]` to accept a callback like `#[error(|lex|...)]` and if that callback could be accessed from `lex` within `handle_text` above? That would at least seem somewhat consistent with how `regex` Tokens manage construction with parameters.
Ah, gotcha, so what you are missing is what was the token Logos was expecting to produce but failed.
I’ll have to think how to do that best, especially now that stateful enums have landed.
Speaking of, with 0.11 you can implement `Logos` directly on your `Token<'a>`: https://github.com/maciejhirsz/logos/blob/4005d707e4b79dbf73b29bf572008bd81551a6dd/tests/tests/callbacks.rs#L8-L14
And there is also a spanned iterator now, which should also play nicer with what you are doing there: https://github.com/maciejhirsz/logos/blob/4005d707e4b79dbf73b29bf572008bd81551a6dd/tests/tests/simple.rs#L228-L235
Following up on the discussion from #135, and expanding on the suggestion above, it might be useful to have an error callback for regexes that returns an error type:
Yeah, this is why I mentioned ‘with error recovery’ - I didn’t see this mentioned in the docs. If Logos supports this then that’s great!
I mainly want better information as to why an error occurred, so that I can produce nice, user-friendly errors using my library, `codespan-reporting`.
At the moment Pikelet has a bunch of lexer errors on the master branch: https://github.com/pikelet-lang/pikelet/blob/782a8853cbaf0c50ec668ef55df798e30cea0ef6/crates/pikelet-concrete/src/parse/lexer.rs#L43-L69
In contrast, at the moment this is what I am forced to output on the next branch (which uses Logos): https://github.com/pikelet-lang/pikelet/blob/0568c6ada6cc1937774cac15372f3acde40793c9/pikelet/src/surface/lexer.rs#L174
What sort of interface would you like to see?
You can use `range` (maybe we should rename it to `span`?) to get the position of the error, and `slice` to get the exact subslice that produced the error, as with every other token. Errors are guaranteed to be at least 1 byte long so that the lexer never produces an infinite loop (`advance` always, well, advances).
Lexers, or parsers, especially for PLs, tend to be exempt from the “fail fast” philosophy since you usually want to accumulate errors and report them in batches to the user. Having error be just another token has the nice benefit in that you can treat errors just as any other unexpected token when writing a parser to AST, which makes it much easier to write, and it tends to have a nice performance profile since you don’t have to do special case branching for error handling.
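The following is a rough sketch of that design, written against the standard library only so it does not depend on any particular Logos version. The `Lexer` struct, `run_len`, and `collect_errors` are hypothetical stand-ins, not Logos' generated code: the error is an ordinary token with a span, it always consumes at least one character, and the caller accumulates all errors before reporting them.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Ident,
    Number,
    /// Errors are just another token kind.
    Error,
}

/// Hand-written stand-in for a generated lexer (illustrative only).
struct Lexer<'s> {
    source: &'s str,
    pos: usize,
}

/// Byte length of the leading run of characters that satisfy `pred`.
fn run_len(text: &str, pred: impl Fn(char) -> bool) -> usize {
    text.find(|c: char| !pred(c)).unwrap_or(text.len())
}

impl<'s> Iterator for Lexer<'s> {
    /// Each item pairs a token with its byte span in the source.
    type Item = (Token, Range<usize>);

    fn next(&mut self) -> Option<Self::Item> {
        // Skip whitespace between tokens.
        let rest = &self.source[self.pos..];
        self.pos += rest.len() - rest.trim_start().len();
        let rest = &self.source[self.pos..];
        let first = rest.chars().next()?;

        let start = self.pos;
        let (token, len) = if first.is_ascii_alphabetic() {
            (Token::Ident, run_len(rest, |c| c.is_ascii_alphanumeric()))
        } else if first.is_ascii_digit() {
            (Token::Number, run_len(rest, |c| c.is_ascii_digit()))
        } else {
            // An error still consumes one whole character (at least 1 byte),
            // so the lexer always advances and cannot loop forever.
            (Token::Error, first.len_utf8())
        };
        self.pos += len;
        Some((token, start..self.pos))
    }
}

/// Accumulate every error instead of failing on the first one.
fn collect_errors(source: &str) -> Vec<String> {
    Lexer { source, pos: 0 }
        .filter(|(token, _)| *token == Token::Error)
        .map(|(_, span)| {
            // The span gives the position; slicing gives the offending text.
            let slice = &source[span.clone()];
            format!("unexpected {:?} at {}..{}", slice, span.start, span.end)
        })
        .collect()
}
```

Because the error is a regular token, a parser consuming this iterator can handle it in the same `match` as any other unexpected token, while a diagnostics pass such as `collect_errors` reports everything in one batch.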