tree-sitter: Helpful parser error messages
I can’t find a way to surface parser errors to the users of my text editor. I’m working on a VSCode extension that uses tree-sitter to parse TypeScript files and report parsing errors. So far, tree-sitter is much faster than the TypeScript language service, but I can’t work out how to get a useful error message out of it.
In particular, I’d like something like the "Parsing error: ';' expected." message that eslint gives me. I figure that tree-sitter has some sort of context on the active branches in the parser, so it should be able to surface that context in the error object.
If this feature already exists, and I’m just missing something, perhaps it should be included in the documentation site, as I couldn’t find anything there.
About this issue
- State: open
- Created 6 years ago
- Reactions: 40
- Comments: 18 (9 by maintainers)
Commits related to this issue
- Add diagnostics based on alanz's implementation See https://github.com/tree-sitter/tree-sitter/issues/255#issuecomment-786268102 — committed to silvanshade/tree-sitter by deleted user 3 years ago
- Add diagnostics to rust and web bindings See https://github.com/tree-sitter/tree-sitter/issues/255#issuecomment-786268102 — committed to silvanshade/tree-sitter by deleted user 3 years ago
I made something that seems to work, at https://github.com/alanz/tree-sitter/commit/28f835275186e6489df2e2ee50aff10ed947f5fe
I use it with https://github.com/alanz/haskell-tree-sitter/tree/ts_tree_edit and https://github.com/alanz/semantic/tree/report-diagnostics, in a proof of concept that shows the diagnostics in a language server.
One workable approach could be to store the parser state (or states) in MISSING nodes, and to augment the table generator with information such as the named nonterminals and terminals that are valid to shift in that state. Together with inspecting the node’s parent, I think that would be enough information to produce good errors.
New to the project, however, so hard for me to determine how feasible that idea is. Should I take a crack at an implementation, or would I waste my time?
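The state-to-valid-symbols idea above can be sketched in a few lines. This is a toy illustration only: the `VALID_SYMBOLS` table, the state ids, and `expected_message` are all hypothetical names, and tree-sitter does not expose its parse table in this form today.

```python
from typing import Dict, List

# Hypothetical data: what an augmented table generator might emit, mapping a
# parser state id to the symbols that are valid to shift in that state.
VALID_SYMBOLS: Dict[int, List[str]] = {
    7: ["','", "']'"],
    12: ["';'"],
}


def expected_message(state: int, found: str) -> str:
    """Build an "expected X but found Y" message for a recorded parser state."""
    expected = VALID_SYMBOLS.get(state)
    if not expected:
        # No table entry for this state: fall back to a generic message.
        return f"Unexpected {found}"
    return f"Expected {' or '.join(expected)} but found {found}"
```

If a MISSING node carried the state id, a reporter could call `expected_message(node_state, found_token)` while walking the tree.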
I am currently working on a language implementation and I would like to replace my current parser with a tree-sitter parser, but the error messages are going to be tough without contextual information about why the error occurred, so I am looking forward to this feature.
Tree-sitter doesn’t currently have this feature, but I agree that it would be a good feature to add.
Right now, Tree-sitter just tries to repair the error so that it can give you back a syntax tree that works for the rest of the file.
It’s totally possible to use the parse table to generate a message like "Expected tokens: ',' ']'" or something.
Just to save you some trouble, the branch you reference doesn’t really provide information that is suitable for presenting a list of errors like this to the user.
From what I remember, you get a list of ERROR (or related) nodes, but what you really need is information about why those nodes occurred, such as what the parser was expecting versus what it encountered. But tree-sitter still does not provide a way to access this information from the parsing state.
If you want to be able to present this information to users of your parser, you will likely need to have a separate syntax validation phase that walks the tree-sitter tree and computes this information. Unfortunately this means you essentially have to write a parser twice, in a sense, because you’re just duplicating the same structure you already defined through the tree-sitter rules. But that’s about the best you can manage right now.
Yep, I can confirm. After some time searching online, I realized the same thing. I am creating a small language for my compilers class at uni, and I need to do type checking as well as show parsing errors. For type checking, I am traversing the tree, generating symbol tables so I can look up types for identifiers and check expression values. This is tedious but straightforward. I declare fields explicitly and group grammar rules, such as expressions (values, logical, relational, arithmetic) and statements (do-while, if, etc.). Since my language is pretty simple, I use a switch for each group and call the function that traverses nodes of that group to yield values back. This way, I can generate symbol tables and perform type checking when needed for each case.
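A minimal sketch of the per-group dispatch described above, assuming a simplified tree of plain dicts rather than real tree-sitter nodes. The node kinds (`declaration`, `while`, `arithmetic`, and so on) and the field names are hypothetical and do not come from the poster’s grammar.

```python
from typing import Dict, List


def check_expression(node: dict, symbols: Dict[str, str], errors: List[str]) -> str:
    """Return the type of an expression node, recording any type errors."""
    kind = node["type"]
    if kind == "number":
        return "int"
    if kind == "string":
        return "string"
    if kind == "identifier":
        name = node["name"]
        if name not in symbols:
            errors.append(f"undeclared identifier '{name}'")
            return "unknown"
        return symbols[name]
    if kind == "arithmetic":
        left = check_expression(node["left"], symbols, errors)
        right = check_expression(node["right"], symbols, errors)
        # Skip the mismatch report when an operand already failed to resolve.
        if "unknown" not in (left, right) and (left, right) != ("int", "int"):
            errors.append(f"arithmetic on {left} and {right}")
        return "int"
    errors.append(f"unhandled expression kind '{kind}'")
    return "unknown"


def check_statement(node: dict, symbols: Dict[str, str], errors: List[str]) -> None:
    """Dispatch on the statement kind, updating the symbol table as it goes."""
    kind = node["type"]
    if kind == "declaration":
        symbols[node["name"]] = check_expression(node["value"], symbols, errors)
    elif kind == "while":
        check_expression(node["condition"], symbols, errors)
        for stmt in node["body"]:
            check_statement(stmt, symbols, errors)
    else:
        errors.append(f"unhandled statement kind '{kind}'")
```

With real tree-sitter nodes, the dict lookups would become field lookups on the node, but the shape of the dispatch stays the same.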
It seems that way to a certain extent. I think tree-sitter could make it easier to get the parsing errors and similar things, but to be honest, it’s quite easy to do, especially in Python:
an example:
It’s weird how tree-sitter generates parsing errors. For example, in the last one, as can be seen in the tree, it assumed that inside the while() there should have been a string, because it’s the first rule under the value rule, which, in turn, is the first rule under the expression rule (see attached grammar.js). I don’t know, this seems counterintuitive. I would assume in these cases that it would say that the missing rule is an expression (as seen in the AST, it is what it should be), but it went on to assume that if there was something there, one of the rules that might fit is a string.
I understand the premise of tree-sitter trying to parse a file and be able to recover from missing rules or unfitting tokens. But in my opinion, it assumes too much. Maybe I don’t fully understand how tree-sitter works, but I think it could just assume that there’s an expression and keep going. It assumed up to 2 more rules, which was completely unnecessary. Again, I am a newbie, and maybe my grammar is not well defined. Here is the link to the grammar.js file.
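One possible way to report "missing expression" rather than "missing string" is to climb from the MISSING node to its outermost zero-width ancestor. The sketch below assumes that the wrapper nodes around a MISSING node are themselves zero-width, which has not been checked against real parser output. `StubNode` is a hypothetical stand-in that exposes only the attributes used here (`type`, `start_byte`, `end_byte`, `parent`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubNode:
    """Stand-in for a syntax node; only the attributes used below."""
    type: str
    start_byte: int
    end_byte: int
    parent: Optional["StubNode"] = None


def outermost_missing_rule(node: StubNode) -> str:
    """Climb from a MISSING node while the parent is also zero-width."""
    current = node
    while (
        current.parent is not None
        and current.parent.start_byte == current.parent.end_byte
    ):
        current = current.parent
    return current.type
```

For a chain like while_statement > expression > value > string, where only while_statement has a non-empty byte range, this returns the expression rule instead of the innermost guess.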
I get the impression that tree-sitter doesn’t really try to help you do this at the moment, down to the line/column values being 0-based rather than 1-based.
After looking at the linked linter, it appears that the only real way to do this right now would be to have a grammar that documents all the failure cases and injects error nodes into your tree. Parse the document, then run a query to look up all the error nodes and report them (remember to add 1 to the line and column).
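A minimal sketch of that reporting pass, written as a plain tree walk rather than a query. It relies only on the `type`, `is_missing`, `has_error`, `children`, and `start_point` attributes, which the Python bindings expose on nodes. `StubNode` is a hypothetical stand-in so the example runs without a compiled grammar, and the message wording is an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class StubNode:
    """Stand-in for a syntax node; only the attributes used below."""
    type: str
    start_point: Tuple[int, int]  # (row, column), both 0-based
    is_missing: bool = False
    has_error: bool = False
    children: List["StubNode"] = field(default_factory=list)


def iter_diagnostics(node) -> Iterator[Tuple[int, int, str]]:
    """Yield (line, column, message) tuples with 1-based positions."""
    row, col = node.start_point
    if node.is_missing:
        yield (row + 1, col + 1, f"missing {node.type}")
        return
    if node.type == "ERROR":
        # Report the ERROR node itself and do not descend into it.
        yield (row + 1, col + 1, "syntax error")
        return
    if not node.has_error:
        return  # prune subtrees that contain no errors
    for child in node.children:
        yield from iter_diagnostics(child)
```

With a real parse, `list(iter_diagnostics(tree.root_node))` would produce the list to hand to an editor’s diagnostics interface.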
The main thing missing is a way to provide the parser some kind of callback that gets information about errors so that you can use your own mechanism for reporting errors.
Correct, in fact @aerijo put together a linter-tree-sitter package for Atom that takes advantage of this to show Tree-sitter parse errors in the standard linting interface: https://github.com/Aerijo/linter-tree-sitter/blob/master/lib/linter.js
API for this is already on the master branch.
This would be awesome, as I kinda expected that to already work.
Might also be very helpful for debugging.