tree-sitter: Helpful parser error messages

I can’t figure out how to surface parser errors to the users of my text editor. Specifically, I’m working on a VS Code extension that uses tree-sitter to parse TypeScript files and report parsing errors. So far tree-sitter is much faster than the TypeScript language service, but I can’t find a way to produce a useful error message for the user.

In particular, I’d like something like the “Parsing error: ';' expected.” message that ESLint gives me. I figure that tree-sitter has some context about the active branches in the parser, so it should be able to expose that context in the error object.

If this feature already exists, and I’m just missing something, perhaps it should be included in the documentation site, as I couldn’t find anything there.

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 40
  • Comments: 18 (9 by maintainers)

Most upvoted comments

I made something that seems to work, at https://github.com/alanz/tree-sitter/commit/28f835275186e6489df2e2ee50aff10ed947f5fe

I use it with https://github.com/alanz/haskell-tree-sitter/tree/ts_tree_edit and https://github.com/alanz/semantic/tree/report-diagnostics, in a proof of concept that shows up the diagnostics in a language server.

One workable approach could be to store the parser state (or states) in MISSING nodes and to augment the table generator with information such as the named nonterminals and terminals that are valid to shift in each state. Together with inspecting the node’s parent, I think that would be enough information to produce good errors.

I’m new to the project, however, so it’s hard for me to judge how feasible that idea is. Should I take a crack at an implementation, or would I be wasting my time?

I am currently working on a language implementation and I would like to replace my current parser with a tree-sitter parser, but the error messages are going to be tough without contextual information about why the error occurred, so I am looking forward to this feature.

Tree-sitter doesn’t currently have this feature, but I agree that it would be a good feature to add.

Right now, Tree-sitter just tries to repair the error so that it can give you back a syntax tree that works for the rest of the file.

It’s totally possible to use the parse table to generate a message like “Expected tokens: ',' ']'” or something similar.
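As a rough illustration of that idea, here is a minimal sketch of turning a parse-table lookup into such a message. The `EXPECTED_BY_STATE` table, its state numbers, and the `expected_tokens_message` helper are all hypothetical: tree-sitter does not expose this mapping in the API discussed in this thread, so the sketch only shows what the message construction could look like if it did.

```python
from typing import Dict, FrozenSet

# Hypothetical excerpt of a parse table: parser state -> terminals that are
# valid in that state. The state numbers and symbols are made up for
# illustration; tree-sitter does not provide this mapping as described here.
EXPECTED_BY_STATE: Dict[int, FrozenSet[str]] = {
    17: frozenset({",", "]"}),
    42: frozenset({";"}),
}


def expected_tokens_message(state: int, found: str) -> str:
    """Build an 'Expected tokens' message for the state where parsing failed."""
    expected = sorted(EXPECTED_BY_STATE.get(state, frozenset()))
    if not expected:
        # Nothing known about this state: fall back to a generic message.
        return f"Unexpected token {found!r}"
    quoted = " ".join(f"'{tok}'" for tok in expected)
    return f"Expected tokens: {quoted} (found {found!r})"


print(expected_tokens_message(17, "}"))
```

The same lookup could in principle be attached to a MISSING node if the parser recorded the state it was in when it inserted that node.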

I could use this in my own language parser, as I want to get a list of errors to send back to the user on failure.

Just to save you some trouble, the branch you reference doesn’t really provide information that is suitable for presenting a list of errors like this to the user.

From what I remember, you get a list of ERROR (or related) nodes but what you really need is information about why those nodes occurred, such as what the parser was expecting versus what it encountered. But tree-sitter still does not provide a way to access this information from the parsing state.

If you want to be able to present this information to users of your parser you will likely need to have a separate syntax validation phase that walks the tree-sitter tree and computes this information. Unfortunately this means you essentially have to write a parser twice, in a sense, because you’re just duplicating the same structure you already defined through the tree-sitter rules. But that’s about the best you can manage right now.
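A separate validation phase of this kind might look roughly like the sketch below. It is only an outline under stated assumptions: `FakeNode` is a stand-in that mimics the `type`, `children`, `start_point` and `is_missing` attributes of a py-tree-sitter node so the example runs without a compiled grammar, and the `let_statement` rule and its messages are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional, Tuple


@dataclass
class FakeNode:
    """Stand-in for a py-tree-sitter Node: only the attributes used below."""
    type: str
    start_point: Tuple[int, int] = (0, 0)
    children: List["FakeNode"] = field(default_factory=list)
    is_missing: bool = False


def check_let(node) -> Optional[str]:
    """Hand-written rule that repeats what the grammar already says."""
    if any(c.type == ";" and c.is_missing for c in node.children):
        return "';' expected after declaration"
    if all(c.type != "identifier" for c in node.children):
        return "variable name expected after type"
    return None


# One checker per node type: this is the duplicated "second parser".
RULES: Dict[str, Callable[[FakeNode], Optional[str]]] = {
    "let_statement": check_let,
}


def validate(node) -> Iterator[Tuple[Tuple[int, int], str]]:
    """Walk the tree and yield (start_point, message) for every failed rule."""
    rule = RULES.get(node.type)
    if rule is not None:
        message = rule(node)
        if message is not None:
            yield node.start_point, message
    for child in node.children:
        yield from validate(child)


tree = FakeNode("program", children=[
    FakeNode("let_statement", (0, 0), [
        FakeNode("let"), FakeNode("type"), FakeNode("identifier"),
        FakeNode(";", (0, 21), is_missing=True),
    ]),
])
print(list(validate(tree)))
```

Every rule in `RULES` has to be kept in sync with the grammar by hand, which is exactly the duplication described above.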

After looking at the linked linter, it appears that the only real way to do this right now would be to have a grammar that documents all the failure cases and injects error nodes into the tree.

Yep, I can confirm. After some time searching online, I realized the same thing. I am creating a small language for my compilers class at uni, and I need to do type checking as well as show parsing errors. For type checking, I am traversing the tree, generating symbol tables so I can look up types for identifiers and check expression values. This is tedious but straightforward. I declare fields explicitly and group grammar rules, such as expressions (values, logical, relational, arithmetic) and statements (do-while, if, etc.). Since my language is pretty simple, I use a switch for each group and call the function that traverses nodes of that group to yield values back. This way, I can generate symbol tables and perform type checking when needed for each case.
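The traversal described above could be sketched as follows. This is a simplified guess at the shape of such a pass, not the poster's actual code: `FakeNode` is a stand-in for a py-tree-sitter node (with `text` as bytes, as in py-tree-sitter), the node type names are taken from the AST shown later in this thread, and the handlers only cover declaration and lookup, the step that type checks build on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeNode:
    """Stand-in for a py-tree-sitter Node: only the attributes used below."""
    type: str
    text: bytes = b""
    children: List["FakeNode"] = field(default_factory=list)


def child_of_type(node, kind: str):
    """Return the first direct child with the given node type, or None."""
    return next((c for c in node.children if c.type == kind), None)


def declare(stmt, symbols: Dict[str, str], errors: List[str]) -> None:
    """let_statement: record the declared type under the identifier's name."""
    name = child_of_type(stmt, "identifier").text.decode("utf-8")
    if name in symbols:
        errors.append(f"'{name}' is already declared")
    symbols[name] = child_of_type(stmt, "type").text.decode("utf-8")


def check_input(stmt, symbols: Dict[str, str], errors: List[str]) -> None:
    """input_statement: the identifier must already be in the symbol table."""
    name = child_of_type(stmt, "identifier").text.decode("utf-8")
    if name not in symbols:
        errors.append(f"'{name}' is not declared")


# Dispatch table playing the role of the per-group switch.
HANDLERS: Dict[str, Callable] = {
    "let_statement": declare,
    "input_statement": check_input,
}


def check_program(root) -> List[str]:
    symbols: Dict[str, str] = {}
    errors: List[str] = []
    for stmt in root.children:
        handler = HANDLERS.get(stmt.type)
        if handler is not None:
            handler(stmt, symbols, errors)
    return errors


program = FakeNode("program", children=[
    FakeNode("let_statement", children=[
        FakeNode("type", b"string"), FakeNode("identifier", b"somestring")]),
    FakeNode("input_statement", children=[FakeNode("identifier", b"other")]),
])
print(check_program(program))
```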

I get the impression that tree-sitter doesn’t really try to help you do this at the moment, down to the line/column values being 0-based rather than 1-based.

It seems that way to a certain extent. I think tree-sitter could make it easier to get the parsing errors and similar things, but to be honest, it’s quite easy to do, especially in Python:

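import sys

from colorama import Fore     # third-party: terminal colors
from tree_sitter import Node  # py-tree-sitter

# Note: `file`, `file_name`, `file_lines` and `print_file_with_errors` are
# assumed to be defined elsewhere in the script; they are not shown here.


def get_error_nodes(node: Node):
    """Yield every ERROR or MISSING node below `node`, in document order."""
    def traverse_tree_for_errors(node: Node):
        for n in node.children:
            if n.type == "ERROR" or n.is_missing:
                yield n
            if n.has_error:
                # there is an error inside this node, so check its children
                yield from traverse_tree_for_errors(n)

    yield from traverse_tree_for_errors(node)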


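def print_error_line(er_line: int, padding: str, column_start, column_end, node_error):
    # Print the offending source line (er_line is 1-based, file_lines is 0-based).
    print(f"{padding}{er_line}{padding}{file_lines[er_line-1]}")
    # Indent past the line number so the marker lines up with the error column.
    padding_with_line_number = " " * (len(f"{er_line}") + column_start-1)
    # MISSING nodes are zero-width, so always draw at least one '~'.
    cursor_size = max(1, column_end - column_start)
    print(
        f"{padding * 2}{Fore.RED}{padding_with_line_number}{'~' * cursor_size}{Fore.RESET}")

    if node_error.has_error and node_error.is_missing:
        # sexp() gives e.g. (MISSING ";"); [1:-1] strips the outer parentheses.
        error_message = f"{node_error.sexp()[1:-1]}"
    else:
        # For ERROR nodes, show the source text of the tokens that did not fit.
        unexpected_tokens = "".join(n.text.decode('utf-8')
                                    for n in node_error.children)
        error_message = f"Unexpected token(s): {unexpected_tokens}"
    print(
        f"{padding * 2}{Fore.RED}{padding_with_line_number}{error_message}:{Fore.RESET}")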


def print_error(root_node: Node, error_type: str = "SYNTAX_ERROR"):
    padding = " " * 5
    for node_error in get_error_nodes(root_node):
        er_line = node_error.start_point[0]+1
        column_start = node_error.start_point[1] + 1
        column_end = node_error.end_point[1] + 1
        print(
            f"{Fore.RED}{error_type}{Fore.RESET}:  {node_error.sexp()[1:-1]}")
        print(
            f"{padding}in file: '{file}:{er_line}:{column_start}:{column_end}', line: {er_line}", end=", ")
        print(
            f"from column {column_start} to {column_end}\n")
        print(f"{padding}{file_name}")
        if "--show-file" in sys.argv:
            print_file_with_errors(er_line, padding,
                                   column_start, column_end, node_error)
        else:
            print_error_line(er_line, padding, column_start,
                             column_end, node_error)

An example:

let string somestring /* missing ; */

do{
    input( a + b ); /* input only accepts an identifier as an argument */
}while(); /* Missing expression */

python main.py jspdl_language/test2.jspdl --ast
AST:
└── program
    ├── let_statement
    │   ├── let
    │   ├── type
    │   │   └── string
    │   ├── identifier
    │   └── ;
    └── do_while_statement
        ├── do
        ├── {
        ├── input_statement
        │   ├── input
        │   ├── (
        │   ├── identifier
        │   ├── ERROR
        │   │   ├── +
        │   │   └── identifier
        │   ├── )
        │   └── ;
        ├── }
        ├── while
        ├── (
        └── expression
            └── value
                └── expression_value
                    └── literal_string
        ├── )
        └── ;
SYNTAX_ERROR:  MISSING ";"
     in file: 'jspdl_language/test2.jspdl:1:22:22', line: 1, from column 22 to 22

     /Users/pepperonico/Developer/upm/TdL/jspdl_language/test2.jspdl
     1     let string somestring
                                ~
                                MISSING ";":
SYNTAX_ERROR:  ERROR (identifier)
     in file: 'jspdl_language/test2.jspdl:4:14:17', line: 4, from column 14 to 17

     /Users/pepperonico/Developer/upm/TdL/jspdl_language/test2.jspdl
     4         input( a + b);
                        ~~~
                        Unexpected token(s): +b:
SYNTAX_ERROR:  MISSING "literal_string"
     in file: 'jspdl_language/test2.jspdl:5:8:8', line: 5, from column 8 to 8

     /Users/pepperonico/Developer/upm/TdL/jspdl_language/test2.jspdl
     5     }while();
                  ~
                  MISSING "literal_string":

It’s weird how tree-sitter generates parsing errors. For example, in the last one, as the tree shows, it assumed that a string should appear inside the while() because string is the first rule under the value rule, which in turn is the first rule under the expression rule (see attached grammar.js). This seems counterintuitive to me. I would expect it to report the missing rule as an expression (which, as the AST shows, is what belongs there), but instead it assumed that, if something were there, one of the rules that might fit is a string.

I understand the premise of tree-sitter trying to parse a file and recover from missing rules or tokens that don’t fit. But in my opinion, it assumes too much. Maybe I don’t fully understand how tree-sitter works, but I think it could just assume that there’s an expression and keep going. Instead it inferred up to two more rules, which was completely unnecessary. Again, I am a newbie, and maybe my grammar is not well defined. Here is the link to the grammar.js file.
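One possible workaround on the reporting side is to climb from the MISSING leaf to the outermost ancestor that is also zero-width, and name that rule in the message. The sketch below assumes that the wrapper nodes inferred around a MISSING leaf span no source text (as the "from column 8 to 8" output above suggests); `FakeNode` and `reported_rule` are hypothetical names, with `FakeNode` mimicking the `parent`, `start_byte` and `end_byte` attributes of a py-tree-sitter node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeNode:
    """Stand-in for a py-tree-sitter Node: only the attributes used below."""
    type: str
    start_byte: int
    end_byte: int
    parent: Optional["FakeNode"] = None
    is_missing: bool = False


def reported_rule(missing_node):
    """Return the outermost ancestor that, like the MISSING leaf, is zero-width."""
    node = missing_node
    while (node.parent is not None
           and node.parent.start_byte == node.parent.end_byte):
        node = node.parent
    return node


# Mirrors the `}while();` case: expression > value > expression_value > literal_string
loop = FakeNode("do_while_statement", 0, 9)
expression = FakeNode("expression", 7, 7, parent=loop)
value = FakeNode("value", 7, 7, parent=expression)
expression_value = FakeNode("expression_value", 7, 7, parent=value)
leaf = FakeNode("literal_string", 7, 7, parent=expression_value, is_missing=True)

print(f"Missing {reported_rule(leaf).type}")
```

This does not change what the parser infers; it only picks a less specific rule name when building the message.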


After looking at the linked linter, it appears that the only real way to do this right now would be to have a grammar that documents all the failure cases and injects error nodes into your tree. Parse the document, run a query to look up all the error nodes, and report them (remember to add 1 to the line and column).
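The "add 1" step can be sketched as a small helper. The function name and output format are illustrative choices, not part of tree-sitter; the input tuples are the 0-based `(row, column)` points that nodes report as `start_point` and `end_point`. The error nodes themselves would come from a query (a pattern such as `(ERROR) @error` should match them) or from a tree walk like the one shown earlier.

```python
from typing import Tuple

Point = Tuple[int, int]  # (row, column), 0-based as tree-sitter reports them


def format_location(path: str, start: Point, end: Point) -> str:
    """Render a 0-based tree-sitter range as a 1-based editor-style location."""
    line, col = start[0] + 1, start[1] + 1
    end_line, end_col = end[0] + 1, end[1] + 1
    return f"{path}:{line}:{col}-{end_line}:{end_col}"


# The ERROR node from the example above: 0-based row 3, columns 13 to 16.
print(format_location("test2.jspdl", (3, 13), (3, 16)))
```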

The main thing missing is a way to provide the parser some kind of callback that gets information about errors so that you can use your own mechanism for reporting errors.
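Until the parser offers such a hook, a callback-style interface can be approximated after parsing by walking the finished tree and handing each problem node to a user-supplied function. The sketch below is that approximation, not a parser callback: `report_errors` and the callback signature are hypothetical, and `FakeNode` stands in for a py-tree-sitter node so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class FakeNode:
    """Stand-in for a py-tree-sitter Node: only the attributes used below."""
    type: str
    start_point: Tuple[int, int] = (0, 0)
    children: List["FakeNode"] = field(default_factory=list)
    is_missing: bool = False
    has_error: bool = False


# Callback receives: kind ("missing" or "unexpected"), 0-based point, node.
ErrorCallback = Callable[[str, Tuple[int, int], object], None]


def report_errors(root, on_error: ErrorCallback) -> None:
    """Call on_error for every MISSING or ERROR node, in document order."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.is_missing:
            on_error("missing", node.start_point, node)
        elif node.type == "ERROR":
            on_error("unexpected", node.start_point, node)
        if node.has_error:
            # Only descend into subtrees that contain an error somewhere.
            stack.extend(reversed(node.children))


root = FakeNode("program", has_error=True, children=[
    FakeNode("let_statement", has_error=True, children=[
        FakeNode(";", (0, 21), is_missing=True, has_error=True)]),
    FakeNode("input_statement", (3, 4), has_error=True, children=[
        FakeNode("ERROR", (3, 13), has_error=True)]),
])

collected = []
report_errors(root, lambda kind, point, node: collected.append((kind, point)))
print(collected)
```

Keeping detection in `report_errors` and presentation in the callback lets the same walk feed a terminal printer, an editor's diagnostics list, or a test.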

Am I correct in understanding that if I actually hooked up the parser and looked at the tree using the API, I would get better error information than simply “ERROR”?

Correct, in fact @aerijo put together a linter-tree-sitter package for Atom that takes advantage of this to show Tree-sitter parse errors in the standard linting interface: https://github.com/Aerijo/linter-tree-sitter/blob/master/lib/linter.js

The API for this is already on the master branch.

This would be awesome, as I kinda expected that to already work.

Might also be very helpful for debugging.