lark: Contextual lexer matching terminal that can't be reached in context

I think this is a bug, but it’s possible I’m misunderstanding how the contextual lexer works.

My grammar has tokens that depend on the context. Until now it seemed to work fine using LALR with the contextual lexer, but after I added the return_stmt rule it started making wrong choices when lexing the code.

In this case, where the grammar should parse a NAME_CONT it’s parsing a DEC_NUMBER.

Let me give some more info:

The grammar is trying to match the code:

*** Functions ***
My Function 1
    While var 2
        var 3

It’s matching the 1 from My Function 1 as a DEC_NUMBER instead of a NAME_CONT, even though a DEC_NUMBER isn’t valid in that context and a NAME_CONT is. The identifier rule is identifier: NAME (NAME|NAME_CONT|WS)*. It matched NAME for My and NAME_CONT for Function, yet it then failed to match the 1 as NAME_CONT and matched it as a DEC_NUMBER instead.

The place where the rules should be matching is at the func_name – i.e.:

file_input: (_NEWLINE | root_stmt)*
?root_stmt: func_block
func_block: func_header WS? _NEWLINE (func_stmt)*
func_header: BLOCK_START "functions"i  BLOCK_END
func_stmt: func_name func_suite
func_name: identifier
?identifier: NAME  (NAME|NAME_CONT|WS)*
NAME: /[^\d\W]\w*/
NAME_CONT: /\w+/
WS: /\s+/

That alone works, but it fails in the actual grammar:

test_lark_contextual_lexer_issue.py.txt

This happened right after I added the return_stmt!

In that same grammar, just removing the return_stmt makes the lexer do the right thing, though I’m not sure why.

The docs say that:

The contextual lexer communicates with the parser, and uses the parser’s lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals.
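
The idea in that passage can be illustrated with a toy, standard-library-only sketch. This is not Lark's implementation: the `TERMINALS` table, its ordering, the `next_token` helper, and the `\d+` pattern for DEC_NUMBER are all assumptions made for illustration. It only shows why the set of terminals allowed at a given point changes which terminal the same text is assigned to.

```python
import re

# Toy terminal table. The order stands in for lexer priority and is an
# assumption; DEC_NUMBER's pattern is also assumed, not taken from the grammar.
TERMINALS = [
    ("DEC_NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[^\d\W]\w*")),
    ("NAME_CONT", re.compile(r"\w+")),
    ("WS", re.compile(r"\s+")),
]


def next_token(text, pos, accepted):
    """Return (terminal, value) for the first accepted terminal matching at pos."""
    for name, pattern in TERMINALS:
        if name not in accepted:
            continue  # terminal is not legal in this (toy) parser state
        match = pattern.match(text, pos)
        if match:
            return name, match.group()
    return None


# Only identifier terminals are legal: "1" becomes NAME_CONT.
print(next_token("1", 0, {"NAME", "NAME_CONT", "WS"}))
# → ('NAME_CONT', '1')

# If DEC_NUMBER were also legal in the same state, it would win here.
print(next_token("1", 0, {"DEC_NUMBER", "NAME", "NAME_CONT", "WS"}))
# → ('DEC_NUMBER', '1')
```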

Given that, I think it’s a bug: the lexer is not considering the context as it should in this case. Otherwise, either I’m misunderstanding how it decides what to lex based on the context, or the grammar can reach a DEC_NUMBER through some path that I can’t see. As far as I can tell, the func_suite needs a _NEWLINE to start, so it should not be possible to get to a DEC_NUMBER from there.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

In this case, where the grammar should parse a NAME_CONT it’s parsing a DEC_NUMBER

Is it actually parsing it as DEC_NUMBER, or only saying so in the UnexpectedToken error?

It’s actually parsing as a DEC_NUMBER.

The full exception trace is:

Traceback (most recent call last):
  File "X:\lark\lark\parsers\lalr_parser.py", line 86, in feed_token
    action, arg = states[state][token.type]
KeyError: 'DEC_NUMBER'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "X:\vscode-robot\rflang\tests\rflang_tests\test_lark_contextual_lexer_issue.py", line 97, in <module>
    parse(
  File "X:\vscode-robot\rflang\tests\rflang_tests\test_lark_contextual_lexer_issue.py", line 87, in parse
    tree = lark_spec.parse(source_code)
  File "X:\lark\lark\lark.py", line 493, in parse
    return self.parser.parse(text, start=start)
  File "X:\lark\lark\parser_frontends.py", line 138, in parse
    return self._parse(start, self.make_lexer(text))
  File "X:\lark\lark\parser_frontends.py", line 73, in _parse
    return self.parser.parse(input, start, *args)
  File "X:\lark\lark\parsers\lalr_parser.py", line 35, in parse
    return self.parser.parse(*args)
  File "X:\lark\lark\parsers\lalr_parser.py", line 129, in parse
    return self.parse_from_state(parser_state)
  File "X:\lark\lark\parsers\lalr_parser.py", line 145, in parse_from_state
    raise e
  File "X:\lark\lark\parsers\lalr_parser.py", line 136, in parse_from_state
    state.feed_token(token)
  File "X:\lark\lark\parsers\lalr_parser.py", line 89, in feed_token
    raise UnexpectedToken(token, expected, state=state, puppet=None)
lark.exceptions.UnexpectedToken: Unexpected token Token('DEC_NUMBER', '1') at line 3, column 13.
Expected one of: 
	* WS
	* _NEWLINE
	* NAME
	* NAME_CONT