logos: Silent ambiguous matches

As it stands now, ambiguous match in logos seems to produce last encountered token. I had a typo in a regex which caused two different tokens tokens to match at the same time:

#[token=" "]
White

#[token=" [\n ]*"
LongWhite

...

The outcome was that whenever ambiguity was encountered Logos silently did not update the token field. It is this expected/documented behaviour? I guess the speed is of concern, and ideally we don’t want to disambiguate at runtime, but perhaps at some point compile time checking could be introduced.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 17 (16 by maintainers)

Commits related to this issue

Most upvoted comments

Great, thanks for lighting fast fix! Will sure report everything I will find since I am switching lexing exclusively to Logos. The “unusual” Hoon syntax is apparently good at breaking things 😉

@mikolajpp released 0.9.2 with a fix. Thanks again for being patient with this and adding the test case. The fact that we were able to isolate few cases where things progressively break helped a lot, since there were a couple interesting things happening in the branching algo here:

  • When building a fork for ~(m|h|s) when a branch ~[a-z] is present, an intersection should have been detected, but that only worked while the ~ token was absent. Adding ~ definition turned the fork on second byte into a Maybe variant which disabled this (this is now fixed with 0.9.2).
  • The slack was being picked up by the fallback mechanism, but that only works if the “longest” (representing most bytes) pattern on that byte also satisfies the first pattern of every other branch. Adding ~[, where [ does not belong in the set a-z disabled that part.
  • The intersection mechanism prior to 0.9.0 only matched single byte patterns intersecting with sets, that’s why (m|h|s) worked (because it creates 3 single byte branches), but [mhs] didn’t (because it created a single branch with a class pattern).

To make things properly robust in the future:

  • The fallback mechanism needs to check if a “longer” branch that is also a repeat matches a “shorter” branch, and then attaches the fallback pattern to that branch, instead of handling that logic during code generation.
  • in case of forks that share the first byte of a pattern, we should detect if those can be combined into a class instead of creating separate branches.

Feel free to open a new issue if you find anything else!

This bug has something to do with Logos’ handling of regexes. This works:

   #[token="~"]
    LiteralNull, 

    #[regex="~[a-z][a-z]+"]
    LiteralUrbitAddress, 

    #[regex="~(h|m|s)[0-9]+"]
    LiteralRelDate, 

    #[regex="~[1]"]
    LiteralAbsDate, 

This does not work

   #[token="~"]
    LiteralNull, 

    #[regex="~[a-z][a-z]+"]
    LiteralUrbitAddress, 

    #[regex="~(h|m|s)[0-9]+"]
    LiteralRelDate, 

    #[regex="~[01]"]
    LiteralAbsDate, 

This also does not work

   #[token="~"]
    LiteralNull, 

    #[regex="~[a-z][a-z]+"]
    LiteralUrbitAddress, 

    #[regex="~[hms][0-9]+"]
    LiteralRelDate, 

There seems to be crucial difference wrt how classes are handled v explicit forks. I have put test which triggers this into a branch here: https://github.com/mikolajpp/logos/tree/common-fork-bug

Right! Those are definitely bugs, thanks for reporting!