logos: Lexer produces wrong tokens when more input is provided

My logos lexer does not match the TK_NOT token when more input (such as whitespace) follows it. Instead it matches the TK_WORD token in that case, which should be wrong since TK_WORD has a lower priority.

Reproducible example with tests:


use logos::Logos;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Logos)]
#[allow(non_camel_case_types)]
pub enum SyntaxKind {
    #[regex(r"[ \t]+", priority = 1)]
    TK_WHITESPACE = 0,
    #[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 1)]
    TK_WORD,

    #[token("not", priority = 50)]
    TK_NOT,
    #[token("not in", priority = 60)]
    TK_NOT_IN,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn single_not_works() {
        let mut lexer = SyntaxKind::lexer("not");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
    }

    #[test]
    fn word_then_not_works() {
        let mut lexer = SyntaxKind::lexer("word not");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
    }

    #[test]
    fn but_this_does_not_work() {
        let mut lexer = SyntaxKind::lexer("not word");
        // FAILED because
        // Left:  Some(Ok(TK_WORD))
        // Right: Some(Ok(TK_NOT))
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
    }

    #[test]
    fn this_is_fine() {
        let mut lexer = SyntaxKind::lexer("not in ");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT_IN)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
    }
}

I know that the situation with TK_NOT and TK_NOT_IN is maybe not ideal (if I remove the latter, it works again). But for my parser it would be much better to have these tokens rather than two separate TK_NOT and TK_IN tokens. I would be thankful for any suggestions that don’t require removing either TK_WORD or TK_NOT_IN to make the test but_this_does_not_work pass.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 20 (1 by maintainers)

Most upvoted comments

I had been running into what seemed like a similar issue, where a longer match would fail and the lexer would backtrack to the wrong token. What was also strange was that the &str (i.e. lex.slice()) contained within the mismatched token was actually the slice for what should have been matched. So to hack around the issue I simply created a callback that took lex.slice() and created a new lexer to re-parse the slice on its own and correct the token (the idea is sketched after the reproduction below).

The code to reproduce my issue is a lot simpler, as I was able to reduce it down to not perform any regex matching at all, although it still needed a #[regex(...)] token to trigger the bug, seemingly due to differences in how the graph is generated for regex vs. text tokens.

use logos::Logos;

#[derive(Debug, PartialEq, Eq, Logos)]
pub enum Token<'s> {
    #[token("Foo")]
    Foo(&'s str),
    #[token("FooBar")]
    FooBar(&'s str),
    #[regex("FooBarQux")]
    FooBarQux(&'s str),
}

#[cfg(test)]
mod tests {
    use itertools::Itertools;
    use logos::Logos;

    use super::*;

    #[test]
    fn test() {
        let lexer = Token::lexer("FooBarQ");
        let expected: Vec<Result<_, ()>> = vec![Ok(Token::FooBar("FooBar"))];
        let results = lexer
            .zip_longest(expected)
            .map(|x| x.left_and_right())
            .unzip::<_, _, Vec<_>, Vec<_>>();
        //  left: [Some(Ok(Foo("FooBar")))]
        // right: [Some(Ok(FooBar("FooBar")))]
        assert_eq!(results.0, results.1);
    }
}
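
For illustration, the re-lexing hack described above could look roughly like the following sketch. It wraps the lexer in a hypothetical next_corrected helper rather than reproducing the original callback, so the names and the exact recovery strategy here are assumptions, not the commenter’s actual code:

use logos::{Lexer, Logos};

// Hypothetical helper: since the slice inside the mismatched token is
// actually correct, re-lexing that slice in isolation recovers the token
// it should have been.
fn next_corrected<'s>(lex: &mut Lexer<'s, Token<'s>>) -> Option<Result<Token<'s>, ()>> {
    let token = lex.next()?;
    match Token::lexer(lex.slice()).next() {
        // Prefer the isolated re-parse of the slice over the original token.
        Some(reparsed) => Some(reparsed),
        None => Some(token),
    }
}

For the "FooBarQ" input above, the raw lexer yields Foo("FooBar"), while re-lexing the slice "FooBar" on its own yields FooBar("FooBar"), the expected token, because the isolated run can take the longest match without backtracking past it.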

Potential Fix?

I took some time today to look into it, and the issue appears to be in the generator, which would explain why priority had no effect (both for me and for @MalteJanz), as priority is only used in the steps before the generator. The issue appears to be with context tracking: the backtrack point wasn’t being updated, even though the slice was getting bumped correctly.

I was able to resolve my issue by adding || meta.min_read == 0 to the if logic at https://github.com/maciejhirsz/logos/blob/v0.14/logos-codegen/src/generator/mod.rs#L115-L119, as that appears to be what causes the slice to be bumped correctly in the match block following the if. After that change I was able to remove my re-lexing hack and all my tests still pass.

The change to the if logic also appears to fix the issue in the code originally posted by @MalteJanz, but strangely not in the newly posted minimal reproduction case.

I’m not familiar enough with the logos internals to say whether this is the correct fix, though; it doesn’t fix @MalteJanz’s minimal case, but perhaps that’s actually a separate bug.

The case originally posted by @MalteJanz does seem very similar to what I was doing, where I was matching a separator token (TK_WHITESPACE) and a sequence followed by a separator ("not in ") would match incorrectly.

Hi @jameshurst, thanks for your contribution! It solved the issue that had been blocking me for several days. However, I noticed that the PR still does not address the following case:

use logos::Logos;

#[derive(Debug, PartialEq, Logos)]
pub enum Token {
    #[regex(r"-?0+", priority = 10)]
    A,

    #[regex(r"(0|#)+")]
    B,
}

The PR works well when the leading -? is removed. I think the root cause is that Fork.merge fails to take the priority settings into consideration when merging two Leaf nodes. I tried to fix it with some ugly code, and it works fine in all my own cases. However, my fix breaks colors::match_colors, crunch::crunch, and priority_disambiguate_1::priority_abc, as well as the css::test_letter_spacing mentioned above.
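
A minimal test sketch for that case, assuming the enum above (the test name is made up; the expected value follows from A’s higher priority, while the exact failing output isn’t shown in the comment):

#[test]
fn negative_zeros_should_lex_as_a() {
    // Only A (priority = 10) can match "-00" from the start, so it should
    // win; per the comment above, this still mis-lexes with the PR applied.
    let mut lexer = Token::lexer("-00");
    assert_eq!(lexer.next(), Some(Ok(Token::A)));
}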

In my scenario, I assume that ‘priority’ should take precedence over ‘longest match’. I’m curious about the design principle of this library: does ‘longest match’ or ‘priority’ prevail? If priority is supposed to prevail, I could find some time to refine the code and submit a PR.

I was able to resolve my issue by adding || meta.min_read == 0 to the if logic at https://github.com/maciejhirsz/logos/blob/v0.14/logos-codegen/src/generator/mod.rs#L115-L119, as that appears to be what causes the slice to be bumped correctly in the match block following the if.

After staring at the code some more, I believe the proper fix may actually be to remove the if logic altogether and always perform ctx.switch(self.graph[id].miss()).

Sidenote: after removing the if statement, the ctx.switch function can be simplified so that it no longer performs ctx.bump and no longer returns a token stream. The match logic below the if statement then reduces to a single if enters_loop || meta.min_read == 0 check.
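
Pieced together from that description, the proposed shape might look roughly like this. It is a paraphrase, not the actual logos-codegen source; enters_loop, meta, and ctx are names taken on trust from the comments here, and exactly where the bump lands is an assumption:

// Always update the backtrack point for the current node; a miss() of
// None keeps the existing backtrack (see the call sites listed below).
ctx.switch(self.graph[id].miss());

// The old match logic reduces to a single condition.
if enters_loop || meta.min_read == 0 {
    ctx.bump();
}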

From what I can see Generator::goto will be called to generate:

  • the root node
  • the _ branch for a match statement
  • any other branches for a match statement

For each of these cases I would expect self.graph[id].miss() to indicate the correct backtrack position, or None which would cause ctx.switch to keep the existing backtrack.

I tested this change against my own library, the original code for this issue, and https://github.com/maciejhirsz/logos/issues/160, and all of them pass.

I’d be interested to hear your thoughts on this, @jeertmans (and @maciejhirsz’s, if available).