regex: $ doesn't match CRLF
I created a regex with multi_line set to true. After debugging why it matched in a unit test but not against a file, I found that $ isn’t matching the end of a line in the file. I’m on Windows, so the newlines are \r\n.
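One workaround, assuming it is acceptable to rewrite the input before matching, is to normalize `\r\n` to `\n`. The sketch below uses only the standard library. `normalize_newlines` and `any_line_ends_with` are hypothetical helper names, and the latter is only a stand-in for a multi-line `$` that recognizes `\n` alone; it is not the regex crate itself.

```rust
use std::borrow::Cow;

/// Replace every "\r\n" with "\n". Borrows the input when nothing changes.
fn normalize_newlines(text: &str) -> Cow<'_, str> {
    if text.contains("\r\n") {
        Cow::Owned(text.replace("\r\n", "\n"))
    } else {
        Cow::Borrowed(text)
    }
}

/// Stand-in for a multi-line `suffix$` search that treats only "\n" as a
/// line terminator, so a trailing "\r" stays attached to each line.
fn any_line_ends_with(text: &str, suffix: &str) -> bool {
    text.split('\n').any(|line| line.ends_with(suffix))
}
```

With CRLF input such as `"version = 1\r\nname = x\r\n"`, the check for the suffix `= 1` fails because each line still ends in `\r`; after `normalize_newlines` it succeeds. Note that normalizing shifts byte offsets, so match positions no longer map directly onto the original file.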
About this issue
- State: closed
- Created 8 years ago
- Comments: 67 (45 by maintainers)
Commits related to this issue
- Normalize "\r\n" to "\n" to ensure ^ and $ match line boundaries The regex crate only considers "\n" a line boundary, so ^ and $ don't match "\r\n". This can be a problem if code is checked out on Wi... — committed to mgeisler/version-sync by mgeisler 4 years ago
- syntax: add support for CRLF-aware line anchors This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '... — committed to rust-lang/regex by BurntSushi a year ago
- changelog: 1.9.0 I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Clo... — committed to rust-lang/regex by BurntSushi a year ago
- Update Rust crate regex to 1.9.1 (#1957) This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies |... — committed to Calciumdibromid/CaBr2 by deleted user a year ago
- changelog: 1.9.0 I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Clo... — committed to sourcefrog/regex by BurntSushi a year ago
OK, I’m happy to report that this feature should land once #656 is complete. I have it working. Would folks like to review the docs for it?
The key things to note here are:
- CRLF mode is enabled via `RegexBuilder::crlf`, or by using the new `R` inline flag.
- `\r` on its own is treated as a line terminator, as suggested by @BatmanAoD above. The key trick is that `^` and `$` won’t match between a `\r` and a `\n`. (I would ideally rather not treat `\r` as a line terminator unto itself, but it’s just not feasible to do.)
- `.` becomes `[^\r\n]` instead of `[^\n]`.
For a deeper dive, here’s a smattering of tests showing CRLF mode semantics:
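The tests themselves did not survive in this copy of the thread. As a rough stand-in, here is a std-only sketch of the rules listed above. It is not the crate's code: `crlf_lines` and `dot_matches` are hypothetical helpers that assume `\r`, `\n`, and `\r\n` each end a line, with `\r\n` counted as a single terminator.

```rust
/// Split `haystack` into lines under the CRLF-mode rules described above:
/// "\r\n" is one terminator, and a lone "\r" or "\n" is also a terminator.
fn crlf_lines(haystack: &str) -> Vec<&str> {
    let bytes = haystack.as_bytes();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        // Length in bytes of the terminator starting at `i`, or 0 if none.
        let term_len = match bytes[i] {
            b'\r' if bytes.get(i + 1) == Some(&b'\n') => 2,
            b'\r' | b'\n' => 1,
            _ => 0,
        };
        if term_len == 0 {
            i += 1;
            continue;
        }
        // '\r' and '\n' are ASCII, so `i` is always a valid char boundary.
        lines.push(&haystack[start..i]);
        i += term_len;
        start = i;
    }
    lines.push(&haystack[start..]);
    lines
}

/// What `.` accepts: `[^\r\n]` in CRLF mode, `[^\n]` otherwise.
fn dot_matches(c: char, crlf: bool) -> bool {
    if crlf {
        c != '\r' && c != '\n'
    } else {
        c != '\n'
    }
}
```

Under these assumptions, `"a\r\nb\rc\nd"` splits into four lines, while `"a\n\rb"` yields an empty line in the middle because `\n` followed by `\r` is two separate terminators.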
It was quite gnarly to add, and in so doing, I actually uncovered a bug in the lazy DFA (present in both the status quo and my rewrite):
@phaazon Please don’t post meme images on any issue tracker that I maintain.
(Also, just last week I actually did run into something that uses `\r` on its own as EOL by default: Putty in serial mode! I was shocked.)
I think it’s entirely reasonable to unconditionally consider `\r` to be the start of a line-ending, and handle the “`\n` following `\r`” situation as the actual special case. (That’s basically my suggestion from back in 2017. As I mentioned then, Putty actually does use `\r` in isolation by default, so it’s not an entirely obsolete form of newline.)
That sounds right to me. As a user, I can’t imagine a scenario where `.` failing to match `\r` would be more surprising than matching would be (outside of `dot_matches_new_line` mode, of course).
I assume that’s exactly why they phrased it the way they did, i.e., they’re talking about implementing a “multiline” mode (mentioned above the list), such as this library’s `dot_matches_new_line`. The requirement (as I read it) is that in this mode, `.` matches exactly the same things that `$` would match.
@jzabroski Yes. .NET is a backtracking based implementation, right? In that context, the implementation is far more flexible. UTS#18 is a bit of a tortured document, where the writers of it were clearly aware that some of their recommendations would be quite difficult to satisfy in the context of finite automata based regex engines, which is the case here.
In particular, I will definitely not be doing this:
- Making `$` Unicode-aware. If it was easy to do this, I’d do it. But it’s not. Combine that with the fact that I don’t think I have ever seen anyone actually want a Unicode-aware `$` means I’m not really motivated by this. CRLF is different, because of Windows.
- `.` needs to become `[^\r\n]`. I can’t do anything more complex than that. And I note that the wording of UTS#18 around `.` is really really weird: “Where the ‘arbitrary character pattern’ matches a newline sequence, it must match all of the newline sequences.” Like, huh? `.` is usually specified to not match a newline sequence. And the writing goes on to say that treating `\r\n` as a single unit is not necessary for conformance.
Otherwise, treating `\r` as a line-ending and `\r\n` as a single line ending seems like a strict subset of UTS#18.
I’ve been continuing to think about this, and I’m wondering if there is another path here. What if we did treat `\r` as a new line, but didn’t match between a `\r` and a `\n`? I think this would let us avoid needing to extend the delaying of matches from 1 byte to 2 bytes. Instead, `^` and `$` can be handled a bit like `\b` is handled: as both a look-behind and a look-ahead assertion. The key here is that if `\r` is treated as a newline unto itself, then `$` is known to match at positions where `\r` follows no matter what. At that point, all we have to do is ensure that it doesn’t match at positions between a `\r` and `\n`. And I think that’s doable.
I was inspired by looking at this mailing list thread again after thinking about the problem: https://groups.google.com/g/re2-dev/c/uNi95OBEiTY
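To make the look-behind/look-ahead idea concrete, here is a minimal sketch of the two assertions as plain predicates over bytes. The names `is_start_crlf` and `is_end_crlf` are chosen for illustration, and this is a reading of the description above rather than the crate's actual implementation. Both functions assume `at <= haystack.len()`.

```rust
/// `^` in CRLF mode: true at the start of the haystack, after "\n", or
/// after a "\r" that is not immediately followed by "\n".
fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    at == 0
        || haystack[at - 1] == b'\n'
        || (haystack[at - 1] == b'\r' && haystack.get(at) != Some(&b'\n'))
}

/// `$` in CRLF mode: true at the end of the haystack, before "\r", or
/// before a "\n" that is not immediately preceded by "\r".
fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
        || haystack[at] == b'\r'
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}
```

Each predicate inspects at most one byte behind and one byte ahead of the position, much like a word-boundary check. For `b"a\r\nb"`, `is_end_crlf` holds at offsets 1 and 4 and `is_start_crlf` at offsets 0 and 3; neither holds at offset 2, the position between `\r` and `\n`.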
By the way, sorry if my comment left you confused.
Do you happen to know which Unicode Regex libraries do support Unicode New Line? I went through icu4j source code, your source code, and a bunch of other places, and I don’t know where Mark Davis came up with his guidance for what a Unicode New Line is, since nobody seems all that interested in portable New Line characters.
I mean, I guess if it’s just for this one sigil, you have a great point. But you can take this “best path” too far, too. This is what IBM did for java.util.Regex: they wrote a Java pre-compiler that replaces ICU Regex with java.util.Regex, and… it’s IBM.
I guess if you’re not going to fix this issue, at least update the documentation referencing TR18 to clarify that you don’t handle “Unicode New Line” in R1.6 (which is what we’ve been discussing for .NET).
This is what I do for ripgrep: https://docs.rs/grep-regex/0.1.3/grep_regex/struct.RegexMatcherBuilder.html#method.crlf
Not to me. Sounds like the best path for some scenarios.
Also, one more thing and then I’ll shut up: it turns out that .NET, which is probably one of the highest authorities on Windows newline handling, has the same behavior as Python, this crate and Go: https://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Multiline
@mitsuhiko Oh interesting. I should have known for Go, but it’s interesting to see that Python doesn’t do it either:
So I guess we’re in good company?
But the existing implementation is more wrong, so I’m not sure I understand that as an objection.