regex: $ doesn't match CRLF

I created a regex with multi_line set to true, but after debugging why the regex was matching in a unittest but not in a file, I found out that $ isn’t matching the end of a line in the file. I’m using Windows so the newlines are \r\n.

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 67 (45 by maintainers)

Commits related to this issue

Most upvoted comments

OK, I’m happy to report that this feature should land once #656 is complete. I have it working. Would folks like to review the docs for it?

    /// Enable or disable the "CRLF mode" flag by default.
    ///
    /// By default this is disabled. It may alternatively be selectively
    /// enabled in the regular expression itself via the `R` flag.
    ///
    /// When CRLF mode is enabled, the following happens:
    ///
    /// * Unless `dot_matches_new_line` is enabled, `.` will match any character
    /// except for `\r` and `\n`.
    /// * When `multi_line` mode is enabled, `^` and `$` will treat `\r\n`,
    /// `\r` and `\n` as line terminators. And in particular, neither will
    /// match between a `\r` and a `\n`.

The key things to note here are:

  • CRLF mode is opt-in.
  • You’ll be able to opt-in via RegexBuilder::crlf, or by using the new R inline flag.
  • \r on its own is treated as a line terminator, as suggested by @BatmanAoD above. The key trick is that ^ and $ won’t match between a \r and a \n. (I would ideally rather not treat \r as a line terminator unto itself, but it’s just not feasible to do.)
  • In CRLF mode, . becomes [^\r\n] instead of [^\n].

For a deeper dive, here’s a smattering of tests showing CRLF mode semantics:

# This is a basic test that checks ^ and $ treat \r\n as a single line
# terminator. If ^ and $ only treated \n as a line terminator, then this would
# only match 'xyz' at the end of the haystack.
[[test]]
name = "basic"
regex = '(?mR)^[a-z]+$'
haystack = "abc\r\ndef\r\nxyz"
matches = [[0, 3], [5, 8], [10, 13]]

# Tests that a CRLF-aware '^$' assertion does not match between CR and LF.
[[test]]
name = "start-end-non-empty"
regex = '(?mR)^$'
haystack = "abc\r\ndef\r\nxyz"
matches = []

# Tests that a CRLF-aware '^$' assertion matches the empty string, just like
# a non-CRLF-aware '^$' assertion.
[[test]]
name = "start-end-empty"
regex = '(?mR)^$'
haystack = ""
matches = [[0, 0]]

# Tests that a CRLF-aware '^$' assertion matches the empty string preceding
# and following a line terminator.
[[test]]
name = "start-end-before-after"
regex = '(?mR)^$'
haystack = "\r\n"
matches = [[0, 0], [2, 2]]

# Tests that a CRLF-aware '^' assertion does not split a line terminator.
[[test]]
name = "start-no-split"
regex = '(?mR)^'
haystack = "abc\r\ndef\r\nxyz"
matches = [[0, 0], [5, 5], [10, 10]]

# Same as above, but with adjacent runs of line terminators.
[[test]]
name = "start-no-split-adjacent"
regex = '(?mR)^'
haystack = "\r\n\r\n\r\n"
matches = [[0, 0], [2, 2], [4, 4], [6, 6]]

# Same as above, but with adjacent runs of just carriage returns.
[[test]]
name = "start-no-split-adjacent-cr"
regex = '(?mR)^'
haystack = "\r\r\r"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Same as above, but with adjacent runs of just line feeds.
[[test]]
name = "start-no-split-adjacent-lf"
regex = '(?mR)^'
haystack = "\n\n\n"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Tests that a CRLF-aware '$' assertion does not split a line terminator.
[[test]]
name = "end-no-split"
regex = '(?mR)$'
haystack = "abc\r\ndef\r\nxyz"
matches = [[3, 3], [8, 8], [13, 13]]

# Same as above, but with adjacent runs of line terminators.
[[test]]
name = "end-no-split-adjacent"
regex = '(?mR)$'
haystack = "\r\n\r\n\r\n"
matches = [[0, 0], [2, 2], [4, 4], [6, 6]]

# Same as above, but with adjacent runs of just carriage returns.
[[test]]
name = "end-no-split-adjacent-cr"
regex = '(?mR)$'
haystack = "\r\r\r"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Same as above, but with adjacent runs of just line feeds.
[[test]]
name = "end-no-split-adjacent-lf"
regex = '(?mR)$'
haystack = "\n\n\n"
matches = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Tests that '.' does not match either \r or \n when CRLF mode is enabled. Note
# that this doesn't require multi-line mode to be enabled.
[[test]]
name = "dot-no-crlf"
regex = '(?R).'
haystack = "\r\n\r\n\r\n"
matches = []

It was quite gnarly to add, and in so doing, I actually uncovered a bug in the lazy DFA (present in both the status quo and my rewrite):

# A variant of the problem described here:
# https://github.com/google/re2/blob/89567f5de5b23bb5ad0c26cbafc10bdc7389d1fa/re2/dfa.cc#L658-L667
[[test]]
name = "alt-with-assertion-repetition"
regex = '(?:\b|%)+'
haystack = "z%"
bounds = [1, 2]
anchored = true
# This was reporting [1, 2], which is
# not the correct leftmost-first match.
matches = [[1, 1]]

@phaazon Please don’t post meme images on any issue tracker that I maintain.

(Also, just last week I actually did run into something that uses \r on its own as EOL by default: Putty in serial mode! I was shocked.)

I think it’s entirely reasonable to unconditionally consider \r to be the start of a line-ending, and handle the “\n following \r” situation as the actual special case. (That’s basically my suggestion from back in 2017. As I mentioned then, Putty actually does use \r in isolation by default, so it’s not an entirely obsolete form of newline.)

it sounds like . needs to become [^\r\n].

That sounds right to me. As a user, I can’t imagine a scenario where . failing to match \r would be would be more surprising than matching would be (outside of dot_matches_new_line mode, of course).

“Where the ‘arbitrary character pattern’ matches a newline sequence, it must match all of the newline sequences.” Like, huh? . is usually specified to not match a newline sequence.

I assume that’s exactly why they phrased it the way they did, i.e., they’re talking about implementing a “multiline” mode (mentioned above the list), such as this library’s dot_matches_new_line. The requirement (as I read it) is that in this mode, . matches exactly the same things that $ would match.

@jzabroski Yes. .NET is a backtracking based implementation, right? In that context, the implementation is far more flexible. UTS#18 is a bit of a tortured document, where the writers of it were clearly aware that some of their recommendations would be quite difficult to satisfy in the context of finite automata based regex engines, which is the case here.

In particular, I will definitely not be doing this:

  • Making $ Unicode-aware. If it was easy to do this, I’d do it. But it’s not. Combine that with the fact that I don’t think I have ever seen anyone actually want a Unicode-aware $ means I’m not really motivated by this. CRLF is different, because of Windows.
  • If I do add CRLF, it sounds like . needs to become [^\r\n]. I can’t do anything more complex than that. And I note that the wording of UTS#18 around . is really really weird: “Where the ‘arbitrary character pattern’ matches a newline sequence, it must match all of the newline sequences.” Like, huh? . is usually specified to not match a newline sequence. And the writing goes on to say that treating \r\n as a single unit is not necessary for conformance.

Otherwise, treating \r as a line-ending and \r\n as a single line ending seems like a strict subset of UTS#18.

I’ve been continuing to think about this, and I’m wondering if there is another path here. What if we did treat \r as a new line, but didn’t match between a \r and a \n? I think this would let us avoid needing to extra the delaying of matches from 1 byte to 2 bytes. Instead, ^ and $ can be handled a bit like \b is handled: as both a look-behind and a look-ahead assertion. The key here is that if \r is treated a newline unto itself, then $ is known to match at positions where \r follows no matter what. At that point, all we have to do is ensure that it doesn’t match at positions between a \r and \n. And I think that’s doable.

I was inspired by looking at this mailing list thread again after thinking about the problem: https://groups.google.com/g/re2-dev/c/uNi95OBEiTY

By the way, sorry if my comment left you confused.

Do you happen to know which Unicode Regex libraries do support Unicode New Line? I went through icu4j source code, your source code, and a bunch of other places, and I don’t know where Mark Davis came up with his guidance for what a Unicode New Line is, since nobody seems all that interested in portable New Line characters.

“clean CRLF by hand before using regex”, but that seems like completely insane.

Not to me. Sounds like the best path for some scenarios.

I mean, I guess if it’s just for this one sigil, you have a great point. But you can take this “best path” too far, too. This is what IBM did for java.util.Regex: they wrote a Java pre-compiler that replaces ICU Regex with java.util.Regex, and… it’s IBM.

I guess if you’re not going to fix this issue, at least update the documentation referencing TR18 to clarify that you don’t handle “Unicode New Line” in R1.6 (which is what we’ve been discussing for .NET).

This is what I do for ripgrep: https://docs.rs/grep-regex/0.1.3/grep_regex/struct.RegexMatcherBuilder.html#method.crlf

“clean CRLF by hand before using regex”, but that seems like completely insane.

Not to me. Sounds like the best path for some scenarios.

Also one more thing and I shut up: it turns out that .NET which is probably one of the highest authorities on windows newline handling has the same behavior as python, this crate and go: https://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Multiline

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

@mitsuhiko Oh interesting. I should have known for Go, but it’s interesting to see that Python doesn’t do it either:

>>> import re
>>> re.match('foo$', 'foo\n', re.MULTILINE) is not None
True
>>> re.match('foo$', 'foo\r\n', re.MULTILINE) is not None
False

So I guess we’re in good company?

But the existing implementation is more wrong, so I’m not sure I understand that as an objection.