runtime: Make enum RegexParseError and RegexParseException public

Background and Motivation

A regular expression object made with System.Text.Regex is essentially an ad-hoc compiled sub-language that’s widely used in the .NET community for searching and replacing strings. But unlike other programming languages, any syntax error is raised as an ArgumentException. Programmers that want to act on specific parsing errors need to manually parse the error string to get more information, which is error-prone, subject to change and sometimes non-deterministic.

We already have an internal RegexParseException and two properties: Error and Offset, which respectively give an enum of the type of error and the location in the string where the error is located. When presently an ArgumentException is raised, it is in fact a RegexParseException which inherits ArgumentException.

I’ve checked the existing code and I propose we make RegexParseException and RegexParseError public, these are pretty self-describing at the moment, though the enum cases may need better named choices (suggested below) . Apart from changing a few existing tests and adding documentation, there are no substantive changes necessary.

Use cases

  • Online regex tools may use the more detailed info to suggest corrections of the regex to users (like: “Did you forget to escape this character?”).
  • Debugging experience w.r.t. regular expressions improves.
  • Currently, getting col and row requires parsing the string, and isn’t in the string in Framework. Parsing in i18n scenarios is next to impossible, giving an enum and position helps writing better, and deterministic code
  • Improve tooling by using the offset to place squiggles under errors in regexes.
  • Self-correcting systems may use the extra info to close parentheses or brackets, or fix escape sequences that are incomplete.
  • It is simply better to be able to check for explicit errors than the more generic ArgumentException which is used everywhere.
  • BCL tests on regex errors now uses reflection, this is no longer necessary.

Related requests and proposals

Proposed API

The current API already exists but isn’t public. The definitions are as follows:

    [Serializable]
-    internal sealed class RegexParseException : ArgumentException
+    public class RegexParseException : ArgumentException
    {
        private readonly RegexParseError _error; // tests access this via private reflection

        /// <summary>Gets the error that happened during parsing.</summary>
        public RegexParseError Error => _error;

        /// <summary>Gets the offset in the supplied pattern.</summary>
        public int Offset { get; }

        public RegexParseException(RegexParseError error, int offset, string message) : base(message)
        {
+            // add logic to test range of 'error' and return UnknownParseError if out of range
            _error = error;
            Offset = offset;
        }

        public override void GetObjectData(SerializationInfo info, StreamingContext context)
        {
            base.GetObjectData(info, context);
            info.SetType(typeof(ArgumentException)); // To maintain serialization support with .NET Framework.
        }
    }

And the enum with suggested names for a more discoverable naming scheme. I followed “clarity over brevity” and have tried to start similar cases with the same moniker, so that an alphabetic listing gives a (somewhat) logical grouping in tooling.

I’d suggest we add a case for unknown conditions, something like UnknownParseError = 0, which could be used if users create this exception by hand with an invalid enum value.

Handy for implementers: Historical view of this prior to 22 July 2020 shows the full diff for the enum field by field. On request, it shows all as an addition diff now, and is ordered alphabetically.

-internal enum RegexParseError
+public enum RegexParseError
{
+    UnknownParseError = 0,    // do we want to add this catch all in case other conditions emerge?
+    AlternationHasComment,
+    AlternationHasMalformedCondition,  // *maybe? No tests, code never hits
+    AlternationHasMalformedReference,  // like @"(x)(?(3x|y)" (note that @"(x)(?(3)x|y)" gives next error)
+    AlternationHasNamedCapture,        // like @"(?(?<x>)true|false)"
+    AlternationHasTooManyConditions,   // like @"(?(foo)a|b|c)"
+    AlternationHasUndefinedReference,  // like @"(x)(?(3)x|y)" or @"(?(1))"
+    CaptureGroupNameInvalid,           // like @"(?< >)" or @"(?'x)"
+    CaptureGroupOfZero,                // like @"(?'0'foo)" or @("(?<0>x)"
+    ExclusionGroupNotLast,             // like @"[a-z-[xy]A]"
+    InsufficientClosingParentheses,    // like @"(((foo))"
+    InsufficientOpeningParentheses,    // like @"((foo)))"
+    InsufficientOrInvalidHexDigits,    // like @"\uabc" or @"\xr"
+    InvalidGroupingConstruct,          // like @"(?" or @"(?<foo"
+    InvalidUnicodePropertyEscape,      // like @"\p{Ll" or @"\p{ L}"
+    MalformedNamedReference,           // like @"\k<"
+    MalformedUnicodePropertyEscape,    // like @"\p{}" or @"\p {L}"
+    MissingControlCharacter,           // like @"\c"
+    NestedQuantifiersNotParenthesized  // @"abc**"
+    QuantifierAfterNothing,            // like @"((*foo)bar)"
+    QuantifierOrCaptureGroupOutOfRange,// like @"x{234567899988}" or @"x(?<234567899988>)" (must be < Int32.MaxValue)
+    ReversedCharacterRange,            // like @"[z-a]"   (only in char classes, see also ReversedQuantifierRange)
+    ReversedQuantifierRange,           // like @"abc{3,0}"  (only in quantifiers, see also ReversedCharacterRange)
+    ShorthandClassInCharacterRange,    // like @"[a-\w]" or @"[a-\p{L}]"
+    UndefinedNamedReference,           // like @"\k<x>"
+    UndefinedNumberedReference,        // like @"(x)\2"
+    UnescapedEndingBackslash,          // like @"foo\" or @"bar\\\\\"
+    UnrecognizedControlCharacter,      // like @"\c!"
+    UnrecognizedEscape,                // like @"\C" or @"\k<" or @"[\B]"
+    UnrecognizedUnicodeProperty,       // like @"\p{Lll}"
+    UnterminatedBracket,               // like @"[a-b"
+    UnterminatedComment,
}

* About IllegalCondition, this is thrown inside a conditional alternation like (?(foo)x|y), but appears to never be hit. There is no test case covering this error.

Usage Examples

Here’s an example where we use the additional info to give more detailed feedback to the user:

public class TestRE
{
    public static Regex CreateAndLog(string regex)
    {
        try
        {
            var re = new Regex(regex);
            return re;
        }
        catch(RegexParseException reExc)
        {
            switch(reExc.Error)
            {
                case RegexParseError.TooFewHex:
                    Console.WriteLine("The hexadecimal escape contains not enough hex characters.");
                    break;
                case RegexParseError.UndefinedBackref:
                    Console.WriteLine("Back-reference in position {0} does not match any captures.", reExc.Offset);
                    break;
                case RegexParseError.UnknownUnicodeProperty:
                    Console.WriteLine("Error at {0}. Unicode properties must exist, see http://aka.ms/xxx for a list of allowed properties.", reExc.Offset);
                    break;
                // ... etc
            }
            return null;
        }
    }
}

Alternative Designs

Alternatively, we may remove the type entirely and merely throw an ArgumentException. But it is likely that some people rely on the internal type, even though it isn’t public, as through reflection the contextual information can be reached and is probably used in regex libraries. Besides, removing it will make any future improvements in dealing with parsing errors and proposing fixes in GUIs much harder to do.

Risks

The only risk I can think of is that after exposing this exception, people would like even more details. But that’s probably a good thing and only improves the existing API.

Note that:

  • Existing code that checks for ArgumentException continues to work.
  • While debugging, people who see the underlying exception type can now actually use it.
  • Existing code using reflection to get to the extra data may or may not not continue to work, it depends on how strict the search for the type is done.

[danmose: made some more edits]

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 84 (84 by maintainers)

Most upvoted comments

Video

  • The type is currently serializable and people do serialize the exception by catching ArgumentException (or Exception). To preserve cross-framework deserialization, like when someone deserializes this exception on .NET Framework (or earlier versions of .NET Core), we need to keep serializing as ArgumentException, not as RegexParseException.
  • This also means that Error and Offset are lost when crossing framework boundaries. Hence, we should also improve the message.
  • We don’t want to have a separate public type from the internal one, but we should make sure that all enum members are actually used by the implementation/are reachable. If there are unused values, we should remove them.
  • We should probably preserve the current ordering, if only to discourage the urge to make future additions alphabetically sorted.
namespace System.Text.RegularExpressions
{
    [Serializable]
    public sealed class RegexParseException : ArgumentException
    {
        public RegexParseException(RegexParseError error, int offset);
        private RegexParseException(SerializationInfo info, StreamingContext context)
        {
            // It means someone modified the payload.
            throw new NotImplementedException();
        }
        public override void GetObjectData(SerializationInfo info, StreamingContext context)
        {
            // We'll serialize as an instance of ArgumentException
        }
        public RegexParseError Error { get; }
        public int Offset { get; }
    }
    public enum RegexParseError
    {
        Unknown,
        AlternationHasComment,
        AlternationHasMalformedCondition,
        AlternationHasMalformedReference,
        AlternationHasNamedCapture,
        AlternationHasTooManyConditions,
        AlternationHasUndefinedReference,
        CaptureGroupNameInvalid,
        CaptureGroupOfZero,
        ExclusionGroupNotLast,
        InsufficientClosingParentheses,
        InsufficientOpeningParentheses,
        InsufficientOrInvalidHexDigits,
        InvalidGroupingConstruct,
        InvalidUnicodePropertyEscape,
        MalformedNamedReference,
        MalformedUnicodePropertyEscape,
        MissingControlCharacter,
        NestedQuantifiersNotParenthesized,
        QuantifierAfterNothing,
        QuantifierOrCaptureGroupOutOfRange,
        ReversedCharacterRange,
        ReversedQuantifierRange,
        ShorthandClassInCharacterRange,
        UndefinedNamedReference,
        UndefinedNumberedReference,
        UnescapedEndingBackslash,
        UnrecognizedControlCharacter,
        UnrecognizedEscape,
        UnrecognizedUnicodeProperty,
        UnterminatedBracket,
        UnterminatedComment
    }
}

We’d welcome other contributions if you’re interested @abelbraaksma . There’s plenty up for grabs. If you want another in regex specifically there are opportunities for perf improvement or features eg https://github.com/dotnet/runtime/issues/25598 (that one I think is a bit involved…)

For posterity: this has now been implemented, details are in the PR: #40902. Thanks to all for reviewing this proposal and for the valuable feedback. Special thanks to @danmosemsft to help get this past the finish line, and assisting with the parts in this process I was yet unfamiliar with.

It was in the nick of time to make it into .NET 5 🥳.

@danmosemsft, thanks for the pointers on those tests! I’m actually not sure I can finish today for the simple reason that I couldn’t get the build succeed (only to find out that I needed to update VS, oops). Behind a relatively slow connection, the download of 5GB for the update can take a while…

Oh, and I get 502’s from https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/a8a526e9-91b3-4569-ba2d-ff08dbb7c110/nuget/v3/flat2/runtime.win-x86.microsoft.netcore.coredistools/1.0.1-prerelease-00005/runtime.win-x86.microsoft.netcore.coredistools.1.0.1-prerelease-00005.nupkg, but my hope is that they’ll disappear magically.

While I think the changes i made are sound (the renaming part, at least), I’d prefer to at least successfully build it locally again before submitting a PR.

Anyway, I expect to submit a working PR somewhere tonight (which may still be day for you 😃. We’ll see how far we get.

@abelbraaksma I don’t know your schedule - we don’t branch until Monday, if that helps.

That’s what that test does, basically

@terrajobst, @pgovind, Thanks, you’re right, I misread: it’s indeed today 😉 (I actually clicked the meeting link and it said there was nobody, I drew the wrong conclusion 😛). As it looks now, I’ll be online by then.

And yes, I’m based in Amsterdam, which is CEST.

it was only two hours

You didn’t miss it. It’s tomorrow (Aug 11) 😃

Ha, you’re right, I should’ve said “relatively close” 😃 Even though there’s 14 on that list now, some of them may get punted or new ones added for various reasons. And that’s why it’s hard to pick an exact date where an issue from the backlog will definitely get picked up 😃 Being present is not necessary, but you are of course always welcome to tune it to the meeting stream and if this issue gets picked up, I believe you can chat with the reviewers there? Otherwise, any questions/(changes to the API) that arise will be posted here and we’ll have a chance to review them.

Just FYI, I’ve been dialing into the API review meetings to keep an eye on this when it comes up. It hasn’t come up yet, but it’s close(unless something else takes priority).

@abelbraaksma it’s working now, @terrajobst fixed the sub it was using. thanks for letting us know.

@pgovind perhaps you can rep this at the API review sometime?

Yup, I’ll take this forward at API review. @terrajobst , any idea when we might be able to squeeze this in?

@pgovind perhaps you can rep this at the API review sometime?

If there’s any chance the exception type could be used on .NET Framework, it should still be [Serializable], for exceptions-across-AppDomains concerns. If it’s (semi-guaranteed) .NET5±only, then I think we’re now OK with not making it [Serializable].

Thanks for writing this up.

By exposing the constructors, people can overwrite the exception in favor of a more specific one.

Do you have evidence folks need to do this? If not, we might leave them out - that’s up to the API review folks.

For the enum,

  • is there some ordering that makes sense other than semi-random?
  • you might want to do some regularization eg, either “Ref” or “Reference” and not both “Ref” and “ref”.
  • maybe “IncompleteSlashP” should say something about unicode categories instead?
  • “Capnum” is not a publicly meaningful name
  • “Invalid” is probably better than “Illegal” in new API

Offset should note it’s zero based.

If we have the constructor it should probably verify RegexParseError isn’t out of range, like FileStream does, but that’s implementation. (And offset >= 0)

            if (access < FileAccess.Read || access > FileAccess.ReadWrite)
                throw new ArgumentOutOfRangeException(nameof(access), SR.ArgumentOutOfRange_Enum);

@CyrusNajmabadi any feedback given you’ve done something similar? specifically on the categories in the enum.