runtime: Make enum RegexParseError and RegexParseException public
Background and Motivation
A regular expression object made with System.Text.Regex is essentially an ad-hoc compiled sub-language that’s widely used in the .NET community for searching and replacing strings. But unlike other programming languages, any syntax error is raised as an ArgumentException. Programmers that want to act on specific parsing errors need to manually parse the error string to get more information, which is error-prone, subject to change and sometimes non-deterministic.
We already have an internal RegexParseException and two properties: Error and Offset, which respectively give an enum of the type of error and the location in the string where the error is located. When presently an ArgumentException is raised, it is in fact a RegexParseException which inherits ArgumentException.
I’ve checked the existing code and I propose we make RegexParseException and RegexParseError public, these are pretty self-describing at the moment, though the enum cases may need better named choices (suggested below) . Apart from changing a few existing tests and adding documentation, there are no substantive changes necessary.
Use cases
- Online regex tools may use the more detailed info to suggest corrections of the regex to users (like: “Did you forget to escape this character?”).
- Debugging experience w.r.t. regular expressions improves.
- Currently, getting col and row requires parsing the string, and isn’t in the string in Framework. Parsing in i18n scenarios is next to impossible, giving an enum and position helps writing better, and deterministic code
- Improve tooling by using the offset to place squiggles under errors in regexes.
- Self-correcting systems may use the extra info to close parentheses or brackets, or fix escape sequences that are incomplete.
- It is simply better to be able to check for explicit errors than the more generic
ArgumentExceptionwhich is used everywhere. - BCL tests on regex errors now uses reflection, this is no longer necessary.
Related requests and proposals
- In this comment (https://github.com/dotnet/runtime/issues/13942#issuecomment-70185134) @terrajobst first proposed to open-up this exception for improving error handling.
- Original proposal by @danmosemsft: https://github.com/dotnet/runtime/issues/1080
- Tooling improvements with squiggles: https://github.com/dotnet/corefx/pull/29178#discussion_r182842595 and https://github.com/dotnet/runtime/issues/372#issuecomment-559245612
Proposed API
The current API already exists but isn’t public. The definitions are as follows:
[Serializable]
- internal sealed class RegexParseException : ArgumentException
+ public class RegexParseException : ArgumentException
{
private readonly RegexParseError _error; // tests access this via private reflection
/// <summary>Gets the error that happened during parsing.</summary>
public RegexParseError Error => _error;
/// <summary>Gets the offset in the supplied pattern.</summary>
public int Offset { get; }
public RegexParseException(RegexParseError error, int offset, string message) : base(message)
{
+ // add logic to test range of 'error' and return UnknownParseError if out of range
_error = error;
Offset = offset;
}
public override void GetObjectData(SerializationInfo info, StreamingContext context)
{
base.GetObjectData(info, context);
info.SetType(typeof(ArgumentException)); // To maintain serialization support with .NET Framework.
}
}
And the enum with suggested names for a more discoverable naming scheme. I followed “clarity over brevity” and have tried to start similar cases with the same moniker, so that an alphabetic listing gives a (somewhat) logical grouping in tooling.
I’d suggest we add a case for unknown conditions, something like UnknownParseError = 0, which could be used if users create this exception by hand with an invalid enum value.
Handy for implementers: Historical view of this prior to 22 July 2020 shows the full diff for the enum field by field. On request, it shows all as an addition diff now, and is ordered alphabetically.
-internal enum RegexParseError
+public enum RegexParseError
{
+ UnknownParseError = 0, // do we want to add this catch all in case other conditions emerge?
+ AlternationHasComment,
+ AlternationHasMalformedCondition, // *maybe? No tests, code never hits
+ AlternationHasMalformedReference, // like @"(x)(?(3x|y)" (note that @"(x)(?(3)x|y)" gives next error)
+ AlternationHasNamedCapture, // like @"(?(?<x>)true|false)"
+ AlternationHasTooManyConditions, // like @"(?(foo)a|b|c)"
+ AlternationHasUndefinedReference, // like @"(x)(?(3)x|y)" or @"(?(1))"
+ CaptureGroupNameInvalid, // like @"(?< >)" or @"(?'x)"
+ CaptureGroupOfZero, // like @"(?'0'foo)" or @("(?<0>x)"
+ ExclusionGroupNotLast, // like @"[a-z-[xy]A]"
+ InsufficientClosingParentheses, // like @"(((foo))"
+ InsufficientOpeningParentheses, // like @"((foo)))"
+ InsufficientOrInvalidHexDigits, // like @"\uabc" or @"\xr"
+ InvalidGroupingConstruct, // like @"(?" or @"(?<foo"
+ InvalidUnicodePropertyEscape, // like @"\p{Ll" or @"\p{ L}"
+ MalformedNamedReference, // like @"\k<"
+ MalformedUnicodePropertyEscape, // like @"\p{}" or @"\p {L}"
+ MissingControlCharacter, // like @"\c"
+ NestedQuantifiersNotParenthesized // @"abc**"
+ QuantifierAfterNothing, // like @"((*foo)bar)"
+ QuantifierOrCaptureGroupOutOfRange,// like @"x{234567899988}" or @"x(?<234567899988>)" (must be < Int32.MaxValue)
+ ReversedCharacterRange, // like @"[z-a]" (only in char classes, see also ReversedQuantifierRange)
+ ReversedQuantifierRange, // like @"abc{3,0}" (only in quantifiers, see also ReversedCharacterRange)
+ ShorthandClassInCharacterRange, // like @"[a-\w]" or @"[a-\p{L}]"
+ UndefinedNamedReference, // like @"\k<x>"
+ UndefinedNumberedReference, // like @"(x)\2"
+ UnescapedEndingBackslash, // like @"foo\" or @"bar\\\\\"
+ UnrecognizedControlCharacter, // like @"\c!"
+ UnrecognizedEscape, // like @"\C" or @"\k<" or @"[\B]"
+ UnrecognizedUnicodeProperty, // like @"\p{Lll}"
+ UnterminatedBracket, // like @"[a-b"
+ UnterminatedComment,
}
* About IllegalCondition, this is thrown inside a conditional alternation like (?(foo)x|y), but appears to never be hit. There is no test case covering this error.
Usage Examples
Here’s an example where we use the additional info to give more detailed feedback to the user:
public class TestRE
{
public static Regex CreateAndLog(string regex)
{
try
{
var re = new Regex(regex);
return re;
}
catch(RegexParseException reExc)
{
switch(reExc.Error)
{
case RegexParseError.TooFewHex:
Console.WriteLine("The hexadecimal escape contains not enough hex characters.");
break;
case RegexParseError.UndefinedBackref:
Console.WriteLine("Back-reference in position {0} does not match any captures.", reExc.Offset);
break;
case RegexParseError.UnknownUnicodeProperty:
Console.WriteLine("Error at {0}. Unicode properties must exist, see http://aka.ms/xxx for a list of allowed properties.", reExc.Offset);
break;
// ... etc
}
return null;
}
}
}
Alternative Designs
Alternatively, we may remove the type entirely and merely throw an ArgumentException. But it is likely that some people rely on the internal type, even though it isn’t public, as through reflection the contextual information can be reached and is probably used in regex libraries. Besides, removing it will make any future improvements in dealing with parsing errors and proposing fixes in GUIs much harder to do.
Risks
The only risk I can think of is that after exposing this exception, people would like even more details. But that’s probably a good thing and only improves the existing API.
Note that:
- Existing code that checks for
ArgumentExceptioncontinues to work. - While debugging, people who see the underlying exception type can now actually use it.
- Existing code using reflection to get to the extra data may or may not not continue to work, it depends on how strict the search for the type is done.
[danmose: made some more edits]
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 4
- Comments: 84 (84 by maintainers)
Video
ArgumentException(orException). To preserve cross-framework deserialization, like when someone deserializes this exception on .NET Framework (or earlier versions of .NET Core), we need to keep serializing asArgumentException, not asRegexParseException.ErrorandOffsetare lost when crossing framework boundaries. Hence, we should also improve the message.We’d welcome other contributions if you’re interested @abelbraaksma . There’s plenty up for grabs. If you want another in regex specifically there are opportunities for perf improvement or features eg https://github.com/dotnet/runtime/issues/25598 (that one I think is a bit involved…)
For posterity: this has now been implemented, details are in the PR: #40902. Thanks to all for reviewing this proposal and for the valuable feedback. Special thanks to @danmosemsft to help get this past the finish line, and assisting with the parts in this process I was yet unfamiliar with.
It was in the nick of time to make it into .NET 5 🥳.
@danmosemsft, thanks for the pointers on those tests! I’m actually not sure I can finish today for the simple reason that I couldn’t get the build succeed (only to find out that I needed to update VS, oops). Behind a relatively slow connection, the download of 5GB for the update can take a while…
Oh, and I get 502’s from
https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/a8a526e9-91b3-4569-ba2d-ff08dbb7c110/nuget/v3/flat2/runtime.win-x86.microsoft.netcore.coredistools/1.0.1-prerelease-00005/runtime.win-x86.microsoft.netcore.coredistools.1.0.1-prerelease-00005.nupkg, but my hope is that they’ll disappear magically.While I think the changes i made are sound (the renaming part, at least), I’d prefer to at least successfully build it locally again before submitting a PR.
Anyway, I expect to submit a working PR somewhere tonight (which may still be day for you 😃. We’ll see how far we get.
@abelbraaksma I don’t know your schedule - we don’t branch until Monday, if that helps.
That’s what that test does, basically
@terrajobst, @pgovind, Thanks, you’re right, I misread: it’s indeed today 😉 (I actually clicked the meeting link and it said there was nobody, I drew the wrong conclusion 😛). As it looks now, I’ll be online by then.
And yes, I’m based in Amsterdam, which is CEST.
You didn’t miss it. It’s tomorrow (Aug 11) 😃
Ha, you’re right, I should’ve said “relatively close” 😃 Even though there’s 14 on that list now, some of them may get punted or new ones added for various reasons. And that’s why it’s hard to pick an exact date where an issue from the backlog will definitely get picked up 😃 Being present is not necessary, but you are of course always welcome to tune it to the meeting stream and if this issue gets picked up, I believe you can chat with the reviewers there? Otherwise, any questions/(changes to the API) that arise will be posted here and we’ll have a chance to review them.
Just FYI, I’ve been dialing into the API review meetings to keep an eye on this when it comes up. It hasn’t come up yet, but it’s close(unless something else takes priority).
@abelbraaksma it’s working now, @terrajobst fixed the sub it was using. thanks for letting us know.
Yup, I’ll take this forward at API review. @terrajobst , any idea when we might be able to squeeze this in?
@pgovind perhaps you can rep this at the API review sometime?
If there’s any chance the exception type could be used on .NET Framework, it should still be
[Serializable], for exceptions-across-AppDomains concerns. If it’s (semi-guaranteed) .NET5±only, then I think we’re now OK with not making it[Serializable].Thanks for writing this up.
Do you have evidence folks need to do this? If not, we might leave them out - that’s up to the API review folks.
For the enum,
Offset should note it’s zero based.
If we have the constructor it should probably verify RegexParseError isn’t out of range, like FileStream does, but that’s implementation. (And offset >= 0)
@CyrusNajmabadi any feedback given you’ve done something similar? specifically on the categories in the enum.