runtime: [API Proposal]: Ascii.ToUtf16 overload that treats `\0` as invalid
Background and motivation
For ASP.NET Core’s StringUtilities, only the ASCII values in the open range (0x00, 0x80) are considered valid, whilst Ascii.ToUtf16 treats the whole ASCII range [0x00, 0x80), including \0, as valid. In order to base StringUtilities on the Ascii APIs and avoid custom vectorized code in ASP.NET Core internals, it should be possible to treat \0 as invalid. See https://github.com/dotnet/aspnetcore/issues/45962 for further info.
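To make the gap concrete, here is a minimal sketch against the ToUtf16 signature quoted in this proposal (the buffer contents are illustrative):

using System;
using System.Buffers;
using System.Buffers.Text;

// "Host" followed by an embedded NUL byte: valid for today's Ascii.ToUtf16,
// but invalid for ASP.NET Core's StringUtilities.
byte[] source = { (byte)'H', (byte)'o', (byte)'s', (byte)'t', 0x00 };
char[] destination = new char[source.Length];

OperationStatus status = Ascii.ToUtf16(source, destination, out _, out _);
Console.WriteLine(status); // Done: the NUL converts like any other ASCII byte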
API Proposal
namespace System.Buffers.Text
{
public static class Ascii
{
// existing methods
+ public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
}
}
The new Ascii APIs are only being added in .NET 8, so the optional argument can be introduced without a breaking change:
namespace System.Buffers.Text
{
public static class Ascii
{
// existing methods
- public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten);
+ public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
}
}
API Usage
private static void GetHeaderName(ReadOnlySpan<byte> source, Span<char> buffer)
{
OperationStatus status = Ascii.ToUtf16(source, buffer, out _, out _, treatNullAsInvalid: true);
if (status != OperationStatus.Done)
{
KestrelBadHttpRequestException.Throw(RequestRejectionReason.InvalidCharactersInHeaderName);
}
}
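Without the overload, a caller would have to post-validate the converted output. A hedged sketch of that workaround (GetHeaderNameFallback is a hypothetical name), which pays for a second pass over the data:

private static void GetHeaderNameFallback(ReadOnlySpan<byte> source, Span<char> buffer)
{
    // Convert with the existing API, then reject any '\0' in the output.
    // The extra Contains scan is exactly the overhead the proposed flag avoids.
    OperationStatus status = Ascii.ToUtf16(source, buffer, out _, out int charsWritten);
    if (status != OperationStatus.Done || buffer[..charsWritten].Contains('\0'))
    {
        KestrelBadHttpRequestException.Throw(RequestRejectionReason.InvalidCharactersInHeaderName);
    }
}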
Alternative Designs
No response
Risks
The value for treatNullAsInvalid will be passed as a constant, so the JIT should be able to dead-code-eliminate any code needed for the default case (the whole ASCII range, including \0); no performance regression is expected.
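A hedged sketch of the scalar shape this argument assumes (the real conversion loop is vectorized; IsValidAscii is purely illustrative):

// At an inlined call site with a constant argument, the JIT folds the
// treatNullAsInvalid test to a constant: the default path keeps its current
// check and the opt-in path gains a single extra comparison.
static bool IsValidAscii(byte value, bool treatNullAsInvalid)
    => value < 0x80 && !(treatNullAsInvalid && value == 0);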
Besides \0, which would optionally be treated as invalid, I don’t expect any other value to be considered special enough to warrant optional exclusion.
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 24 (23 by maintainers)
Yeah, I’ll pick it up and see what the team thinks
I am strongly against this proposal. ASCII is defined as characters in the range 0x00 .. 0x7F, inclusive. Sometimes a protocol will exclude certain characters (0x00, or the entire control character range 0x00 .. 0x1F and 0x7F), but at that point you’re making something tied to a particular protocol rather than something that is a general-purpose ASCII API. It’s similar to the reason we don’t support WTF-8 within any of our UTF-8 APIs: certain protocols may utilize it, but it doesn’t belong in a general-purpose UTF-8 processing API.
Since this is protocol-specific for aspnet, I recommend the code remain in that project.
Wouldn’t it be better to treat \0 as special at the highest level, instead of at the lowest level? For example, I think that when parsing HTTP 1 headers, you could look for \r, \n or \0 as the first step, instead of just \r and \n, and deal with \0 at that point. That would then mean you could safely use the current version of Ascii.ToUtf16 to convert the header bytes to UTF-16. (A sketch of this approach appears at the end of this section.)
Googled average headers size 😅
As an anecdote: going to the Google homepage while logged in, my headers are 2158 bytes and it makes 27 requests to that domain (26 to other domains); so in total about 58 kB of headers for that page from one domain.
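A sketch of the scan-first approach suggested above, under assumed, simplified helper names (the real Kestrel parser is more involved):

// Find the line terminator and any NUL in a single IndexOfAny pass;
// reject the request if a NUL appears before the end of the line, so the
// subsequent Ascii.ToUtf16 call never sees an embedded \0.
static int FindHeaderLineEnd(ReadOnlySpan<byte> headerBytes)
{
    int index = headerBytes.IndexOfAny((byte)'\r', (byte)'\n', (byte)0);
    if (index >= 0 && headerBytes[index] == 0)
    {
        throw new InvalidOperationException("NUL byte in header line.");
    }
    return index; // offset of '\r' or '\n', or -1 if the line is incomplete
}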