runtime: [API Proposal]: Ascii.ToUtf16 overload that treats `\0` as invalid
Background and motivation
For ASP.NET Core’s StringUtilities only the ASCII values in the open range (0x00, 0x80) are considered valid, whilst Ascii.ToUtf16 treats the whole ASCII range [0x00, 0x80) as valid. In order to base StringUtilities on the Ascii APIs and avoid custom vectorized code in ASP.NET Core’s internals, it should be possible to treat \0 as invalid. See https://github.com/dotnet/aspnetcore/issues/45962 for further info.
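To illustrate the difference, here is a minimal sketch (a hypothetical snippet, using the existing ToUtf16 shape as listed in the proposal below): a buffer containing a NUL byte converts successfully today, even though ASP.NET Core’s header parsing has to reject it.

using System.Buffers;
using System.Buffers.Text;

byte[] source = { (byte)'H', (byte)'o', (byte)'s', (byte)'t', 0x00 };
char[] destination = new char[source.Length];

// The NUL byte is accepted as valid ASCII by the current API,
// so status is OperationStatus.Done and destination ends with '\0'.
OperationStatus status = Ascii.ToUtf16(source, destination, out _, out _);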
API Proposal
namespace System.Buffers.Text
{
    public static class Ascii
    {
        // existing methods

+       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
    }
}
The new Ascii APIs are only getting added in .NET 8, so an optional argument can be added without a breaking change:
namespace System.Buffers.Text
{
    public static class Ascii
    {
        // existing methods

-       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten);
+       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
    }
}
API Usage
private static unsafe void GetHeaderName(ReadOnlySpan<byte> source, Span<char> buffer)
{
    OperationStatus status = Ascii.ToUtf16(source, buffer, out _, out _, treatNullAsInvalid: true);

    if (status != OperationStatus.Done)
    {
        KestrelBadHttpRequestException.Throw(RequestRejectionReason.InvalidCharactersInHeaderName);
    }
}
Alternative Designs
No response
Risks
The value for treatNullAsInvalid will be passed as a constant, so the JIT should be able to dead-code eliminate any code needed for the default case (the whole ASCII range, including \0); therefore no perf regression is expected.

Besides \0, which is optionally treated as invalid, I don’t expect any other value to be considered special enough for optional exclusion.
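As a rough illustration of the dead-code-elimination argument (a minimal sketch, not the actual Ascii implementation; the helper name is hypothetical): when the flag is a compile-time constant at the call site and the check is inlined, the JIT can fold the comparison away for the default case.

using System.Runtime.CompilerServices;

internal static class AsciiValiditySketch
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static bool IsValidAsciiValue(byte value, bool treatNullAsInvalid)
    {
        // For the default case (treatNullAsInvalid == false) this branch is
        // constant-false after inlining and gets eliminated by the JIT.
        if (treatNullAsInvalid && value == 0)
        {
            return false;
        }

        return value < 0x80;
    }
}

The vectorized paths could follow the same pattern, so that only the non-default case pays for the additional comparison.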
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 24 (23 by maintainers)
Yeah, I’ll pick it up and see what the team thinks
I am strongly against this proposal. ASCII is defined as characters in the range 0x00 .. 0x7F, inclusive. Sometimes a protocol will exclude certain characters (0x00, or the entire control character range 0x00 .. 0x1F and 0x7F), but at that point you’re making something tied to a particular protocol rather than something that is a general-purpose ASCII API. It’s similar to the reason we don’t support WTF-8 within any of our UTF-8 APIs: certain protocols may utilize it, but it doesn’t belong in a general-purpose UTF-8 processing API.

Since this is protocol-specific for aspnet, I recommend the code remain in that project.
Wouldn’t it be better to treat \0 as special at the highest level, instead of at the lowest level?

For example, I think that when parsing HTTP/1 headers, you could look for \r, \n or \0 as the first step, instead of just \r and \n, and deal with \0 at that point. That would then mean you could safely use the current version of Ascii.ToUtf16 to convert the header bytes to UTF-16.
Googled average headers size 😅

As an anecdote, going to the Google homepage while logged in, my headers are 2158 bytes and it makes 27 requests to that domain (26 to other domains); so in total 58 kB for that page and one domain.