runtime: [API Proposal]: Ascii.ToUtf16 overload that treats `\0` as invalid

Background and motivation

ASP.NET Core’s StringUtilities considers only the ASCII values in the open range (0x00, 0x80) valid, while Ascii.ToUtf16 treats the whole ASCII range [0x00, 0x80) as valid. To base StringUtilities on the Ascii APIs and avoid custom vectorized code in the ASP.NET Core internals, it should be possible to treat \0 as invalid. See https://github.com/dotnet/aspnetcore/issues/45962 for further info.
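
To make the gap concrete, here is a small snippet against the existing overload (assumed here to have the ToUtf16(source, destination, out charsWritten) shape planned for .NET 8; the input bytes are made up for illustration). An embedded \0 byte is converted without any error being reported:

    using System;
    using System.Buffers;
    using System.Buffers.Text;

    // A header name containing an embedded NUL byte. For HTTP header parsing it
    // must be rejected, but Ascii.ToUtf16 converts it because 0x00 is valid ASCII.
    ReadOnlySpan<byte> source = "Head\0er"u8;
    Span<char> destination = stackalloc char[source.Length];

    OperationStatus status = Ascii.ToUtf16(source, destination, out int charsWritten);
    Console.WriteLine(status);        // Done
    Console.WriteLine(charsWritten);  // 7 -- the NUL was converted, not flagged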

API Proposal

namespace System.Buffers.Text
{
    public static class Ascii
    {
        // existing methods
+       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
    }
}

The new Ascii APIs will be added in .NET 8, so the optional argument could be added to the planned overload without a breaking change:

namespace System.Buffers.Text
{
    public static class Ascii
    {
        // existing methods
-       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten);
+       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
    }
}

API Usage

    private static unsafe void GetHeaderName(ReadOnlySpan<byte> source, Span<char> buffer)
    {
        OperationStatus status = Ascii.ToUtf16(source, buffer, out _, out _, treatNullAsInvalid: true);

        if (status != OperationStatus.Done)
        {
            KestrelBadHttpRequestException.Throw(RequestRejectionReason.InvalidCharactersInHeaderName);
        }
    }

Alternative Designs

No response

Risks

The value for treatNullAsInvalid will typically be passed as a constant, so the JIT should be able to dead-code eliminate the extra check in the default case (the whole ASCII range including \0), and no performance regression is expected.
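
As a rough illustration of where the flag would sit, here is a scalar sketch of the validation only; the real implementation is vectorized and this is not its code:

    // Hypothetical scalar core of the proposed overload's validation. The real
    // implementation is vectorized; this only shows where the optional check lives.
    private static bool IsAllValidAscii(ReadOnlySpan<byte> source, bool treatNullAsInvalid)
    {
        foreach (byte b in source)
        {
            // With treatNullAsInvalid: false (the default, and a constant visible
            // to the JIT after inlining) the second comparison folds away entirely.
            if (b >= 0x80 || (treatNullAsInvalid && b == 0x00))
            {
                return false;
            }
        }

        return true;
    }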

Besides \0, which would be optionally treated as invalid, I don’t expect any other value to be considered special enough to warrant optional exclusion.

About this issue

  • State: open
  • Created a year ago
  • Comments: 24 (23 by maintainers)

Most upvoted comments

Yeah, I’ll pick it up and see what the team thinks

I am strongly against this proposal. ASCII is defined as characters in the range 0x00 .. 0x7F, inclusive. Sometimes a protocol will exclude certain characters (0x00, or the entire control character range 0x00 .. 0x1F and 0x7F), but at that point you’re making something tied to a particular protocol rather than something that is a general-purpose ASCII API. It’s similar to the reason we don’t support WTF-8 within any of our UTF-8 APIs: certain protocols may utilize it, but it doesn’t belong in a general-purpose UTF-8 processing API.

Since this is protocol-specific for aspnet, I recommend the code remain in that project.

Wouldn’t it be better to treat \0 as special at the highest level, instead of at the lowest level?

For example, I think that when parsing HTTP 1 headers, you could look for \r, \n or \0 as the first step, instead of just \r and \n, and deal with \0 at that point. That would then mean you could safely use the current version of Ascii.ToUtf16 to convert the header bytes to UTF-16.
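
A sketch of that approach (not Kestrel’s actual parser; the SearchValues-based scan and the helper name are just one way to express it):

    using System;
    using System.Buffers;
    using System.Buffers.Text;

    static class HeaderSketch
    {
        // Scan for '\r', '\n' and '\0' up front, reject '\0' there, and then let
        // the existing Ascii.ToUtf16 overload convert the header bytes.
        private static readonly SearchValues<byte> Delimiters =
            SearchValues.Create("\r\n\0"u8);

        public static bool TryConvertHeaderLine(ReadOnlySpan<byte> line, Span<char> buffer, out int charsWritten)
        {
            // 'line' is assumed to already have the trailing CRLF stripped, so any
            // remaining delimiter (including '\0') means the line is malformed.
            if (line.IndexOfAny(Delimiters) >= 0)
            {
                charsWritten = 0;
                return false;
            }

            return Ascii.ToUtf16(line, buffer, out charsWritten) == OperationStatus.Done;
        }
    }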

headers average 800 bytes to 2kB with cookies

Thanks for these numbers! Out of interest: how / where did you get these from?

Googled average headers size 😅

As an anecdote: going to the Google homepage while logged in, my headers are 2158 bytes and the page makes 27 requests to that domain (26 to other domains), so roughly 58 kB of headers in total for that page and one domain.