runtime: Breaking change proposal: Encoding.UTF8 singleton should not have a BOM

tl;dr

The Encoding.UTF8 singleton currently says “please emit a BOM when writing.” This is an anachronism. Nowadays, it should say “please do not emit a BOM when writing.”

The Encoding.UTF8 singleton should continue to perform U+FFFD substitution on invalid subsequences, just as it does today.

Discussion

More information: https://github.com/dotnet/standard/issues/260, https://github.com/dotnet/runtime/issues/7779, with further discussion at https://github.com/dotnet/runtime/issues/28218

Historically, the Encoding.UTF8 singleton has been equivalent to new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false). These types were introduced during a period when multiple different encodings were commonplace and the world hadn’t yet settled on UTF-8 as the de facto standard. Now, 20 years later, UTF-8 has cemented its place as the true winner, and many tools across Unix and Windows natively operate on UTF-8. But as mentioned in the above linked issues, these tools can fail if they encounter a BOM at the start of the data.
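
As a rough illustration of that equivalence, the following C# sketch compares the preamble reported by the singleton with those of explicitly constructed instances. The variable names are illustrative only, and the output shown for Encoding.UTF8 reflects the current behavior, not the behavior this proposal asks for.

```csharp
using System;
using System.Text;

// The singleton as it behaves today: a 3-byte UTF-8 BOM is advertised as the preamble.
byte[] singletonPreamble = Encoding.UTF8.GetPreamble();

// Explicitly constructed instances, with and without the BOM.
var withBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false);
var withoutBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: false);

Console.WriteLine(BitConverter.ToString(singletonPreamble));      // EF-BB-BF (current behavior)
Console.WriteLine(BitConverter.ToString(withBom.GetPreamble()));  // EF-BB-BF
Console.WriteLine(withoutBom.GetPreamble().Length);               // 0

// With throwOnInvalidBytes: false, the invalid byte 0xFF decodes to U+FFFD instead of throwing.
string decoded = withoutBom.GetString(new byte[] { 0x41, 0xFF, 0x42 });
Console.WriteLine(decoded == "A\uFFFDB");                         // True
```

Under the proposal, only the first line of output would change: the singleton would report an empty preamble, while the U+FFFD substitution would stay as it is.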

The Unicode maintainers have also discussed recommending against the use of BOMs by default unless explicitly required by the protocol or file format.

https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf (guidance being drafted, but not yet adopted)
https://corp.unicode.org/pipermail/unicode/2020-October/009070.html (previous discussion on this issue which led to above draft guidance)
https://corp.unicode.org/pipermail/unicode/2020-June/008713.html (earlier discussion on this issue)

This would be a breaking change. However, this breaking change should be an overall net positive for the ecosystem because it would prevent our writers from emitting bytes which many tools do not properly discard upon read. We have a history of making breaking changes in this area for .NET Core to assist with interoperability. For example, we changed Encoding.Default to be UTF-8 without a BOM across all OSes. We also changed UTF8Encoding to be more standards-compliant when replacing ill-formed input sequences with U+FFFD characters.

Parsers can still opt to honor BOMs at the beginning of files opened for read. Nothing in this proposal discourages readers from parsing the first few bytes and selecting an appropriate Encoding based on that data.
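
One way a reader can keep honoring BOMs is StreamReader's detectEncodingFromByteOrderMarks option. The sketch below is only an illustration: the ReadText and WithPreamble helpers and the sample text are hypothetical, and BOM-less UTF-8 is assumed as the fallback when no BOM is found.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

const string expected = "h\u00E9llo";

// Decode with BOM detection enabled, falling back to BOM-less UTF-8 when no BOM is present.
string ReadText(byte[] bytes)
{
    using var reader = new StreamReader(
        new MemoryStream(bytes),
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false),
        detectEncodingFromByteOrderMarks: true);
    return reader.ReadToEnd();
}

// Hypothetical inputs: the encoding's preamble (if any) followed by the encoded text.
byte[] WithPreamble(Encoding e) => e.GetPreamble().Concat(e.GetBytes(expected)).ToArray();

string fromUtf16 = ReadText(WithPreamble(Encoding.Unicode));
string fromUtf8Bom = ReadText(WithPreamble(new UTF8Encoding(encoderShouldEmitUTF8Identifier: true)));
string fromUtf8NoBom = ReadText(WithPreamble(new UTF8Encoding(encoderShouldEmitUTF8Identifier: false)));

// All three inputs decode to the same text; the BOM, when present, is consumed by the reader.
Console.WriteLine(fromUtf16 == expected && fromUtf8Bom == expected && fromUtf8NoBom == expected);
```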

This proposal does not suggest changing the BOM behavior for Encoding.UTF32, Encoding.Unicode, or other built-in singletons. For writers which query the preamble before writing text, it is useful for these writers to continue to emit a “this data is not UTF-8!” marker before the bytestream. This should help preserve compatibility in the less-common scenarios where people want to continue writing XML files as UTF-16.
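
For reference, the preambles that the untouched singletons advertise can be inspected as in this small sketch. The variable names are illustrative only.

```csharp
using System;
using System.Text;

// Preambles advertised by singletons this proposal leaves unchanged.
string utf16Le = BitConverter.ToString(Encoding.Unicode.GetPreamble());
string utf16Be = BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble());
string utf32Le = BitConverter.ToString(Encoding.UTF32.GetPreamble());

Console.WriteLine($"Encoding.Unicode:          {utf16Le}");  // FF-FE
Console.WriteLine($"Encoding.BigEndianUnicode: {utf16Be}");  // FE-FF
Console.WriteLine($"Encoding.UTF32:            {utf32Le}");  // FF-FE-00-00
```

A writer that queries GetPreamble() before writing would keep emitting these markers, so readers can still tell such data apart from UTF-8.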

About this issue

State: open
Created 3 years ago
Reactions: 68
Comments: 23 (20 by maintainers)

Links to this issue

Powershell saving XML and preserving format - Stack Overflow

Most upvoted comments

I am a supporter of this proposal. We need to provide a config switch to go back to old behavior if needed.

tarekgh on Apr 15, 2021

@krwq The idea is that all UTF-8 factories hanging off Encoding will be no-BOM, unless the caller calls new UTF8Encoding(true).

GrabYourPitchforks on May 10, 2021

It seems confusing if Encoding.UTF32 and Encoding.Unicode emit BOM, but Encoding.UTF8 does not.

UTF32 and UTF16 need a Byte Order Mark to indicate the endianness of the data apart from anything else; UTF8 doesn’t have any endianness so doesn’t require it for that purpose.

Also while ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added.

benaadams on Apr 16, 2021
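
The ASCII-compatibility point in the comment above can be checked with a short sketch. The Emit helper is hypothetical; it mimics a writer that emits the encoding's preamble before the encoded text.

```csharp
using System;
using System.Linq;
using System.Text;

const string text = "Hello";
byte[] ascii = Encoding.ASCII.GetBytes(text);

// What a preamble-honoring writer would emit: the preamble first, then the encoded text.
byte[] Emit(Encoding encoding) => encoding.GetPreamble().Concat(encoding.GetBytes(text)).ToArray();

byte[] utf8NoBom = Emit(new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
byte[] utf8WithBom = Emit(new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));

Console.WriteLine(utf8NoBom.SequenceEqual(ascii));    // True: byte-for-byte identical to ASCII
Console.WriteLine(utf8WithBom.SequenceEqual(ascii));  // False: EF BB BF precedes the text
```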

IMO this is such a significant (and difficult to discover) breaking change that “provide a config switch to go back to old behavior” is not sufficient. I propose:

obsolete Encoding.UTF8 with warning
add Encoding.UTF8IncludingBom and Encoding.UTF8ExcludingBom (or equivalent)

SimonCropp on Apr 16, 2021

We discussed a little bit internally the idea of having Encoding.UTF8NoBOM as a first-class citizen alongside Encoding.UTF8. That suggestion has come up a few times in this thread as well.

I’m not sold on that as a good long-term solution. The spirit of this work item is that we want to reduce the number of developers who are exposed to the concept of a BOM. By having static factories for “with BOM” and “without BOM”, we’d be foisting this concept upon every developer who starts typing Encoding.* in their code editor. A developer who is well-versed in these concepts can quickly and correctly answer the question of “do I want a BOM or not?”, but for the majority of the developer audience these terms would be unfamiliar and they wouldn’t know how to answer the question. Ultimately I think exposing these concepts on a primary API would result in a poorer user experience than exists today.

GrabYourPitchforks on Apr 17, 2021

Good catch, @krwq: Encoding.GetEncoding("utf-8") does emit a BOM (though you wouldn’t be able to tell from the description of the .Preamble property, which doesn’t mention .GetEncoding() and explicitly states that only System.Text.Encoding.UTF8 and a UTF8Encoding constructor call with encoderShouldEmitUTF8Identifier set to true result in a BOM).

By contrast, System.Text.Encoding.GetEncoding(0), documented to return UTF-8 in .NET Core, does not emit a BOM.

(A systematic review of the docs with respect to recommending / discouraging a UTF-8 BOM is called for either way, as certain pages contradict each other.)

I think consistency is called for, and my vote is to consistently default to BOM-less UTF-8 and only ever return a with-BOM instance if explicitly requested.

While undoubtedly a breaking change, @GrabYourPitchforks has already made compelling (to me) arguments for it in the initial post; let me add a few points:

The Unix world moved to BOM-less UTF-8 a long time ago, and it is primarily tools with a Unix heritage that do not expect a BOM and, in the presence of one, either choke or misinterpret the BOM as part of the data.
The Windows world is undoubtedly moving towards assuming UTF-8 in the absence of a BOM as well:
- The major (cross-platform) text editors nowadays write and read BOM-less UTF-8 by default - see https://github.com/dotnet/runtime/issues/28218#issuecomment-795925634 for an overview.
- PowerShell Core (the cross-platform edition built on .NET Core / 5+) too uses BOM-less UTF-8 as its consistent default, both when reading its source code and in its file-processing cmdlets.
- Node.js (node.exe) - and others? - have chosen to “speak” (BOM-less) UTF-8 by default, irrespective of the active OEM code page as determined by the system locale (aka language for non-Unicode programs).
  - Python chose a different (also nonstandard) approach, defaulting to the active ANSI code page even when called from the console, rather than the OEM code page console applications are expected to use. However, it is trivial to configure Python to use (BOM-less) UTF-8 instead, via an environment variable (PYTHONUTF8) or, situationally, via a CLI parameter (-X utf8).
- Windows 10 now offers a - still-in-beta as of this writing - feature to switch to (BOM-less) UTF-8 system-wide, by setting the system locale so that both the OEM and the ANSI code pages use code page 65001, i.e. UTF-8; see this Stack Overflow answer for details and a discussion of the ramifications.
  - With this configuration:
    - Even Windows PowerShell and Python, for instance, then default to BOM-less UTF-8 (since the ANSI code page is then effectively UTF-8).
    - So will all conventional console applications that use the OEM code page, with the caveat that legacy applications that aren’t equipped to handle the variable-length aspect of UTF-8 encoding malfunction.
- Last but not least: .NET’s own default encoding for its System.IO APIs has - commendably - been BOM-less UTF-8 since v1.

mklement0 on May 10, 2021
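
The GetEncoding observation at the start of the comment above can be reproduced with a sketch like the following. It assumes .NET Core / .NET 5+ with no custom encoding providers registered; on .NET Framework, GetEncoding(0) returns the ANSI code page instead, so the second line would differ there.

```csharp
using System;
using System.Text;

// Look up UTF-8 by name and via code page 0 (the default encoding on .NET Core / 5+).
Encoding byName = Encoding.GetEncoding("utf-8");
Encoding byZero = Encoding.GetEncoding(0);

int byNamePreambleLength = byName.GetPreamble().Length;
int byZeroPreambleLength = byZero.GetPreamble().Length;

// A non-zero length means a preamble-honoring writer would emit a BOM for this instance.
Console.WriteLine($"GetEncoding(\"utf-8\"): {byName.WebName}, preamble length {byNamePreambleLength}");
Console.WriteLine($"GetEncoding(0):       {byZero.WebName}, preamble length {byZeroPreambleLength}");
```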

I wonder if default encoding can be changed from ANSI to UTF8, as a breaking change. In practice, especially for East Asian users, ANSI codepages are much more annoying than BOM.

huoyaoyuan on Apr 16, 2021

Ah, I see. GetBytes never returns a BOM, but a StreamWriter will first write the result of GetPreamble and then the result of GetBytes.

heemskerkerik on Apr 16, 2021
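
The difference between GetBytes and a StreamWriter described in the comment above can be sketched as follows. The encoding is constructed explicitly with the BOM enabled, which matches what Encoding.UTF8 does today; the variable names are illustrative only.

```csharp
using System;
using System.IO;
using System.Text;

// Same configuration as today's Encoding.UTF8 singleton.
var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);

// GetBytes encodes only the text; it never prepends the preamble.
byte[] viaGetBytes = utf8.GetBytes("Hi");

// StreamWriter writes GetPreamble() first (at the start of the stream), then the encoded text.
using var stream = new MemoryStream();
using (var writer = new StreamWriter(stream, utf8, bufferSize: 1024, leaveOpen: true))
{
    writer.Write("Hi");
}
byte[] viaWriter = stream.ToArray();

Console.WriteLine(BitConverter.ToString(viaGetBytes));  // 48-69
Console.WriteLine(BitConverter.ToString(viaWriter));    // EF-BB-BF-48-69
```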

I am confused. The documentation says that it does include a BOM. Which method does that? When I try GetBytes, it doesn’t appear to return a BOM:

var b = System.Text.Encoding.UTF8.GetBytes("Hello world");

foreach (byte bb in b)
    Console.WriteLine(bb.ToString("X2"));

This outputs (on .NET 5):

heemskerkerik on Apr 16, 2021