runtime: System.Xml.XmlDocument.Save() unexpectedly creates UTF-8 files *with BOM*

System.Xml.XmlDocument.Save(), when given a file path to write to, unexpectedly creates UTF-8-encoded files with a BOM if the document’s XML declaration has an encoding="UTF-8" attribute.

(By contrast, the absence of the encoding attribute, or the absence of an XML declaration altogether, causes a BOM-less UTF-8 file to be created, as expected.)

Repro in PowerShell Core (on any supported platform; PowerShell’s [xml] cast instantiates a System.Xml.XmlDocument from the specified XML string):

    ([xml] '<?xml version="1.0" encoding="UTF-8" ?><foo />').Save("$HOME/t.xml")
    Get-Content -AsByteStream $HOME/t.xml | Select-Object -First 3

The above yields:

    239
    187
    191

which are the byte values of the UTF-8 BOM (0xef 0xbb 0xbf).
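For anyone who prefers to reproduce this directly in C# rather than PowerShell, here is a minimal sketch (the file names are just examples; it assumes a .NET 5+ top-level-statements program). It also covers the BOM-less case described above, where the encoding attribute is absent:

    using System;
    using System.IO;
    using System.Xml;

    // Declaration with encoding="UTF-8": Save() emits a UTF-8 BOM.
    var withEncoding = new XmlDocument();
    withEncoding.LoadXml("<?xml version=\"1.0\" encoding=\"UTF-8\" ?><foo />");
    withEncoding.Save("with-encoding.xml");

    // No encoding attribute: Save() produces BOM-less UTF-8, as described above.
    var withoutEncoding = new XmlDocument();
    withoutEncoding.LoadXml("<?xml version=\"1.0\" ?><foo />");
    withoutEncoding.Save("without-encoding.xml");

    Console.WriteLine(BitConverter.ToString(File.ReadAllBytes("with-encoding.xml"), 0, 3));    // EF-BB-BF
    Console.WriteLine(BitConverter.ToString(File.ReadAllBytes("without-encoding.xml"), 0, 3)); // 3C-3F-78 ("<?x")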

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 30 (16 by maintainers)

Most upvoted comments

I agree with the comment at https://github.com/dotnet/runtime/issues/28218#issuecomment-791742745. The default Encoding.UTF8 singleton should be no-BOM, which would match the behavior of creating a fresh new UTF8Encoding() instance. The BOM is a crutch that may have had a purpose 20 years ago when .NET was first introduced, but nowadays it doesn’t serve a legitimate purpose.
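As a quick illustration of that difference (a sketch, not from the thread): the Encoding.UTF8 singleton advertises a 3-byte preamble, while a fresh UTF8Encoding instance does not unless you opt in.

    using System;
    using System.Text;

    Console.WriteLine(Encoding.UTF8.GetPreamble().Length);            // 3 -> EF BB BF is emitted
    Console.WriteLine(new UTF8Encoding().GetPreamble().Length);       // 0 -> no BOM
    Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length);  // 0 -> no BOM, explicit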

Note: readers may still want to honor a BOM if one is present. Changing Encoding.UTF8 to be no-BOM by default should only affect writers, telling them not to emit the BOM when generating new documents.
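A sketch of that reader/writer asymmetry using plain file I/O (the file names are illustrative, not from the thread): a reader keeps detecting and skipping a BOM when one is present, while a BOM-less UTF8Encoding only changes what newly written files look like.

    using System;
    using System.IO;
    using System.Text;

    // An existing file that happens to start with a BOM (written here just for the demo).
    File.WriteAllText("legacy.xml", "<foo />", new UTF8Encoding(true));

    // Reader side: the BOM is still detected and skipped, so callers never see U+FEFF.
    using (var reader = new StreamReader("legacy.xml", Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
    {
        Console.WriteLine(reader.ReadToEnd());                         // <foo />
    }

    // Writer side: a BOM-less encoding means newly written files simply get no BOM.
    File.WriteAllText("new.xml", "<foo />", new UTF8Encoding(false));
    Console.WriteLine(File.ReadAllBytes("new.xml")[0].ToString("X2")); // 3C ('<'), not EF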

Of course, changing Encoding.UTF8 to be no-BOM may ripple throughout the .NET ecosystem, so a change of that magnitude would need to go through some kind of compat review. But I’d be strongly supportive of it under the “standards-compliant” pillar that we have set for ourselves.

@sharwell I’ve personally seen more people having issues with the BOM than with the lack of it, because they didn’t understand it. IMO there are currently more tools/editors that default to UTF-8 than tools that default to a different encoding and have trouble displaying it, though I don’t have specific metrics. Unix tools are a good example: they all assume UTF-8 without a BOM.

Looking at the stats at https://w3techs.com/technologies/history_overview/character_encoding/ms/y, it made more sense not to fix this issue a couple of years ago than it does now. IMO we should try to unify encoding on the internet, and UTF-8 seems most dominant at the moment. If a tool can’t open it or doesn’t default to UTF-8, that’s probably a design flaw, since per the stats it won’t be able to correctly handle the majority of documents on the web.

@sharwell I initially felt similar pushback, and I’ve also had to deal with weird encoding issues in the past, but at the same time:

  • I do not expect any reasonable XML parser to break on a UTF-8 document whose charset is explicitly set to UTF-8
  • The proposed behavior feels more consistent
  • It does in fact make integrating with Unix tools better
  • The suggested solution has a workaround if someone needs the old behavior back

On the other hand, there is:

  • the possibility that people porting older code to a newer version of the framework could break some odd XML parser

Having said that, I do not feel super strongly about fixing this issue, but it feels better to me to fix it than to document the inconsistency as normal behavior.

I think we should remove the BOM and default to UTF-8 for all XML APIs. These days a BOM is rather unexpected, and UTF-8 is what most of the web uses.

/cc @buyaa-n my recommendation is to not change the behavior of this API

I was convinced to update this when we were triaging it in the docs repo; with the above comments, I’m convinced even more that we need this 😄

tagging @GrabYourPitchforks in case he has any concerns

@sharwell, to add to @krwq’s comments:

Concern about Unix tools choking on a UTF-8 BOM was indeed what motivated me to create this issue in the first place.

As for your text-editor concerns: I’ve looked at the behavior of popular editors available on Windows with respect to BOM-less UTF-8:

Atom, Sublime Text 3, Visual Studio Code, Notepad++, Notepad:

  • Correctly read BOM-less UTF-8-encoded files.
  • Create BOM-less UTF-8 files by default (for Notepad, this was different up to at least Windows 7; not sure about Windows 8.1).

(When reading a BOM-less file:

  • Atom and Visual Studio Code blindly assume UTF-8 and therefore potentially misread files.
  • Notepad, Notepad++, and Sublime Text 3 fall back to ANSI encoding, presumably on encountering invalid-as-UTF-8 bytes in the input.)

@mklement0 I’d move the “utf-8” check to the inner if, and that’s a fine fix as long as:

    XmlDocument doc = ...; // construct your doc here
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Encoding = Encoding.UTF8;                       // BOM-emitting UTF-8
    XmlWriter writer = XmlWriter.Create(stream, settings);   // 'stream' is the target Stream
    doc.Save(writer);

will cause the BOM to show up.

Please make sure to add test cases for both of those when creating a fix.
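A rough sketch of what such test cases could check (the structure and names are only illustrative, not from the actual test suite): write the same document through XmlWriter into a MemoryStream with each encoding setting and assert on the leading bytes.

    using System;
    using System.IO;
    using System.Text;
    using System.Xml;

    // Explicit Encoding.UTF8 keeps emitting a BOM (the "as long as" case above)...
    Console.WriteLine(StartsWithUtf8Bom(WriteWith(Encoding.UTF8)));           // True
    // ...while new UTF8Encoding(false) suppresses it (the workaround below).
    Console.WriteLine(StartsWithUtf8Bom(WriteWith(new UTF8Encoding(false)))); // False

    static byte[] WriteWith(Encoding encoding)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<?xml version=\"1.0\" encoding=\"UTF-8\" ?><foo />");

        var settings = new XmlWriterSettings { Encoding = encoding };
        var stream = new MemoryStream();
        using (var writer = XmlWriter.Create(stream, settings))
        {
            doc.Save(writer);   // the writer flushes the encoded bytes (and any BOM) on dispose
        }
        return stream.ToArray();
    }

    static bool StartsWithUtf8Bom(byte[] bytes) =>
        bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;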

I do not know the PowerShell syntax for this, but from the XML perspective this behavior looks correct and expected. If you are able to specify the encoding as new UTF8Encoding(false) (see https://docs.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.-ctor?view=netframework-4.7.2#System_Text_UTF8Encoding__ctor_System_Boolean_ for more details), then I believe the BOM should disappear. Here is an example in C#:

    XmlDocument doc = ...; // construct your doc here
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Encoding = new UTF8Encoding(false);             // false: do not emit a BOM
    XmlWriter writer = XmlWriter.Create(stream, settings);   // 'stream' is the target Stream
    doc.Save(writer);

I’m closing this as by design; if you can still see a BOM after setting the encoding like this, please reopen.
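For completeness, here is that workaround applied end to end to a file path (rather than the undeclared stream variable above); the output path is just an example:

    using System.Text;
    using System.Xml;

    var doc = new XmlDocument();
    doc.LoadXml("<?xml version=\"1.0\" encoding=\"UTF-8\" ?><foo />");

    var settings = new XmlWriterSettings
    {
        Encoding = new UTF8Encoding(false)   // false: do not emit a BOM
    };

    // XmlWriter.Create(string, XmlWriterSettings) writes straight to the file.
    using (var writer = XmlWriter.Create("t.xml", settings))
    {
        doc.Save(writer);
    }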