runtime: System.Xml.XmlDocument.Save() unexpectedly creates UTF-8 files *with BOM*
System.Xml.XmlDocument.Save(), when given a file path to write to, unexpectedly creates UTF-8 encoded files with a BOM if the document’s XML declaration has an encoding="UTF-8" attribute.
(By contrast, the absence of the encoding attribute, or the absence of an XML declaration altogether, causes a BOM-less UTF-8 file to be created, as expected.)
Repro in PowerShell Core (on any supported platform; PowerShell’s [xml] cast instantiates a System.Xml.XmlDocument from the specified XML string):
([xml] '<?xml version="1.0" encoding="UTF-8" ?><foo />').Save("$HOME/t.xml")
Get-Content -AsByteStream $HOME/t.xml | Select-Object -First 3
The above yields:
239
187
191
which are the byte values of the UTF-8 BOM (0xef 0xbb 0xbf).
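For readers who prefer C#, the same repro can be sketched as below. This is a minimal sketch, not part of the original report: the helper name SavedWithBom and the temp-file handling are assumptions made for illustration, and the results described are those reported in this issue.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;

// Hypothetical helper (not part of any .NET API): saves the XML through
// XmlDocument.Save(string) and reports whether the resulting file starts
// with the UTF-8 BOM bytes EF BB BF.
static bool SavedWithBom(string xml)
{
    string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
    try
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        doc.Save(path); // the overload that takes a file path
        byte[] head = File.ReadAllBytes(path).Take(3).ToArray();
        return head.SequenceEqual(new byte[] { 0xEF, 0xBB, 0xBF });
    }
    finally
    {
        File.Delete(path);
    }
}

// Declaration with encoding="UTF-8": the case this issue reports as writing a BOM.
Console.WriteLine(SavedWithBom("<?xml version=\"1.0\" encoding=\"UTF-8\" ?><foo />"));
// Declaration without an encoding attribute: reported as BOM-less.
Console.WriteLine(SavedWithBom("<?xml version=\"1.0\" ?><foo />"));
// No declaration at all: reported as BOM-less.
Console.WriteLine(SavedWithBom("<foo />"));
```

Checking the first three bytes of the file mirrors the Get-Content -AsByteStream check in the PowerShell repro.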
About this issue
- Original URL
- State: open
- Created 6 years ago
- Comments: 30 (16 by maintainers)
I agree with the comment at https://github.com/dotnet/runtime/issues/28218#issuecomment-791742745. The default Encoding.UTF8 singleton should be no-BOM, which would match the behavior of creating a fresh new UTF8Encoding() instance. The BOM is a crutch that may have had a purpose 20 years ago when .NET was first introduced, but nowadays it doesn’t serve a legitimate purpose.
Note: readers may still want to honor a BOM if present. Changing Encoding.UTF8 to be no-BOM by default should only affect writers, telling them not to emit the BOM when generating new documents.
Of course, changing Encoding.UTF8 to be no-BOM may ripple throughout the .NET ecosystem, so a change of that magnitude would need to go through some kind of compat review. But I’d be strongly supportive of it under the “standards-compliant” pillar that we have set for ourselves.
@sharwell I’ve personally seen more people having issues with the BOM than with the lack of it, because they didn’t understand it. IMO there are currently more tools/editors that default to UTF-8 than tools that default to a different encoding and would have issues displaying it, but I do not have specific metrics. Unix tools are a good example: they all assume UTF-8 without a BOM.
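The difference between the Encoding.UTF8 singleton and a fresh new UTF8Encoding() instance mentioned above can be seen by comparing their preambles. The following is a small illustrative sketch; the variable names are arbitrary.

```csharp
using System;
using System.Text;

// The shared singleton is configured to emit the UTF-8 identifier (BOM).
byte[] singleton = Encoding.UTF8.GetPreamble();

// A freshly constructed UTF8Encoding does not emit a BOM by default.
byte[] fresh = new UTF8Encoding().GetPreamble();

// The same no-BOM behavior, requested explicitly via the constructor argument.
byte[] explicitNoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetPreamble();

Console.WriteLine(BitConverter.ToString(singleton)); // EF-BB-BF
Console.WriteLine(fresh.Length);                     // 0
Console.WriteLine(explicitNoBom.Length);             // 0
```

Writers such as StreamWriter emit whatever the encoding's preamble is at the start of a new file, which is why the choice of instance matters for writers but not for decoding.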
Looking at the stats (https://w3techs.com/technologies/history_overview/character_encoding/ms/y), it made more sense not to fix this issue a couple of years ago than it does now. IMO we should try to unify encoding on the internet, and UTF-8 seems the most dominant at the moment. If a tool can’t open such a file or doesn’t default to UTF-8, it’s probably a misdesign: per the stats, it won’t be able to correctly handle the majority of the documents on the web.
@sharwell I initially felt similar push back and I’ve also had to deal with weird encoding issues in the past but at the same time:
on the other hand there is:
Having said that I do not feel super strongly about fixing this issue but it feels better to me to fix it than to document the inconsistency as normal behavior.
I think we should remove the BOM and default to UTF-8 for all XML APIs. I think these days a BOM is rather unexpected, and UTF-8 is what most of the web uses.
I was convinced to update this when we were triaging this in the docs repo; with the above comments, I’m convinced even more that we need this 😄
Tagging @GrabYourPitchforks in case he has any concerns.
@sharwell, to add to @krwq’s comments:
Concern about Unix tools choking on a UTF-8 BOM was indeed what motivated me to create this issue in the first place.
As for your text-editor concerns: I’ve looked at the behavior of popular editors available on Windows with respect to BOM-less UTF-8:
Atom, Sublime Text 3, Visual Studio Code, Notepad++, Notepad:
(When reading a BOM-less file:
@mklement0 I’d move the “utf-8” check to the inner if, and that’s a fine fix as long as: will cause the BOM to show up.
Please make sure to add a test case with both of those when creating a fix.
I’ve created a docs issue: https://github.com/dotnet/docs/issues/10326
I do not know the PowerShell syntax for this, but from the XML perspective this behavior looks correct and expected. If you are able to specify the encoding as new UTF8Encoding(false) (see https://docs.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.-ctor?view=netframework-4.7.2#System_Text_UTF8Encoding__ctor_System_Boolean_ for more details), then I believe the BOM should disappear. Here is the example in C#:
I’m closing this as by design; if you can still see the BOM after setting the encoding like this, please reopen.
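The C# example from this comment is not preserved here. One way the suggested approach could look is sketched below. It is not the commenter's original code: the output path and the choice of XmlWriter.Create with XmlWriterSettings are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\" encoding=\"UTF-8\" ?><foo />");

// Hypothetical output location, chosen only for this sketch.
string path = Path.Combine(Path.GetTempPath(), "t.xml");

// Ask the writer for UTF-8 without the identifier (BOM), instead of letting
// Save(string) pick the encoding from the XML declaration.
var settings = new XmlWriterSettings
{
    Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false)
};

using (XmlWriter writer = XmlWriter.Create(path, settings))
{
    doc.Save(writer);
}

// Inspect the first three bytes to see whether a BOM was written.
byte[] bytes = File.ReadAllBytes(path);
bool hasBom = bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
Console.WriteLine(hasBom);
```

Passing the encoding explicitly sidesteps the declaration-driven choice made by Save(string), at the cost of every caller having to know about it, which is the inconsistency the rest of the thread debates.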