runtime: CultureInfo.TextInfo.ListSeparator broken in .Net 5
As part of updating of updating the Csv class in Signum.Utilities to .Net 5, I’ve realize that the value for ListSeparator
has changed for many cultures that use ,
instead of .
for decimals separators.
foreach (var ci in new [] { "", "en", "en-GB", "en-US", "es", "es-ES", "de", "de-DE", "fr", "fr-FR" })
{
Console.WriteLine($"Culture {ci}\tListSeparator {CultureInfo.GetCultureInfo(ci).TextInfo.ListSeparator}");
}
netcoreapp3.1
Culture ListSeparator ,
Culture en ListSeparator ,
Culture en-GB ListSeparator ,
Culture en-US ListSeparator ,
Culture es ListSeparator ;
Culture es-ES ListSeparator ;
Culture de ListSeparator ;
Culture de-DE ListSeparator ;
Culture fr ListSeparator ;
Culture fr-FR ListSeparator ;
net5 rc2
Culture ListSeparator ,
Culture en ListSeparator ,
Culture en-GB ListSeparator ,
Culture en-US ListSeparator ,
Culture es ListSeparator .
Culture es-ES ListSeparator .
Culture de ListSeparator .
Culture de-DE ListSeparator .
Culture fr ListSeparator
Culture fr-FR ListSeparator
Both tests in the same machine.
Previously I was using ListSeparator
as a way to detect the separator that will be used in a Csv file. Is this the intended purpose?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 61 (37 by maintainers)
Commits related to this issue
- temp fix ListSeparator https://github.com/dotnet/runtime/issues/43795 — committed to signumsoftware/framework by olmobrutall 4 years ago
Sorry @tarekgh, but to me this surely looks like a bug. If you look at the relevant code at https://github.com/dotnet/runtime/blob/0e402bcd28f38a3b65720474dcd7a338cab0841d/src/libraries/Native/Unix/System.Globalization.Native/pal_localeStringData.c#L246 you can see that it isn’t even implemented (falls through to ThousandsSeparator). So either this is a bug in .Net Core or the info is generally not available in ICU, which would then still be a bug (And would not be an issue for CLDR if it is not part of it). As native german I can tell that using . as a list separator does not make any sense.
Windows PowerShell 5.1 documents the use of ListSeparator in CSV, and PowerShell 7 continues that; see the
-UseCulture
switch in ConvertFrom-Csv, ConvertTo-Csv, Export-Csv, and Import-Csv. I don’t think it has been filed as a bug, although the behavior is mentioned in https://github.com/PowerShell/PowerShell/issues/11754.I have been using ListSeparator for human-readable output, never for parsing. I suspect that hardcoding
","
or", "
would be acceptable for more cultures than copying the ThousandsSeparator.I won’t repeat myself about the sources for the problem but in the end I see three issues:
Issue 1: The current output is unreasonable for major parts of the world. This is NOT complaining that it would have changed if it would have changed to something reasonable. But it did not. At least not for several/most non-en cultures.
Issue 2: Usage as a CSV separator: @tarekgh understands that “people used this APIs in wrong way when used it for CSV”. It doesn’t seem to matter that people used ListSeparator for over 2 decades for this purpose. Excel uses it, SPSS uses it, Powershell uses it (and yes I know NOT for 20 years), basically every general purpose CSV processor uses and/or supports it but they are all doing it wrong. And they all did it wrong for 20 years and more and didn’t even notice. But irony aside: This is a special case of Issue 1: The current output is unreasonable for use as a CSV separator for major/relevant parts of the cultures.
Issue 3: Usage for Excel-COM-Interop: That would really be an issue if ICU defined a list separator that was differing from the windows one. However it doesn’t even define one, so this seems pretty made up. And IF it defined one I would be pretty condident that it would at least be some reasonable one for the corresponding culture.
And just something that made me smile (no good smile): @tarekgh s “I was trying to say neither .NET nor Windows documented anything saying that.” With that argument you can also say that you must not use a String to store song lyrics because I bet that “neither .NET nor Windows documented anything saying that.”
@tarekgh what’s wrong with Number symbols list element (https://unicode.org/reports/tr35/tr35-numbers.html#Number_Symbols)? Because for the few European cultures I checked that have ‘;’ as a list separator on Windows, this property is also ‘;’ in CLDR.
Also, the list pattern you mention is a generic list separator for itemizing AFAICT, whereas sList/ListSeparator is used to group numbers (where ‘,’ is not possible for European cultures, due to that character already being the decimal separator – and I can attest to this European usage of ”;” in case of ambiguities in case of writing number sequences and intervals).
The fact that all other characters in the Numbers section in the regional settings also correspond to the symbols section from CLDR also points at that particular solution.
Yes, ListSeparatorn is wonderfully vague on what kind of lists it is talking about, but judging from the context of the regional settings and its use by lots of software this way (Excel, CSV), I think the list symbol makes much more sense, and is probably mostly backward compatible with NLS data.
FWIW, you can also find that data here: https://github.com/unicode-org/icu/tree/master/icu4c/source/data/locales
i.e, for en locale: https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/en.txt#L2014-L2019
Every locale has a properties listed under
listPattern
section, something likeYou can reach the data if you download the CLDR http://unicode.org/Public/cldr/38/cldr-common-38.0.zip then open the
main
folder which include locale files. You can look at root locale files (e.g.en.xml
) to find all properties including the list pattern.What I actually did is I called ICU APIs that format the list and then I extracted the separator for every locale from the API output. let me know if you need more details.
Just to update, I’ll try to look at getting some fix in the 5.0 servicing release (something like 5.0.1) so we don’t have to wait to 6.0.
I think that focusing on Excel and CSV files all the time muddles the issue: using the thousands separator as a list separator is the worst possible fallback that could have been used. Hard-coding a comma would even be better (albeit not what Excel would expect).
A “.” or non-breaking space is never a list separator, AFAICT. The choice for the thousands separator works exclusively for locales using English-based number formatting, that happens to use the comma, almost by accident.
As much as I hate this option but even throwing a PlatformNotSupportedException would be far better than the current situation. For a short term personally I’d be pragmatic and try to find a heuristic that at least handles the 95% cases (Like all EN-Cultures get “,”, all others get “;”).
I have seen three types of usage in our code and the current situation breaks all of them:
User-(UI)-Text output of list elements. This is generally the most unproblematic, however would be very annoying because e.g. “.” is in NO culture worldwide a reasonable list separator, however several cultures (like DE) deliver that. Imagine: “Data: 34,43.34.42,43”
CSV Export/Import: CSV formats are usually/always culture specific. Using the ListSeparator is a reasonable default that will work in the majority of cases wordwide (or at least in countries for which we develop software).
3a) Excel-“Interfacing” through “excel.exe export.csv”: This currently works flawlessly because excel uses the same system list separator. This would be very inconvenient to implement yourself because it would require native dependecies to get user locale overrides (which excel also uses).
3b) Excel-Interfacting though COM-Automation: Although Excel offers functions to get internationalization these are known to not work in lots of situations (e.g. multilanguage Excel install but lots more) whereas list separator basically always works (unless VERY stupid user locale overrides, in which Excel itself will not work correctly anyways)
Considering the “workarounds”:
If I didn’t hear from you, I’ll go with the NLS data to keep the compatibility and in the future releases we can expose new APIs if needed for list patterns.
@rubenprins @KalleOlaviNiemitalo TextInfo.ListSeparator is not for numbers only. it can separate any kind of data. This is why list pattern would fit better. Also if you look at https://unicode.org/reports/tr35/tr35-numbers.html#Number_Symbols it is clearly says
Which means this number separator is not for linguistic while TextInfo.ListSeprator is a cultural property which should be linguistic. That is exactly match the list pattern in CLDR.
As @KalleOlaviNiemitalo indicated, even CLDR number separator will not be compatible with NLS. If really non-linguistic separator needed here, then I would say let it be
,
for all locales which I think will make CSV/Excel scenarios less broken.To reiterate on CSV/Excel, I am still not seeing a way can guarantee ListSeparator will make this scenario reliable. This is wrong usage of ListSeparator and apps should depend on the real Excel doc separator. We’ll keep ListSeparator read the user override for the current culture so most of the scenarios (as @KalleOlaviNiemitalo thankfully pointed at) like PoswerShell and COM interop will continue work as it used to be.
If people prefer the NLS compatibility, we can keep this but this is the least preferable to me because this is Windows specific and also this data can change at any time too in Windows. I am still thinking the data I mentioned https://github.com/dotnet/runtime/issues/43795#issuecomment-722049526 would be the most sensible data we can return for this separator. what you think?
CLDR 38 has a semicolon
;
without spaces as the number list separator in en.xml, and en_US.xml does not override it. So it’s not the same as the Windows NLS data, which has used a comma,
for en-US.Anyway, I now think the CLDR number list separator is a better fit for TextInfo.ListSeparator than what one could extract from list patterns, because it does not end with a space. An API for linguistic list formatting using CLDR list patterns could then be added in .NET 6 or 7 as a separate issue.
(The country-dependent list separator was apparently added to INT 21H AH=38H in MS-DOS 3.0, and only had room for one byte of text and a null character.)
@danmosemsft I added Preview 4 section in the doc and listed this separator issue under it.
And what is it intended for then? The first 1% of library for displaying list localized for humans?
How you plan to fit https://unicode-org.github.io/cldr-staging/charts/37/by_type/miscellaneous.displaying_lists.html#4f8acaf2d32aff3a into one string property?
Over an already ambiguous documentation
"Gets or sets the string that separates items in a list."
you put a warning discouraging the usage. Is there any real use case for this property using the new implementations? And how you justify the.
being returned fores
/de
/fr
/…? There is no language that uses.
as list separator, no matter the interpretation.I think the problem is clear:
The code is clearly broken, it was identified as broken in .Net 3.1 in UNIX and now is going to affect many people.
The proper implementation is not trivial. If ICU will have an equivalent of
LOCALE_SLIST
nobody would be trying to redefine the meaning ofListSeparator
, but it doesn’t.Also, you all are about to release .Net 5 and you don’t want/can’t make changes in the code, like marking it as
Obsolete
, so you’re trying to sell me that this is not a bug, it’s a preciously crafted fallback implementations that has been carefully designed.I can understand deadlines, but just be honest.
It’s true that the documentation of
ListSeparator
andLOCALE_SLIST
is intentionally ambiguous, but here more evidence that THE WORLD is using ListSeparator and LOCALE_SLIST for localized excel CSVs.https://community.intel.com/t5/Intel-Fortran-Compiler/CSV-delimiter-detection/td-p/956160 http://www.dataaccess.com/KBasePublic/KBPrint.asp?ArticleID=1388 https://it.mathworks.com/matlabcentral/answers/343998-how-can-i-recognize-automatically-the-list-separator-in-the-region-and-languages-settings http://www.vbaexpress.com/forum/showthread.php?42769-save-as-csv-file-format/page2 https://www.mrexcel.com/board/threads/reading-large-dataset-in-excel.929526/ https://forums.ni.com/t5/LabWindows-CVI/Creating-CSV-file-according-with-the-international-settings/td-p/3184019?profile.language=en http://archives.miloush.net/michkap/archive/2007/08/15/4396922.html https://engineertips.wordpress.com/2018/06/07/local-csv-separator-in-delphi/ https://www.generacodice.com/en/articolo/130293/How-to-read-'List-separator'-from-OS-in-Java http://www.office-loesung.de/ftopic87375_0_0_asc.php https://codereview.stackexchange.com/questions/126146/change-list-separator-parse-file-restore-list-separator-to-original-value https://www.experts-exchange.com/questions/24491939/Convert-a-xls-File-to-a-CSV-file-SemiColon-Seperated.html https://living-sun.com/excel/249689-how-to-export-excel-to-csv-file-with-ldquordquo-delimted-and-utf-8-code-excel-vba-csv-utf-8.html http://www.delphigroups.info/2/10/861201.html https://forums.codeguru.com/showthread.php?348853-Regional-config-of-panel-control
Do you really need more evidence?
I don’t think it is a good idea to throw or mark it obsolete. we are going to fix it in the next releases anyway. Anyone can switch back to the old behavior anyway using the config switch. And the current behavior is already used on Linux for long time now.
Thanks for all your feedback and discussion and feel free to ping us if you get any more questions.
Hi @tarekgh,
If you use Excel in Spain or Germany (any most (all?) of the continental Europe) and export to CSV instead of:
you get
Notice how the decimal numbers use
,
, so we use;
to separate cells.I’m sure there are ways to configure Excel to export the english way, but CSV is typically used for one time data-loading scenarios connecting different departments, provided by the customer or other third party company, downloaded from internet, etc… it is very convenient to be able to set the culture once, and the CSV library switches number format, date format, and separator.
By using the code that I mentioned (ups… corrected now):
This will work for the European cultures and for the English cultures, but if someone from the second group (for example latin america
es-DO, es-GT, es-HN, es-NI, es-PA, es-PE, es-PR
) exports from Excel to CSV, maybe it will produce something like this:(notice the
.
in the decimal number. Confirmation of this behavior pending!)And my simple heuristic won’t work, because it will assume that is english style and expect
,
to be the separator.@KalleOlaviNiemitalo I don’t think using
", "
or","
is useful for anybody, why useCultureInfo.Current.TextInfo.ListSeparator
instead of just ", "? Just so maybe they change it in the future and what you formatted is completely crazy in other languages?Remember that in Europe we use
,
to enumerate stuff, not;
. The;
is just a Excel CSV thing.Also, checking for
TextInfo.ListSeparator
in google, looks like the people using it right now are CSV libraries:https://dotnetfiddle.net/MAm1t1 https://github.com/JoshClose/CsvHelper/issues/918
While I have applications running in Docker for production, parsing Csv files is typically done in ETL processes that we run in windows, so it has gone unnoticed for now.
About
GetListSeparator
, the current implementation ofListSeparator
not only return null orstring.Empty
. Also returns.
in a lot of cases.For now I use:
This works for the first 180 cultures (english group) and the this group of 375 (europe group) but leaves the other cultures broken.
Can someone from this groups check what excel does when you export to CSV?