runtime: System.IO.Compression handles extended characters incorrectly

This issue has been moved from a ticket on Developer Community.


[severity:It’s more difficult to complete my work] Create a text file named “tämä.txt” and send it to a compressed (zipped) folder. When the file is extracted, the library reads its name as “t„m„.txt”.

Windows Explorer expands the zip fine, as does other compression software.

using System.IO;
using System.IO.Compression;

namespace ZIPtest
{
    class Program
    {
        static void Main(string[] args)
        {
            string zipFile = @"C:\temp\tämä.zip";

            using (ZipArchive paketti = ZipFile.OpenRead(zipFile))
            {
                foreach (ZipArchiveEntry entry in paketti.Entries)
                {
                    string nimi = Path.Combine(@"c:\temp", entry.Name);
                    entry.ExtractToFile(nimi);
                }
            }
        }
    }
}

Original Comments

Feedback Bot on 9/20/2020, 11:00 PM:

We have directed your feedback to the appropriate engineering team for further evaluation. The team will review the feedback and notify you about the next steps.


Original Solutions

Tarek Mahmoud Sayed [MSFT] solved on 9/30/2020, 01:44 PM, 0 votes:

Thanks for sending the issue. I tried the same thing and was not able to reproduce it. My guess is that the problem is not the ZipArchiveEntry.Name content itself; the Visual Studio Locals window may be displaying the string differently, which can depend on the configuration of your machine. When I see this, it usually seems to depend on the default code page of the system.

Here is the code I tried:

        using (ZipArchive paketti = ZipFile.OpenRead(@"F:\temp\release.zip"))
        {
            foreach (ZipArchiveEntry entry in paketti.Entries)
            {
                Console.WriteLine($"{entry.Name} ... {DumpString(entry.Name)}");
            }
        }

And this printed the output:

  tämä.txt ... \u0074\u00e4\u006d\u00e4\u002e\u0074\u0078\u0074

which is correct.
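
DumpString is not shown in the comment above. A minimal sketch of what such a helper might look like, assuming it simply renders each UTF-16 code unit as \uXXXX (this implementation is hypothetical, and it uses C# 9 top-level statements to stay short):

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the DumpString helper referenced above:
// renders every UTF-16 code unit of the string as \uXXXX (lowercase hex).
static string DumpString(string s)
{
    var sb = new StringBuilder(s.Length * 6);
    foreach (char c in s)
    {
        sb.Append("\\u").Append(((int)c).ToString("x4"));
    }
    return sb.ToString();
}

// "\u00e4" is the precomposed form of 'ä'.
// Prints: \u0074\u00e4\u006d\u00e4\u002e\u0074\u0078\u0074
Console.WriteLine(DumpString("t\u00e4m\u00e4.txt"));
```

Dumping code units this way takes the console and debugger fonts out of the picture: if the name was decoded incorrectly, the values for the second and fourth characters will differ from 00e4.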

I suggest you try the same thing: print the characters' ordinal values manually, and check your system configuration, such as the default locale and code page.

on 10/8/2020, 05:05 AM:

(private comment, text removed)

Tarek Mahmoud Sayed [MSFT] on 10/8/2020, 10:10 AM:

Thanks for your reply. Unfortunately I cannot reach the files that you have attached. Could you please try to upload them again and let me know.

Also, please attach the code that you used to create the zip file, so I can see how you included the file tämä.txt.

on 10/9/2020, 02:29 AM:

(private comment, text removed)

Tarek Mahmoud Sayed [MSFT] on 10/9/2020, 11:01 AM:

Thanks for the details. I'll take a look.

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 17 (4 by maintainers)

Most upvoted comments

I also had ZIP entry encoding problems recently and this is what I figured out:

  • Archives created by .NET itself can be extracted without specifying an encoding just fine, so the automatically detected encoding is correct for those.
  • Archives created by the Windows shell extension or by 7-Zip (on Windows) cannot be extracted correctly with .NET unless an encoding is specified manually. More information here: https://stackoverflow.com/a/32443735/631802
  • Archives created on macOS also cannot be extracted correctly with .NET, even when the Windows workaround is applied. This means that no single encoding workaround can extract both Windows and macOS archives.

I think the runtime should handle all the encoding detection work.
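
For reference, the manual-encoding workaround from the second bullet might look roughly like this. It is a sketch, not a fix: ReadEntryNames is a hypothetical helper, and code page 437 is an assumption that fits archives produced by the Windows shell but, as noted above, not macOS ones. On .NET Core and .NET 5+, legacy code pages are only available after registering CodePagesEncodingProvider; on .NET Framework that call should not be needed.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Text;

// Makes legacy code pages such as 437 available on .NET Core / .NET 5+.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: lists entry names, decoding names that are NOT
// flagged as UTF-8 with the encoding supplied by the caller.
static List<string> ReadEntryNames(string zipPath, Encoding entryNameEncoding)
{
    var names = new List<string>();
    using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Read, entryNameEncoding))
    {
        foreach (ZipArchiveEntry entry in archive.Entries)
        {
            names.Add(entry.FullName);
        }
    }
    return names;
}

// Usage: pass the path of an archive as the only command-line argument.
if (args.Length == 1)
{
    // Assumption: the archive was written with code page 437.
    foreach (string name in ReadEntryNames(args[0], Encoding.GetEncoding(437)))
    {
        Console.WriteLine(name);
    }
}
```

The encoding passed here only applies to entries without the UTF-8 flag, so archives created by .NET itself should keep working with the same call.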

I have looked at this issue, and the encoding of Zip archive entry names is indeed not handled correctly. When an archive is created, a general purpose flag records whether the entry names are encoded using UTF-8. If this flag is off, meaning the names are not encoded using UTF-8, we don’t pick the right encoding. The following comment has more details:

https://github.com/dotnet/runtime/blob/b75f8d9fd1b3bcac5d82469bf39f1aea2c3a1652/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L356
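
To check which case a given archive falls into, the flag can be read directly from the file. Per APPNOTE.TXT, it is bit 11 (0x0800, the language encoding flag) of the general purpose bit flag, which sits at byte offset 6 of each local file header. The sketch below is only a diagnostic under stated assumptions: FirstEntryUsesUtf8 is a hypothetical helper, it needs a seekable stream, it inspects only the first entry, and it assumes the first local file header starts at offset 0 (not true for self-extracting archives, for example).

```csharp
using System;
using System.IO;

// Hypothetical diagnostic helper. Returns true/false for the UTF-8 flag of
// the first entry, or null if no local file header is found at offset 0.
static bool? FirstEntryUsesUtf8(Stream zip)
{
    const int Utf8Flag = 0x0800; // bit 11: language encoding flag (EFS)

    var header = new byte[8];
    zip.Position = 0; // requires a seekable stream
    int read = 0;
    while (read < header.Length)
    {
        int n = zip.Read(header, read, header.Length - read);
        if (n == 0) return null; // shorter than a local file header
        read += n;
    }

    // Local file header signature 0x04034b50, stored little-endian: "PK\x03\x04".
    if (header[0] != 0x50 || header[1] != 0x4B || header[2] != 0x03 || header[3] != 0x04)
    {
        return null;
    }

    // General purpose bit flag: 2 bytes, little-endian, at offset 6.
    int flags = header[6] | (header[7] << 8);
    return (flags & Utf8Flag) != 0;
}

// Usage: pass the path of an archive as the only command-line argument.
if (args.Length == 1)
{
    using (FileStream fs = File.OpenRead(args[0]))
    {
        bool? utf8 = FirstEntryUsesUtf8(fs);
        Console.WriteLine(utf8.HasValue ? utf8.Value.ToString() : "no local file header at offset 0");
    }
}
```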

Usually, when UTF-8 encoding is not used, we should consider using code page 437 instead, or investigate further along the lines described in Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT.

This issue is quite visible to users because the Windows shell does not create archives using UTF-8; from what I am seeing, it uses code page 437.
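
A small worked example of why code page 437 would produce exactly the name in the original report. The report does not say which code page the reading side used; Windows-1252 is an assumption here, chosen because it reproduces “t„m„.txt”. Reinterpret is a hypothetical helper, and the snippet uses C# 9 top-level statements.

```csharp
using System;
using System.Text;

// Makes legacy code pages such as 437 and 1252 available on .NET Core / .NET 5+.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: encodes text with one encoding and decodes the
// resulting bytes with another, which is what a wrong guess amounts to.
static string Reinterpret(string text, Encoding writtenAs, Encoding readAs)
{
    return readAs.GetString(writtenAs.GetBytes(text));
}

Encoding cp437 = Encoding.GetEncoding(437);
Encoding cp1252 = Encoding.GetEncoding(1252); // assumed code page of the reader

// 'ä' is byte 0x84 in code page 437; 0x84 is U+201E („) in Windows-1252.
string garbled = Reinterpret("t\u00e4m\u00e4.txt", cp437, cp1252);
Console.WriteLine(garbled == "t\u201em\u201e.txt"); // prints "True"

// Reversing the mistake recovers the original name.
string repaired = Reinterpret(garbled, cp1252, cp437);
Console.WriteLine(repaired == "t\u00e4m\u00e4.txt"); // prints "True"
```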