runtime: System.IO.Compression: ZipArchive does not respect CP437 encoding when reading files

Since dotnet/corefx#9004, the ZipArchive class uses ASCII encoding instead of IBM 437 (CP437) when reading and processing the file names of zip entries that do not have the language encoding flag (bit 11 of the general purpose bit flag) set.

The zip file specification appendix D specifies how encoding should be handled:

The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification.
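
The rule quoted above can be sketched in a few lines. This is a language-neutral illustration in Python, not the ZipArchive implementation; the function name `decode_entry_name` and the constant `UTF8_FLAG` are hypothetical names chosen for this example.

```python
UTF8_FLAG = 1 << 11  # general purpose bit 11 (language encoding flag)


def decode_entry_name(raw_name: bytes, flags: int) -> str:
    """Decode a zip entry name following the rule in appendix D."""
    if flags & UTF8_FLAG:
        # Bit 11 set: the name is stored as UTF-8.
        return raw_name.decode("utf-8")
    # Bit 11 unset: the name uses the original ZIP encoding, CP437.
    return raw_name.decode("cp437")


# 0x9A is "Ü" in CP437, so these bytes spell "Über.txt".
cp437_name = bytes([0x9A]) + b"ber.txt"
print(decode_entry_name(cp437_name, flags=0))
print(decode_entry_name("Über.txt".encode("utf-8"), flags=UTF8_FLAG))
```

Both calls should yield the same name, because the flag alone selects the decoder.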

This mainly affects European languages: they use many characters that are present in CP437 but not in ASCII (such as Ü), so file names containing those characters are decoded incorrectly.

Windows 10 still defaults to CP437 where possible. To reproduce this, create a file called Über.txt and zip it using File Explorer. The file name will be encoded using CP437, but the ZipArchive class decodes it as �ber.txt.

Because ZipArchive re-creates the individual zip entries when the archive is saved in Update mode, the original file name is lost when the archive is opened and closed, even if that entry is not modified.
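
The data loss can be reproduced at the byte level. The sketch below is Python, not the ZipArchive code. It assumes that undecodable bytes become U+FFFD (as in the �ber.txt result described above) and that the misread name is written back as UTF-8; both are assumptions made for illustration.

```python
original = bytes([0x9A]) + b"ber.txt"  # "Über.txt" as stored in CP437

# Decoding with the intended code page recovers the name.
expected = original.decode("cp437")

# ASCII cannot represent 0x9A; errors="replace" substitutes U+FFFD.
misread = original.decode("ascii", errors="replace")

# Assumption: the misread name is written back as UTF-8 on save.
rewritten = misread.encode("utf-8")

print(expected)
print(misread)
print(rewritten != original)  # the stored bytes no longer match
```

Once the replacement character has been written back, the original byte 0x9A cannot be recovered from the archive.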

This appears to be a side effect of removing CP437 support from .NET Core, although it is still available in the System.Text.Encoding.CodePages package (see dotnet/runtime#17849 for more information).

A simple fix would be to have the System.IO.Compression package take a dependency on System.Text.Encoding.CodePages, but I am not sure whether this is desirable.

/cc @ianhays

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Cool. @ianhays will be able to provide further guidance (I don’t know the answer myself, sorry).

I sent you a repo Collaborator invite. Once you accept, we can assign it to you. Assigning to @ianhays for now.

Is there a preprocessor symbol for distinguishing between coreclr and full framework? I’m trying to distinguish between the CodePages package and the CP437 support in full framework.

You could use FEATURE_ZLIB or add your own to be more specific. If the code ends up being especially complex, I would suggest splitting the file, making the class partial, and conditionally including the parts based on architecture, e.g. ZipArchive.net46 and ZipArchive.Windows.