runtime: GZipStream regression - Much worse compression for sparse data when specifying CompressionLevel.Optimal

Expected behaviour

To produce an identical compressed result for the same original byte array on .NET Framework and .NET Core.

Actual behaviour

Results of GZipStream on .NET Framework and .NET Core are different.

Repro

  1. An array of 1,500 'A' characters
  2. Compressed using GZipStream
  3. Verified the results on .NET Framework (4.6.1) and .NET Core (2.1), which should be identical, but they are not.

Code available here.
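
The linked repro is roughly along these lines (a minimal sketch; the class name and hex formatting are illustrative, and CompressionLevel.Optimal is assumed per the issue title):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class Repro
{
    static void Main()
    {
        // 1,500 'A' characters, as described in the repro steps above.
        byte[] input = Encoding.ASCII.GetBytes(new string('A', 1500));

        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream gz = new GZipStream(ms, CompressionLevel.Optimal, leaveOpen: true))
            {
                gz.Write(input, 0, input.Length);
            }

            // Dump the compressed bytes as a lowercase hex string for comparison across runtimes.
            Console.WriteLine(BitConverter.ToString(ms.ToArray()).Replace("-", "").ToLowerInvariant());
        }
    }
}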

Results

Platform: .NET Framework 4.6.1
Compressed result (hex): 1f8b080000000000040073741c05a360148c825130dc00004d6f6cebdc050000

Platform: .NET Core 2.2.100 (Commit b9f2fa0ca8), Host 2.2.0 (Commit 1249f08fed)
Compressed result (hex): 1f8b080000000000000b73741c05a321301a02a321301a028ec30c00004d6f6cebdc050000

Executed on Windows 10 Pro 64bit 1809 build 17763.134

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 38 (25 by maintainers)

Most upvoted comments

Given that nothing seems to be actually wrong, I would not be surprised if we end up not treating it as a bug.

We don’t expect/require that the compressed data will match across runtimes/versions, as long as it’s valid. But there does appear to be a size regression here.

using System;
using System.IO;
using System.IO.Compression;

class Program
{
    static void Main()
    {
        byte[] data = new byte[1024*1024];
        for (int i = 0; i < data.Length; i++) data[i] = (byte)'a';

        Compress(data, CompressionLevel.NoCompression);
        Compress(data, CompressionLevel.Optimal);
        Compress(data, CompressionLevel.Fastest);
    }

    static void Compress(byte[] data, CompressionLevel level)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream gz = new GZipStream(ms, level, leaveOpen: true))
            {
                gz.Write(data, 0, data.Length);
            }
            Console.WriteLine(level + ":\t" + ms.Length);
        }
    }
}

results in the following for me on Windows:

C:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f net472
NoCompression: 1048759
Optimal: 1052
Fastest: 4609

C:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f netcoreapp3.0
NoCompression:  1048759
Optimal:        4590
Fastest:        4609

Note that the values for NoCompression and Fastest are the same, but the values for Optimal differ by roughly 4x. Since Optimal is supposed to mean the best compression ratio, there's clearly an issue when the output could be 4x smaller.
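
For context on the "as long as it's valid" point above: the larger .NET Core output should still be a well-formed gzip stream. A roundtrip check along these lines can confirm that (a sketch, not part of the original comment; the helper name is made up):

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

static class GZipRoundTrip
{
    // Returns true if 'compressed' decompresses back to exactly 'original'.
    public static bool RoundTrips(byte[] original, byte[] compressed)
    {
        using (MemoryStream input = new MemoryStream(compressed))
        using (GZipStream gz = new GZipStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            gz.CopyTo(output);
            // Valid output is output that reproduces the original bytes, regardless of compressed size.
            return output.ToArray().SequenceEqual(original);
        }
    }
}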

So I happened to look at this a little while investigating another issue and have some information to share. For input data that represents "normal" data, Optimal provides better compression than Fastest. I was testing with the XML versions of the ECMA specs (334/335) and saw a significant difference between the two settings, on par with what we got in Desktop, so at least in that case there is no regression.

For input data that is very regular (all zeros, all 'a', or a repeating pattern), I'm seeing that Optimal and Fastest do a similar job of compressing, and Optimal is worse than Desktop. We shouldn't discard this scenario, since it is rather common to have sparse data and this regression impacts that as well. If I inject enough sparse data into my "normal" case, I can reproduce differences from Desktop similar to those in the regular case.
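
A rough way to explore the scenario described above is to measure Optimal against Fastest on both a "normal" input (the comment used the ECMA-334/335 XML specs) and a very regular one. This sketch is not from the thread; the input file path is a placeholder, and the all-zeros buffer stands in for the sparse case:

using System;
using System.IO;
using System.IO.Compression;

class LevelComparison
{
    static void Main(string[] args)
    {
        // Pass a path to a "normal" file as the first argument (e.g. a large XML document);
        // the all-zeros buffer represents the very regular ("sparse") input.
        byte[] normal = File.ReadAllBytes(args[0]);
        byte[] sparse = new byte[1024 * 1024];

        Report("normal", normal);
        Report("sparse", sparse);
    }

    static void Report(string name, byte[] data)
    {
        Console.WriteLine($"{name} ({data.Length} bytes): " +
                          $"Optimal={CompressedSize(data, CompressionLevel.Optimal)}, " +
                          $"Fastest={CompressedSize(data, CompressionLevel.Fastest)}");
    }

    static long CompressedSize(byte[] data, CompressionLevel level)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream gz = new GZipStream(ms, level, leaveOpen: true))
            {
                gz.Write(data, 0, data.Length);
            }
            return ms.Length;
        }
    }
}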

I went back to see where this regression was introduced, and it appears to be between .NET Core 1.0 and 1.1. It's isolated to changes to CLRCompression.dll in that release (downgrading the 1.1 version to the 1.0 version undoes the regression). https://github.com/dotnet/corefx/commits/release/1.1.0/src/Native/Windows/clrcompression

So I would further scope this issue as: CompressionLevel.Optimal does worse with sparse data than Desktop and netcoreapp1.0.

There are only a couple of changes in the 1.0/1.1 window, so I plan to selectively revert them to see if I can identify the regression.

Updated the data in place with the requested sorting.

@ericstj Thanks for sharing this data. Could you please add another column for original size and sort the table by size?

Sure, I was planning on getting some data comparing deflate_slow to deflate_medium. I can replay the 6 vs 9 comparison as well (using only deflate_slow, of course, since the current level 6 maps to deflate_medium).

So I’ll have data for the following scenarios:

  • 6 : deflate_medium
  • 6 : deflate_slow
  • 9 : deflate_slow

Yes, removing USE_MEDIUM does improve the compression, but that flag existed back in 1.0 as well and we didn't have this problem there. https://github.com/dotnet/corefx/blob/8209c8cc3a642c24b0c6b0b26a4ade94038463fa/src/Native/Windows/clrcompression/zlib-intel/zutil.h#L142

I believe we would have been hitting that same define in 1.0. I’ll go back and do some more forensics to see if the pre-processor evaluation changed.