runtime: GZipStream regression - Much worse compression for sparse data when specifying CompressionLevel.Optimal
Expected behaviour
To produce an identical compressed result for the same original byte array on .NET Framework and .NET Core.
Actual behaviour
The results of GZipStream on .NET Framework and .NET Core are different.
Repro
- An array of 1500 'A' characters
- Compressed using GZipStream
- Verified the results on .NET Framework (4.6.1) and .NET Core (2.1); they should be identical, but they are not.
Code available here.
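The linked code is not included here; a minimal sketch along these lines (assuming a plain GZipStream over a MemoryStream, which is what the description implies) is enough to reproduce the comparison:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class Repro
{
    static void Main()
    {
        // 1500 'A' characters, as described above.
        byte[] input = Encoding.ASCII.GetBytes(new string('A', 1500));

        using (var output = new MemoryStream())
        {
            // leaveOpen so the MemoryStream stays readable after the GZipStream is disposed.
            using (var gzip = new GZipStream(output, CompressionLevel.Optimal, leaveOpen: true))
            {
                gzip.Write(input, 0, input.Length);
            }

            // Hex-encode the compressed bytes; run on .NET Framework and .NET Core and compare.
            Console.WriteLine(BitConverter.ToString(output.ToArray()).Replace("-", "").ToLowerInvariant());
        }
    }
}
```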
Results
Platform | Compressed result (hex) |
---|---|
.NET Framework 4.6.1 | 1f8b080000000000040073741c05a360148c825130dc00004d6f6cebdc050000 |
.NET Core 2.2.100 (Commit b9f2fa0ca8), Host 2.2.0 (Commit 1249f08fed) | 1f8b080000000000000b73741c05a321301a02a321301a028ec30c00004d6f6cebdc050000 |
Executed on Windows 10 Pro 64-bit, version 1809, build 17763.134.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 38 (25 by maintainers)
We don’t expect/require that the compressed data will match across runtimes/versions, as long as it’s valid. But there does appear to be a size regression here.
Running the repro across the different CompressionLevel values results in the following for me on Windows:
Note that the values for NoCompression and Fastest are the same, but the values for Optimal differ by 4x. Since Optimal is supposed to mean best compression ratio, there’s obviously an issue when it’s possible to be 4x smaller.
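The snippet used for that comparison isn't preserved above; a sketch of that kind of per-level size check, using the same 1500-byte 'A' input as the repro, might look like this:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class LevelComparison
{
    static void Main()
    {
        // Same input as the repro: 1500 'A' characters.
        byte[] input = Encoding.ASCII.GetBytes(new string('A', 1500));

        foreach (CompressionLevel level in new[]
                 { CompressionLevel.NoCompression, CompressionLevel.Fastest, CompressionLevel.Optimal })
        {
            using (var output = new MemoryStream())
            {
                // leaveOpen so the MemoryStream is still readable after the GZipStream is disposed.
                using (var gzip = new GZipStream(output, level, leaveOpen: true))
                {
                    gzip.Write(input, 0, input.Length);
                }
                Console.WriteLine($"{level}: {output.Length} bytes");
            }
        }
    }
}
```

Running the same program on .NET Framework and .NET Core is what surfaces the 4x difference for Optimal.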
So I happened to look at this a little while investigating another issue and have some information to share. For input data that represents “normal” data, Optimal provides better compression than Fastest. I was testing with the XML versions of the ECMA specs (334/335) and saw a significant difference between the two settings, on par with what we got on Desktop, so at least in this case there is no regression.

For input data that is very regular (all zeros, all 'a's, or a repeating pattern), Optimal and Fastest do a similar job of compressing, and Optimal is worse than Desktop. We shouldn't discard this scenario, since sparse data is rather common and this regression impacts it as well. If I inject enough sparse data into my normal case, I can reproduce differences from Desktop similar to the regular case.

I went back to see where this regression was introduced and it appears to be between .NET Core 1.0 and 1.1. It's isolated to changes to CLRCompression.dll in that release (replacing the 1.1 version of the DLL with the 1.0 one undoes the regression). https://github.com/dotnet/corefx/commits/release/1.1.0/src/Native/Windows/clrcompression
So I would further scope this issue to be: CompressionLevel.Optimal does worse with sparse data than desktop and netcoreapp1.0.
There are only a couple of changes in the 1.0/1.1 window, so I plan to selectively revert them to see if I can identify the regression.
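A rough sketch of the sparse-versus-normal comparison described above; the data generator and the file name are hypothetical stand-ins, not the actual test inputs:

```csharp
using System;
using System.IO;
using System.IO.Compression;

class DataShapeComparison
{
    // Compress a buffer at the given level and return the compressed size in bytes.
    static long CompressedSize(byte[] input, CompressionLevel level)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, level, leaveOpen: true))
            {
                gzip.Write(input, 0, input.Length);
            }
            return output.Length;
        }
    }

    static void Main()
    {
        // "Sparse"/regular input: 1 MB of a single repeated byte.
        byte[] sparse = new byte[1 << 20];
        for (int i = 0; i < sparse.Length; i++) sparse[i] = (byte)'a';

        // "Normal" input: any large, varied file; "ecma-334.xml" is a hypothetical stand-in.
        byte[] normal = File.ReadAllBytes("ecma-334.xml");

        foreach (var (name, data) in new[] { ("sparse", sparse), ("normal", normal) })
        {
            Console.WriteLine($"{name}: Fastest = {CompressedSize(data, CompressionLevel.Fastest)} bytes, " +
                              $"Optimal = {CompressedSize(data, CompressionLevel.Optimal)} bytes");
        }
    }
}
```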
Updated the data in place with the sorting requested.
@ericstj Thanks for sharing this data. Could you please add another column for original size and sort the table by size?
Sure, I was planning on getting some data comparing deflate_slow to deflate_medium. I can replay the 6 vs 9 comparison as well (using only deflate_slow of course, since the current 6 maps to deflate_medium).
So I’ll have data for the following scenarios:
Yes, removing USE_MEDIUM does improve the compression, but that flag existed back in 1.0 as well and we didn't have this problem there. https://github.com/dotnet/corefx/blob/8209c8cc3a642c24b0c6b0b26a4ade94038463fa/src/Native/Windows/clrcompression/zlib-intel/zutil.h#L142
I believe we would have been hitting that same define in 1.0. I’ll go back and do some more forensics to see if the pre-processor evaluation changed.