numcodecs: numcodecs.Gzip can't read files written with zlib in the gzip format

The numcodecs.Gzip codec can't read files that were produced by using the zlib C API to write a gzip-compressed stream. It fails with:

  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'x^')

This happens because numcodecs.Gzip uses the Python gzip module to read and write data. That module uses zlib internally, but it adds extra bytes to the header and expects them to be present when reading. These bytes are not present when a gzip stream is produced via the zlib C API:

deflateInit2(&zs, compressionLevel,
             Z_DEFLATED, MAX_WBITS + 16,  /* +16 asks zlib for a gzip wrapper */
             MAX_MEM_LEVEL, Z_DEFAULT_STRATEGY)

Rather, they are part of the gzip file format as produced by the Unix gzip command.
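When debugging a failure like this, it can help to look at the first two bytes of the offending chunk, since those are what the error message echoes back. The following is a minimal sketch, not part of numcodecs: `sniff_wrapper` is a hypothetical helper name, and the checks assume the standard gzip magic bytes (0x1f 0x8b) and the zlib header rule (compression method 8 in the low nibble of the first byte, and the first two bytes forming a multiple of 31).

```python
import zlib


def sniff_wrapper(data):
    """Guess the container format from the first two bytes (hypothetical helper)."""
    if len(data) < 2:
        return "unknown"
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    cmf, flg = data[0], data[1]
    # zlib header: method 8 (deflate) and a 16-bit check value divisible by 31.
    if (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0:
        return "zlib"
    return "unknown"


payload = b"chunk data " * 8

# Python analogue of the C call above: adding 16 to wbits requests a gzip wrapper.
co = zlib.compressobj(6, zlib.DEFLATED, zlib.MAX_WBITS | 16)
gzip_wrapped = co.compress(payload) + co.flush()

# Default zlib wrapper for comparison.
zlib_wrapped = zlib.compress(payload, 6)

print(sniff_wrapper(gzip_wrapped), gzip_wrapped[:2])
print(sniff_wrapper(zlib_wrapped), zlib_wrapped[:2])
print(sniff_wrapper(b"x^"))  # the two bytes quoted in the traceback
```

Running this on a chunk that fails to decode shows which kind of header the writer actually emitted, which is worth confirming before changing the codec.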

I propose not using Python's gzip module, and instead using Python's zlib module for both compression to and decompression from a gzip-compatible format.

Note that this should be backward compatible, because zlib can read files written by Unix gzip. So far I have only tested this with the zlib C API, not with Python.
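The proposal could look roughly like the following in Python. This is only a sketch under stated assumptions, not the numcodecs implementation: the function names `encode_gzip_compatible` and `decode_gzip_or_zlib` are made up for illustration, and the default level of 1 is arbitrary. It relies on two documented behaviours of the standard zlib module: adding 16 to wbits writes a gzip header and trailer, and adding 32 makes the decoder auto-detect a gzip or zlib header.

```python
import gzip
import zlib


def encode_gzip_compatible(data, level=1):
    """Compress to a gzip-wrapped stream using zlib only (hypothetical helper)."""
    co = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS | 16)
    return co.compress(data) + co.flush()


def decode_gzip_or_zlib(data):
    """Decompress a stream with either a gzip or a zlib header (hypothetical helper)."""
    # wbits + 32 enables automatic header detection in zlib.
    return zlib.decompress(data, zlib.MAX_WBITS | 32)


payload = b"some chunk bytes " * 16

# Round trip through the zlib-based helpers.
encoded = encode_gzip_compatible(payload)
decoded = decode_gzip_or_zlib(encoded)

# Data written by the gzip module stays readable.
from_gzip_module = decode_gzip_or_zlib(gzip.compress(payload))

# Data carrying a plain zlib header is accepted as well.
from_zlib_wrapper = decode_gzip_or_zlib(zlib.compress(payload))
```

Using the auto-detecting decoder on the read side is a design choice: it keeps existing gzip-written chunks readable while also tolerating writers that emit a zlib header. Multi-member gzip files are not considered in this sketch.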

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 23 (11 by maintainers)

Most upvoted comments

I went ahead and created a repo (https://github.com/constantinpape/zarr_implementations) to collect zarr / n5 data written by zarr and z5py. Of course, this can be extended to implementations in other languages.

If you want, we can transfer ownership to zarr-developers (I think I need to become a member to do this). I am also open to any changes you suggest.

Btw, I have already benefited from this: I found and fixed an issue with zarr edge chunks in z5py. I will look further into gzip in the coming days.

@jakirkham I had hoped to discuss this on the call today, but we ran out of time. I would vote for creating a separate repo for zarr data written by different implementations, which could later be extended with interoperability tests. If you open this in zarr-developers, we can raise an issue about which example data to use, and I can make a PR adding the z5py data.