numcodecs: numcodecs.Gzip can't read files written with zlib in the gzip format
The numcodecs.Gzip codec can’t read files that were produced by using the zlib C API to obtain a gzip-compressed file. It fails with
File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/gzip.py", line 411, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'x^')
This is because numcodecs.Gzip uses the Python gzip library to read and write files. This library uses zlib internally, but it adds additional bytes to the header and expects them to be present when reading.
These bytes are not present when producing a gzip stream via the zlib C API:
deflateInit2(&zs, compressionLevel,
Z_DEFLATED, MAX_WBITS + 16,
MAX_MEM_LEVEL, Z_DEFAULT_STRATEGY)
Rather, they are part of the gzip file format produced by the Unix gzip command.
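The error message above can be reproduced from Python with a small sketch. It rests on an assumption that is not confirmed in this issue: that the failing file begins with a zlib-style header. The magic b'x^' (0x78 0x5e) in the error is a valid zlib header, while gzip streams start with 0x1f 0x8b. The payload and the helper name python_gzip_can_read are made up for illustration.

```python
import gzip
import zlib

# Hypothetical payload standing in for a chunk of array data.
payload = b"example chunk data" * 4

# zlib-wrapped deflate stream (default wbits). Its first byte is 0x78 ('x'),
# like the b'x^' magic in the error message; this is NOT the gzip magic.
zlib_stream = zlib.compress(payload, 5)


def python_gzip_can_read(data: bytes) -> bool:
    """Return True if the standard-library gzip module accepts the stream."""
    try:
        gzip.decompress(data)
        return True
    except OSError:
        # 'Not a gzipped file' is raised as OSError
        # (gzip.BadGzipFile, a subclass of OSError, on Python >= 3.8).
        return False


print(zlib_stream[:2], python_gzip_can_read(zlib_stream))
```

If the data written by the C code really starts with these bytes, the Python gzip module rejects it at the header check before any decompression happens.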
I would propose not using the Python gzip module, but rather using the Python zlib module for both compression and decompression in a gzip-compatible format.
Note that this should be backward compatible, because zlib can read files written by the Unix gzip command.
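A minimal sketch of what this proposal could look like in Python, using only the standard-library zlib module. This is not the numcodecs implementation; the function names encode_gzip and decode_gzip_or_zlib are hypothetical. It relies on the documented wbits convention: MAX_WBITS + 16 writes a gzip header and trailer, and MAX_WBITS + 32 auto-detects a gzip or zlib header when reading.

```python
import gzip
import zlib


def encode_gzip(data: bytes, level: int = 5) -> bytes:
    """Compress to a gzip-format stream using zlib directly (hypothetical helper)."""
    # wbits = MAX_WBITS + 16 mirrors the deflateInit2 call shown earlier.
    compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS + 16)
    return compressor.compress(data) + compressor.flush()


def decode_gzip_or_zlib(data: bytes) -> bytes:
    """Decompress a stream with either a gzip or a zlib header (hypothetical helper)."""
    # wbits = MAX_WBITS + 32 enables automatic header detection.
    return zlib.decompress(data, zlib.MAX_WBITS + 32)


payload = b"example chunk data" * 4

# Round trip through the zlib-based helpers.
assert decode_gzip_or_zlib(encode_gzip(payload)) == payload
# Data written by the Python gzip module is still readable.
assert decode_gzip_or_zlib(gzip.compress(payload)) == payload
# A zlib-wrapped stream is accepted as well.
assert decode_gzip_or_zlib(zlib.compress(payload)) == payload
```

Accepting both header types on read is a design choice in this sketch: it keeps existing gzip data readable while also tolerating streams that start with a zlib header.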
So far I have only tested this with the zlib C API, not with Python.
About this issue
- State: closed
- Created 5 years ago
- Comments: 23 (11 by maintainers)
I went ahead and created a repo to collect zarr / n5 data written by zarr and z5py: https://github.com/constantinpape/zarr_implementations
Of course, this can be extended to implementations in other languages.
If you want, we can transfer ownership to zarr-developers (I think I need to become a member to do this). I am also open to any changes you suggest.
Btw, I already profited from this because I found and fixed an issue with zarr edge chunks in z5py. I will look further into gzip in the coming days.
@jakirkham I had hoped to discuss this on the call today, but we ran out of time before we got to it. I would vote for creating a separate repo for zarr data written by different implementations, which could be extended with interoperability tests later. If you open this in zarr-developers, we can raise an issue about which example data to use and I can make a PR adding the z5py data.