pako: Failing for BGZIP'd streaming files
Hi all, thanks for the wonderful library!
Unfortunately I think I’ve found a bug: files compressed with bgzip (block gzip) fail when using pako for streaming decompression.
The file pako-fail-test-data.txt.gz is an example file that triggers what I believe to be an error. The uncompressed file is 65,569 bytes, which is just larger than what I assume to be the block size relevant to bgzip (somewhere around 65,280?). Here is a small shell session with some relevant information:
$ wc pako-fail-test-data.txt
1858 16831 65569 pako-fail-test-data.txt
$ md5sum pako-fail-test-data.txt
7eae4c6bc0e68326879728f80a0e002b pako-fail-test-data.txt
$ zcat pako-fail-test-data.gz | bgzip -c > pako-fail-test-data.txt.gz
$ md5sum pako-fail-test-data.txt.gz
f4d0b896c191f66ff6962de37d69db45 pako-fail-test-data.txt.gz
$ bgzip -h
Version: 1.4.1
Usage: bgzip [OPTIONS] [FILE] ...
Options:
-b, --offset INT decompress at virtual file pointer (0-based uncompressed offset)
-c, --stdout write on standard output, keep original files unchanged
-d, --decompress decompress
-f, --force overwrite files without asking
-h, --help give this help
-i, --index compress and create BGZF index
-I, --index-name FILE name of BGZF index file [file.gz.gzi]
-r, --reindex (re)index compressed file
-g, --rebgzip use an index file to bgzip a file
-s, --size INT decompress INT bytes (uncompressed size)
-@, --threads INT number of compression threads to use [1]
Here is some sample code that should decompress the whole file, but doesn’t. My apologies for it not being elegant; I’m still learning and threw a few things together to get something that I believe triggers the error:
var pako = require("pako"),
    fs = require("fs");

var CHUNK_SIZE = 1024 * 1024,
    buffer = Buffer.alloc(CHUNK_SIZE);

// Copy a Uint8Array produced by pako into a Buffer so it can be stringified.
function _node_uint8array_to_string(data) {
  return Buffer.from(data).toString();
}

var inflator = new pako.Inflate();

inflator.onData = function (chunk) {
  var v = _node_uint8array_to_string(chunk);
  process.stdout.write(v);
};

fs.open("./pako-fail-test-data.txt.gz", "r", function (err, fd) {
  if (err) { throw err; }

  // Read the file in CHUNK_SIZE pieces and feed each piece to the inflator.
  function read_chunk() {
    fs.read(fd, buffer, 0, CHUNK_SIZE, null, function (err, nread) {
      if (err) { throw err; }
      var data = buffer;
      if (nread < CHUNK_SIZE) { data = buffer.slice(0, nread); }
      inflator.push(data, false);
      if (nread > 0) { read_chunk(); }
    });
  }

  read_chunk();
});
I did not signal an end block (that is, I never call inflator.push(data, true) anywhere), and there may be other problems with how the data blocks are read from fs, but I hope you’ll forgive this sloppiness in the interest of keeping the example simple enough to illuminate the relevant issue.
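For completeness, here is a minimal sketch of the same read loop with the end flag added. It reuses fd, buffer, CHUNK_SIZE and inflator from the example above, and simply assumes that a zero-byte read means end of file:

// Sketch only: same read loop as above, but a zero-byte read is treated as
// EOF and an empty final push(…, true) tells pako the stream is finished.
function read_chunk() {
  fs.read(fd, buffer, 0, CHUNK_SIZE, null, function (err, nread) {
    if (err) { throw err; }
    if (nread === 0) {
      inflator.push(new Uint8Array(0), true); // final push, ends the stream
      return;
    }
    inflator.push(buffer.slice(0, nread), false);
    read_chunk();
  });
}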
Running the original example successfully decompresses a portion of the file but then stops at what I believe to be the end of the first block. Here are some shell commands that might be enlightening:
$ node pako-error-example.js | wc
1849 16755 65280
$ node pako-error-example.js | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8 -
$ zcat pako-fail-test-data.txt.gz | md5sum
7eae4c6bc0e68326879728f80a0e002b -
$ zcat pako-fail-test-data.txt.gz | head -c 65280 | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8 -
$ zcat pako-fail-test-data.txt.gz | wc
1858 16831 65569
Running another simple example using browserify-zlib triggers an error outright:
var fs = require("fs"),
    zlib = require("browserify-zlib");

var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();

z.on("data", function (chunk) {
  process.stdout.write(chunk.toString());
});

r.pipe(z);
When run via node stream-example-2.js, the error produced is:
events.js:137
throw er; // Unhandled 'error' event
^
Error: invalid distance too far back
at Zlib._handle.onerror (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/index.js:352:17)
at Zlib._error (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:283:8)
at Zlib._checkError (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:254:12)
at Zlib._after (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:262:13)
at /home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:126:10
at process._tickCallback (internal/process/next_tick.js:150:11)
I assume this is a pako error, since browserify-zlib uses pako underneath, so my apologies if this is actually a browserify-zlib error and has nothing to do with pako.
As a “control”, the following code works without issue:
var fs = require("fs"),
    zlib = require("zlib");

var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();

z.on("data", function (chunk) {
  process.stdout.write(chunk.toString());
});

r.pipe(z);
bgzip is used to allow random access to gzipped files. The resulting block-compressed file is a bit bigger than with straight gzip compression, but the small increase in compressed size is often worth it for the ability to efficiently access arbitrary positions in the uncompressed data.
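To make the structure concrete: as I understand it (from the BGZF description in the SAM specification, not from pako), a bgzip file is a series of complete, independently compressed gzip members, each of which records its own compressed size in a "BC" extra subfield of its gzip header. A rough sketch of walking the block boundaries under that assumption:

var fs = require("fs");

// Sketch: list BGZF block offsets by reading the BSIZE value stored in the
// "BC" extra subfield of each gzip member header. Assumes the standard bgzip
// layout (FEXTRA set, 6-byte extra field with the BC subfield first).
function list_bgzf_blocks(path) {
  var buf = fs.readFileSync(path);
  var offsets = [];
  var pos = 0;
  while (pos + 18 <= buf.length) {
    if (buf[pos] !== 0x1f || buf[pos + 1] !== 0x8b) { break; }   // gzip magic
    if ((buf[pos + 3] & 4) === 0) { break; }                     // FEXTRA flag
    if (buf[pos + 12] !== 66 || buf[pos + 13] !== 67) { break; } // 'B', 'C'
    var bsize = buf.readUInt16LE(pos + 16) + 1; // total compressed block size
    offsets.push(pos);
    pos += bsize;
  }
  return offsets;
}

console.log(list_bgzf_blocks("./pako-fail-test-data.txt.gz"));

Each block can then be inflated on its own, which is what makes the random access described above possible.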
My specific use case is that I want to process a large text file (~115 MB compressed, ~650 MB uncompressed, with other files being even larger). Loading the complete file, either compressed or uncompressed, is not an option, either because of memory exhaustion or outright memory limits in JavaScript. I only need to process the data in a streaming manner (that is, I only need to look at the data once and can then mostly discard it), which is why I was looking into this option. The bioinformatics community uses this method quite a bit (bgzip is itself part of tabix, which is part of a bioinformatics library called htslib), so it would be nice if pako supported this use case.
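For what it’s worth, here is a rough sketch of the kind of streaming I mean, using Node’s built-in zlib from the “control” example above together with readline; the line count stands in for whatever per-line statistics need gathering (a browser equivalent of this is exactly what’s missing):

var fs = require("fs"),
    zlib = require("zlib"),
    readline = require("readline");

// Stream a (b)gzipped text file line by line without holding it all in memory.
var lines = readline.createInterface({
  input: fs.createReadStream("pako-fail-test-data.txt.gz").pipe(zlib.createGunzip())
});

var count = 0;
lines.on("line", function (line) {
  count += 1; // accumulate per-line statistics here
});
lines.on("close", function () {
  console.log("lines: " + count);
});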
If there is another library I should be using to allow for stream processing of compressed data in the browser, I would welcome any suggestions.
@drtconway the wrapper has changed significantly, but a multistream test exists: https://github.com/nodeca/pako/blob/0398fad238edc29df44f78e338cbcfd5ee2657d3/test/gzip_specials.js#L60-L77
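(For context, “multistream” here means several complete gzip members concatenated back to back, which is the shape bgzip produces. The snippet below is only an illustration of that shape, not the actual test from gzip_specials.js.)

var pako = require("pako");

// Two complete gzip members concatenated, the same shape bgzip produces.
var part1 = pako.gzip("hello ");
var part2 = pako.gzip("world");
var multi = new Uint8Array(part1.length + part2.length);
multi.set(part1, 0);
multi.set(part2, part1.length);

// With multistream support, inflating the whole buffer yields "hello world";
// without it, decompression stops after the first member.
console.log(pako.inflate(multi, { to: "string" }));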
@rbuels I’m working on reading local fastq.gz files in the browser and stumbled upon this issue. I haven’t been able to get pako to work so far. Is there currently a working solution for streaming bgzf files in the browser?
EDIT: I need streaming because the files are large. I don’t need (and can’t afford) to store the entire file in memory; I just need to stream through all the lines to gather some statistics.
For the bioinformaticians in the thread, just going to say that I ended up coding around this issue and eventually releasing https://github.com/GMOD/bgzf-filehandle and https://github.com/GMOD/tabix-js for accessing BGZIP and tabix files, respectively.
We’re seeing this issue quite a bit (I wonder if bioinformaticians just really like gzipping in blocks!)
Making the inflateReset into an inflateResetKeep call fixes the “invalid distance too far back” error, but results in the chunk remainder being written into the out buffer (which is bad). This can be fixed by moving the status === c.Z_STREAM_END condition for the strm.next_out write branch as in #146 (and I think that’s the right thing to do?). The “Read stream with SYNC marks” test still fails, though, and I don’t quite understand why either.
I put these changes up at https://github.com/onecodex/pako and I’m happy to restructure or make a PR if that’s helpful (thanks to Kirill89’s issue-139 branch and rbuels’ #145 PR for providing 99% of the work here).
@rbuels look at these lines:
https://github.com/nodeca/pako/blob/c60b97e22239c02c0b5a112abbd6c6a9b5d86b45/lib/inflate.js#L279-L281
It seems this should be removed, because Z_STREAM_END is processed inside the loop and should not finalize the inflate. But after removing that line, one test fails, and that’s the main reason why @Kirill89’s commit was postponed.
I checked the same file against the original zlib code and found the same behavior (inflate returns Z_STREAM_END too early).
I also found a very interesting implementation of a wrapper around the inflate function. According to that implementation, we must do an inflateReset on every Z_STREAM_END instead of terminating.
This is a possible fix: https://github.com/nodeca/pako/commit/c60b97e22239c02c0b5a112abbd6c6a9b5d86b45
After that fix one test becomes broken, but I don’t understand why (I need your help to solve it).
Code to reproduce the same behavior:
Got it (with the simple example from the last post). I think the problem is here: https://github.com/nodeca/pako/blob/893381abcafa10fa2081ce60dae7d4d8e873a658/lib/inflate.js#L273
Let me explain. Pako consists of 2 parts: the low-level code ported from zlib, and the high-level wrappers.
When we implemented the wrappers, we could not figure out what to do if a stream consists of multiple parts (and probably returns multiple Z_STREAM_ENDs). That’s not a widely used mode.
/cc @Kirill89 could you take a look?
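For anyone who just needs to get past this in application code today, a crude workaround is to treat each Z_STREAM_END as the end of one gzip member and restart inflation on the leftover bytes. The sketch below does that for a whole in-memory Uint8Array; note that it peeks at inflator.strm, which is an internal pako detail rather than a documented API, so treat that as an assumption:

var pako = require("pako");

// Sketch: inflate a concatenated (multi-member) gzip buffer, e.g. a BGZF file,
// by restarting pako's Inflate whenever one member ends before the input does.
// ASSUMPTION: inflator.strm.avail_in (unconsumed input bytes) is reachable,
// which is an internal detail of pako rather than a public API.
function inflate_concatenated(data, on_chunk) {
  var remaining = data;
  while (remaining.length > 0) {
    var inflator = new pako.Inflate();
    inflator.onData = on_chunk;
    inflator.push(remaining, true);
    if (inflator.err) { throw new Error(inflator.msg || "inflate error"); }
    var leftover = inflator.strm.avail_in;
    // Bytes the finished member did not consume belong to the next member.
    remaining = leftover > 0
      ? remaining.subarray(remaining.length - leftover)
      : new Uint8Array(0);
  }
}

To use this in a truly streaming setting you would carry the leftover bytes over between reads instead of holding the whole buffer, but it demonstrates the reset-on-Z_STREAM_END idea discussed above.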