pako: Failing for BGZIP'd streaming files

Hi all, thanks for the wonderful library!

Unfortunately I think I’ve found a bug: files compressed with bgzip (block gzip) fail when pako is used for streaming decompression.

The file pako-fail-test-data.txt.gz is an example file that triggers what I believe to be an error. The uncompressed file is 65,569 bytes, just larger than what I assume is the block size relevant to bgzip (somewhere around 65,280 bytes?). Here is a small shell session with some relevant information:

$ wc pako-fail-test-data.txt 
 1858 16831 65569 pako-fail-test-data.txt
$ md5sum pako-fail-test-data.txt 
7eae4c6bc0e68326879728f80a0e002b  pako-fail-test-data.txt
$ zcat pako-fail-test-data.gz | bgzip -c > pako-fail-test-data.txt.gz
$ md5sum pako-fail-test-data.txt.gz 
f4d0b896c191f66ff6962de37d69db45  pako-fail-test-data.txt.gz
$ bgzip -h

Version: 1.4.1
Usage:   bgzip [OPTIONS] [FILE] ...
Options:
   -b, --offset INT        decompress at virtual file pointer (0-based uncompressed offset)
   -c, --stdout            write on standard output, keep original files unchanged
   -d, --decompress        decompress
   -f, --force             overwrite files without asking
   -h, --help              give this help
   -i, --index             compress and create BGZF index
   -I, --index-name FILE   name of BGZF index file [file.gz.gzi]
   -r, --reindex           (re)index compressed file
   -g, --rebgzip           use an index file to bgzip a file
   -s, --size INT          decompress INT bytes (uncompressed size)
   -@, --threads INT       number of compression threads to use [1]

Here is some sample code that should decompress the whole file, but doesn’t. My apologies for it not being elegant; I’m still learning and threw a few things together to get something that I believe triggers the error:

var pako = require("pako"),
    fs = require("fs");

var CHUNK_SIZE = 1024*1024,
    buffer = new Buffer(CHUNK_SIZE);

function _node_uint8array_to_string(data) {
  var buf = new Buffer(data.length);
  for (var ii=0; ii<data.length; ii++) {
    buf[ii] = data[ii];
  }
  return buf.toString();
}

var inflator = new pako.Inflate();
inflator.onData = function(chunk) {
  var v = _node_uint8array_to_string(chunk);
  process.stdout.write(v);
};

fs.open("./pako-fail-test-data.txt.gz", "r", function(err,fd) {
  if (err) { throw err; }
  function read_chunk() {
    fs.read(fd, buffer, 0, CHUNK_SIZE, null,
      function(err, nread) {
        var data = buffer;
        if (nread<CHUNK_SIZE) { data = buffer.slice(0, nread); }
        inflator.push(data, false);
        if (nread > 0) { read_chunk(); }
      });
  };
  read_chunk();
});

I did not signal the end of the stream (that is, I never called inflator.push(data, true) anywhere), and there may be other problems with how the data blocks are read from fs, but I hope you’ll forgive the sloppiness in the interest of keeping the example simple enough to illuminate the relevant issue.
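
For reference, a version of the read loop that does signal the end would look roughly like this (a sketch reusing the fd, buffer, CHUNK_SIZE, and inflator variables from above; for a regular file a short read means we’ve hit EOF):

// Sketch: same read loop, but pass `true` on the final push to mark end of stream.
function read_chunk() {
  fs.read(fd, buffer, 0, CHUNK_SIZE, null, function(err, nread) {
    if (err) { throw err; }
    var last = nread < CHUNK_SIZE;                 // short read => EOF for a regular file
    var data = last ? buffer.slice(0, nread) : buffer;
    inflator.push(data, last);                     // `true` on the final chunk
    if (!last) { read_chunk(); }
  });
}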

Running this successfully decompresses a portion of the file but then stops at what I believe to be the end of the first block. Here are some shell commands that might be enlightening:

$ node pako-error-example.js | wc
   1849   16755   65280
$ node pako-error-example.js | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8  -
$ zcat pako-fail-test-data.txt.gz | md5sum
7eae4c6bc0e68326879728f80a0e002b  -
$ zcat pako-fail-test-data.txt.gz | head -c 65280 | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8  -
$ zcat pako-fail-test-data.txt.gz | wc
   1858   16831   65569

Running another simple example using browserify-zlib triggers an error outright:

var fs = require("fs"),
    zlib = require("browserify-zlib");

var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();

z.on("data", function(chunk) {
  process.stdout.write(chunk.toString());
});
r.pipe(z);

And when run via node stream-example-2.js, the error produced is:

events.js:137
      throw er; // Unhandled 'error' event
      ^

Error: invalid distance too far back
    at Zlib._handle.onerror (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/index.js:352:17)
    at Zlib._error (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:283:8)
    at Zlib._checkError (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:254:12)
    at Zlib._after (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:262:13)
    at /home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:126:10
    at process._tickCallback (internal/process/next_tick.js:150:11)

I assume this is a pako error, as browserify-zlib uses pako underneath, so my apologies if this is a browserify-zlib error and has nothing to do with pako.

As a “control”, the following code works without issue:

var fs = require("fs"),
    zlib = require("zlib");

var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();

z.on("data", function(chunk) {
  process.stdout.write(chunk.toString());
});
r.pipe(z);

bgzip is used to allow random access to gzipped files. The resulting block-compressed file is a bit bigger than one produced by straight gzip compression, but the small increase in compressed size is often worth it for the ability to efficiently access arbitrary positions in the uncompressed data.
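
In case it’s useful context: as I understand the BGZF spec, each block is an ordinary gzip member whose FEXTRA field carries a “BC” subfield containing the total compressed block size, so the block boundaries can be walked without decompressing anything. Here is a rough sketch (the header offsets are from my reading of the spec, so please double-check):

// Sketch: list BGZF block sizes by reading BSIZE from the gzip FEXTRA "BC" subfield.
function bgzf_block_sizes(buf) {
  var sizes = [], pos = 0;
  while (pos + 18 <= buf.length) {
    if (buf[pos] !== 0x1f || buf[pos+1] !== 0x8b) { break; }   // not a gzip header
    var xlen = buf[pos+10] | (buf[pos+11] << 8);               // FEXTRA length
    var p = pos + 12, extra_end = p + xlen, bsize = -1;
    while (p + 4 <= extra_end) {                               // scan extra subfields
      var si1 = buf[p], si2 = buf[p+1];
      var slen = buf[p+2] | (buf[p+3] << 8);
      if (si1 === 66 && si2 === 67 && slen === 2) {            // 'B','C' subfield
        bsize = (buf[p+4] | (buf[p+5] << 8)) + 1;              // stored as (block size - 1)
        break;
      }
      p += 4 + slen;
    }
    if (bsize < 0) { break; }                                  // not a BGZF block
    sizes.push(bsize);
    pos += bsize;
  }
  return sizes;
}

Since each block is a self-contained gzip member, each slice should also inflate on its own (e.g. pako.ungzip(buf.subarray(start, start + bsize))), though that of course sidesteps rather than fixes the streaming issue.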

My specific use case is that I want to process a large text file (~115 MB compressed, ~650 MB uncompressed, with other files being even larger). Loading the complete file, either compressed or uncompressed, is not an option, either because of memory exhaustion or outright memory restrictions in JavaScript. I only need to process the data in a streaming manner (that is, I only need to look at the data once and can then mostly discard it), which is why I was looking into this option. The bioinformatics community uses this method quite a bit (bgzip ships with tabix as part of the bioinformatics library htslib), so it would be nice if pako supported this use case.
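
For what it’s worth, the processing I’m after is nothing fancier than this Node pipeline (a sketch using the built-in zlib and readline; this is the behavior I’d like to be able to reproduce in the browser):

var fs = require("fs"),
    zlib = require("zlib"),
    readline = require("readline");

// Sketch: stream-decompress and walk the file line by line, keeping only running stats.
var rl = readline.createInterface({
  input: fs.createReadStream("pako-fail-test-data.txt.gz").pipe(zlib.createGunzip())
});

var lines = 0, bytes = 0;
rl.on("line", function(line) {
  lines += 1;
  bytes += line.length + 1;   // +1 for the newline readline strips
});
rl.on("close", function() {
  console.log(lines + " lines, ~" + bytes + " bytes");
});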

If there is another library I should be using to allow for stream processing of compressed data in the browser, I would welcome any suggestions.

Most upvoted comments

@rbuels I’m working on reading local fastq.gz files in the browser and stumbled upon this issue. Haven’t been able to get pako to work so far. Is there currently a working solution for streaming bgzf files in the browser?

EDIT: I need streaming because the files are large. I don’t need (and can’t afford) to store the entire file in memory; I just need to stream through all the lines to gather some statistics.

For the bioinformaticians in the thread, just going to say that I ended up coding around this issue and eventually releasing https://github.com/GMOD/bgzf-filehandle and https://github.com/GMOD/tabix-js for accessing BGZIP and tabix files, respectively.

On Fri, Mar 1, 2019 at 12:37 PM Abe notifications@github.com wrote:

@bovee, bgzip allows random access to large gzip files. In bioinformatics, there’s often a need to access large files efficiently and at random (from 100 MB to 5 GB or more, compressed, representing a whole genome in some format, for example). Vanilla gzip requires decompressing everything up to a given position before you can read it.

By splitting the gzip file into blocks, you can create an index which can then be used for efficient random access. The resulting bgzip’d files are a bit bigger than compressing without blocks (i.e. just vanilla gzip), but most of the benefit of compression is retained while still allowing efficient random access to the file. There’s the added benefit that a bgzip’d file should look like a regular gzip file, so all the “standard” tools should still work to decompress it.

Here is what I believe to be the original paper by Heng Li on Tabix https://academic.oup.com/bioinformatics/article/27/5/718/262743 (Tabix has now been subsumed into htslib if I’m not mistaken).


We’re seeing this issue quite a bit (I wonder if bioinformaticians just really like gzipping in blocks!)

Changing the inflateReset call to inflateResetKeep fixes the “invalid distance too far back” error, but results in the chunk remainder being written into the out buffer (which is bad). That can be fixed by moving the status === c.Z_STREAM_END condition on the strm.next_out write branch, as in #146 (and I think that’s the right thing to do?). The “Read stream with SYNC marks” test still fails, though, and I don’t quite understand why either.

I put these changes up at https://github.com/onecodex/pako and I’m happy to restructure or make a PR if that’s helpful (thanks to Kirill89’s issue-139 branch and rbuels’ #145 PR for providing 99% of the work here).

@rbuels, look at these lines:

https://github.com/nodeca/pako/blob/c60b97e22239c02c0b5a112abbd6c6a9b5d86b45/lib/inflate.js#L279-L281

It seems this should be removed, because Z_STREAM_END is processed inside the loop and should not finalize the inflate. But after removing that line, one test fails, and that’s the main reason why @Kirill89’s commit was postponed.

I checked the same file against the original zlib code and found the same behavior (inflate returns Z_STREAM_END too early).

Also, I found a very interesting implementation of a wrapper around the inflate function. According to that implementation, we must do inflateReset on every Z_STREAM_END instead of terminating.
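
In pako terms, that loop would look roughly like this (a sketch against pako 1.x’s low-level modules; the module paths and stream-field names are my assumptions about the internals, so treat it as illustration only):

var zlib_inflate = require("pako/lib/zlib/inflate"),
    ZStream = require("pako/lib/zlib/zstream"),
    c = require("pako/lib/zlib/constants");

// Sketch: inflate a buffer that may contain several gzip members,
// calling inflateReset on every Z_STREAM_END instead of stopping.
function inflate_all_members(data, on_chunk) {   // data: Uint8Array
  var CHUNK = 16384;
  var strm = new ZStream();
  strm.input = data;
  strm.next_in = 0;
  strm.avail_in = data.length;

  zlib_inflate.inflateInit2(strm, 47);           // 47 = 15 + 32, gzip/zlib auto-detect

  var status;
  do {
    strm.output = new Uint8Array(CHUNK);
    strm.next_out = 0;
    strm.avail_out = CHUNK;

    status = zlib_inflate.inflate(strm, c.Z_NO_FLUSH);
    if (status !== c.Z_OK && status !== c.Z_STREAM_END) { break; }   // real error

    if (strm.next_out > 0) { on_chunk(strm.output.subarray(0, strm.next_out)); }

    if (status === c.Z_STREAM_END) {
      if (strm.avail_in === 0) { break; }        // no more members
      zlib_inflate.inflateReset(strm);           // start on the next gzip member
    }
  } while (true);

  zlib_inflate.inflateEnd(strm);
  return status;
}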

This is a possible fix: https://github.com/nodeca/pako/commit/c60b97e22239c02c0b5a112abbd6c6a9b5d86b45

After that fix, one test breaks, but I don’t understand why (I need your help to solve it).

Code to reproduce the same behavior:

    // READ FILE
    size_t file_size;
    Byte *file_buf = NULL;
    uLong buf_size;

    FILE *fp = fopen("/home/Kirill/Downloads/pako-fail-test-data.txt.gz", "rb");
    fseek(fp, 0, SEEK_END);
    file_size = ftell(fp);
    rewind(fp);
    buf_size = file_size * sizeof(*file_buf);
    file_buf = malloc(buf_size);
    fread(file_buf, file_size, 1, fp);

    // INIT ZLIB
    z_stream d_stream;
    d_stream.zalloc = Z_NULL;
    d_stream.zfree = Z_NULL;
    d_stream.opaque = (voidpf)0;

    d_stream.next_in  = file_buf;
    d_stream.avail_in = (uInt)buf_size;

    int err = inflateInit2(&d_stream, 47);
    CHECK_ERR(err, "inflateInit");

    // Inflate
    uLong chunk_size = 5000;
    Byte* chunk = malloc(chunk_size * sizeof(Byte));

    do {
        memset(chunk, 0, chunk_size);
        d_stream.next_out = chunk;
        d_stream.avail_out = (uInt)chunk_size;
        err = inflate(&d_stream, Z_NO_FLUSH);
        printf("inflate(): %s\n", (char *)chunk);
        if (err == Z_STREAM_END) {
//            inflateReset(&d_stream);  /* resetting here (instead of breaking) continues with the next gzip member */
            break;
        }
    } while (d_stream.avail_in);

    err = inflateEnd(&d_stream);
    CHECK_ERR(err, "inflateEnd");

Got it (with the simple example from the last post). I think the problem is here: https://github.com/nodeca/pako/blob/893381abcafa10fa2081ce60dae7d4d8e873a658/lib/inflate.js#L273

Let me explain. Pako consists of 2 parts:

  1. zlib port - very stable and well tested, but difficult to use directly.
  2. sugar wrappers for simple calls.

When we implemented the wrappers, we could not decide what to do if a stream consists of multiple parts (and probably returns multiple Z_STREAM_END). That’s not a widely used mode.

/cc @Kirill89 could you take a look?