node-archiver: unexpected behavior using append()

In the following example:

const fs = require('fs')
const { spawn } = require('child_process');
const archiver = require('archiver');

const archive = archiver('zip');

archive.pipe(fs.createWriteStream('test.zip'));

for ( let i = 0; i < 10; ++i ) {

	let child = spawn('dir', [], {
		shell: true,
		stdio: ['ignore', 'pipe', 'ignore'],
	});

	archive.append(child.stdout, { name: 'test'+i+'.txt' });
}

archive.finalize();

The file test.zip contains 10 files, but only the first one has content (the others are 0 bytes).

archiver v3.0.0, Node.js v11.3.0, OS: Windows 7

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 7
  • Comments: 26

Most upvoted comments

For anyone having the same problem and having trouble finding the workaround, the short answer is:

const { PassThrough } = require('stream')
archive.append(file.pipe(new PassThrough()), { name })

I tested this with network requests and child processes, and it seems to work every time.

Internally, archiver uses an async queue to queue up files to be added to the archive. Due to zip’s streaming nature, the files must be encoded one at a time, which means queued files will not be processed until the previous files have finished encoding. The async queue does that for you, but for Node streams there’s a catch: if your streams are already flowing when added to the queue, backpressure must be managed, which in the case of long-running tasks like these means pausing the stream for potentially minutes while a file downloads. The PassThrough() stream (and all native Node streams) are quite smart and will do that automatically for you, but they can only signal to the producer that they need to pause; they can’t force the producer to stop producing chunks, and their internal in-memory buffer is very limited. If the HTTP request keeps sending chunks after the buffer is full, those bytes will be lost.

Given this scenario, archiver’s internal queue becomes inappropriate, and the best solution I’ve found so far is implementing my own queuing for these jobs. It’s quite simple, though. Internally, archiver uses a lower-level module called zip-stream, which does exactly what archiver does except for the queuing part. You can manually queue your HTTP requests so that each one starts only after the previous file finishes encoding, this way creating the producer streams only when the consumer stream is ready to consume them. I usually queue with a sequence of chained promises, but any mechanism will do.
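For illustration, here is a rough sketch of that chained-promise queuing with zip-stream. The URL list, file names, and the appendNext helper are made up for the example; each download starts only after the previous entry’s callback has fired:

const fs = require('fs');
const https = require('https');
const ZipStream = require('zip-stream');

const archive = new ZipStream();
archive.pipe(fs.createWriteStream('out.zip'));

// Placeholder list of downloads
const urls = [
  { name: 'a.bin', url: 'https://example.com/a.bin' },
  { name: 'b.bin', url: 'https://example.com/b.bin' }
];

// Start a download and append it as a single zip entry; the promise
// resolves only once zip-stream has finished encoding the entry.
function appendNext({ name, url }) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      archive.entry(res, { name }, err => (err ? reject(err) : resolve()));
    }).on('error', reject);
  });
}

// Chain the promises so only one producer stream exists at a time.
urls
  .reduce((chain, item) => chain.then(() => appendNext(item)), Promise.resolve())
  .then(() => archive.finish())
  .catch(console.error);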

I’ve just set up a repository with my experiments in breaking archiver. There you’ll find a test file with a scenario using zip-stream to download gigabytes of files in a sequence of promises. This is the only test that passes with a list of such big files.

Sorry for the overly long text. I hope that helps anyone with the same problem.

I noticed that when files get truncated, more than one entry event has fired between append calls. I tried waiting on the entry event before appending the next file. This fixed most of the files, but in some cases the first file now gets truncated. Instead, I tried buffering the stdout stream through a PassThrough stream, which seems to work around the issue.

const fs = require('fs')
const child_process = require('child_process');
const { PassThrough } = require('stream');
const archiver = require('archiver');

issue_364();

async function issue_364() {
   await verify(wait_entry);
   await verify(pass_through);
   await verify(append_stream);
}

async function verify(generate_archive) {
   console.log(generate_archive.name);
   const archive = archiver('zip');
   const out = fs.createWriteStream('test.zip');
   archive.pipe(out);

   let done = new Promise(resolve => {
      out.on('finish', () => {
         verify_files();
         resolve();
      });
   });
   await generate_archive(archive);  
   archive.finalize();
   await done;
}

// fails all but first file
function append_stream(archive) {
   for ( let i = 0; i < 10; ++i ) {
      let child = spawn();
      append(child.stdout, i, archive);
   }
}

// all files pass
function pass_through(archive) {
   for ( let i = 0; i < 10; ++i ) {
      let pass = new PassThrough();
      let child = spawn();
      child.stdout.pipe(pass);
      append(pass, i, archive);
   }
}

// the first file still gets truncated if this is run first
// if this is run after another test it will pass
async function wait_entry(archive) {
   for ( let i = 0; i < 10; ++i ) {
      let child = spawn();
      let entry = new Promise(resolve => {
         archive.on('entry', resolve);
      });
      append(child.stdout, i, archive);
      await entry;
   }
}

// 13 bytes on stdout
function spawn() {
   return child_process.spawn('echo', ['hello world!']);
}

function append(stream, index, archive) {
   archive.append(stream, { name: `foo/test${index}.txt` });
}

function verify_files() {
   child_process.execSync('unzip test.zip');
   for(let file of fs.readdirSync('foo')) {
      let size = fs.statSync(`foo/${file}`).size;
      if(size != 13) {
         console.log(`${file} expected 13 bytes got ${size}`);
      }
   }
   child_process.execSync('rm -r foo test.zip');
} 

Same trouble here, on Linux Fedora 29, Node.js v10, archiver v3, when using an HTTP stream (IncomingMessage).

The content is randomly truncated in a totally non-deterministic way, i.e. the very same streams in the exact same order may succeed 10% of the time and fail 90% of the time, with half of the file truncated or even a length of zero bytes.

I have tried to append a stream only when the previous one has finished, but it doesn’t seem to improve the result.

Any news on this?

@zhujun24 This issue does not apply to your example, since you are appending a Buffer, and the problem only happens when appending a Stream.
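To make the distinction concrete, here is a small sketch (file names are placeholders): appending a Buffer is unaffected, while appending an already-flowing stream such as a child process’s stdout is the case that may be truncated.

const fs = require('fs');
const { spawn } = require('child_process');
const archiver = require('archiver');

const archive = archiver('zip');
archive.pipe(fs.createWriteStream('test.zip'));

// Appending a Buffer: not affected by this issue
archive.append(Buffer.from('hello world'), { name: 'buffer.txt' });

// Appending a flowing stream: this is the case that may get truncated
const child = spawn('echo', ['hello world']);
archive.append(child.stdout, { name: 'stream.txt' });

archive.finalize();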

To anyone interested, the fix for this issue was already accepted upstream in archiver-utils with the merged PR 17. Hopefully it’ll be in the next release of archiver.

@raghuchahar007 That part is about the transparent gzip compression the HTTP protocol supports: when you send an Accept-Encoding: gzip header, the web server may gzip the payload before sending it over the wire, and the client then has to gunzip it before using it. It’s usually only worthwhile for text files, so you probably don’t need it.
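If you do ask for compressed responses, a minimal sketch of that decompression step could look like this (the URL and file name are placeholders):

const fs = require('fs');
const https = require('https');
const zlib = require('zlib');
const archiver = require('archiver');

const archive = archiver('zip');
archive.pipe(fs.createWriteStream('test.zip'));

https.get('https://example.com/file.txt', { headers: { 'Accept-Encoding': 'gzip' } }, res => {
  // Only gunzip when the server actually compressed the body
  const body = res.headers['content-encoding'] === 'gzip'
    ? res.pipe(zlib.createGunzip())
    : res;
  archive.append(body, { name: 'file.txt' });
  archive.finalize();
});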

Thanks, @jntesteves. Yes, I was also searching for some end/finish/close event for the completion of archiver’s append() (not finalize()) so that I could queue my files accordingly, but didn’t find any. It would be good if archiver itself provided a way to wait for the previous file to finish before adding the others.

Following your suggestion to use zip-stream directly, I checked your code, adapted it to my requirements, and it now seems to work fine.

Here is the new sample:

const Packer = require("zip-stream");
const fs = require("fs");
const https = require("https");

const archive = new Packer();

const output = fs.createWriteStream("./ZipTest.zip");
archive.pipe(output);

const urls = [
  {
    fileName: "1.png",
    url: "https://homepages.cae.wisc.edu/~ece533/images/airplane.png"
  },
  {
    fileName: "2.png",
    url: "https://homepages.cae.wisc.edu/~ece533/images/boat.png"
  }
];

//urls must be https
const handleEntries = elem => {
  return new Promise((resolve, reject) => {
    const fileName = elem.fileName;
    const url = elem.url;
    console.log("Downloading : ", fileName);
    https.get(url, data => {
      archive.entry(data, { name: fileName }, (error, result) => {
        if (!error) {
          console.log(`File : ${fileName} appended.`);
          resolve(result);
        } else {
          console.error(`Error appending file : ${fileName} url : ${url}.`);
          reject(error);
        }
      });
      data.on("close", () => {
        console.log("Int : closed ,", fileName);
      });
    });
  });
};

const testZip = async () => {
  for (const elem of urls) {
    await handleEntries(elem);
  }
  archive.finish();
};

testZip();

Please correct me if anything is wrong here. Also, why do we need zlib.createGunzip() before piping?

I am having a similar issue, but with bigger files. Sample code:

const fs = require("fs");
const https = require("https");
const archiver = require("archiver");

const archive = archiver("zip");
archive.pipe(fs.createWriteStream("test.zip"));

const promiseAllLimit = 2; // how many downloads to append in parallel

const urls = [
  { fileName: "abc", url: "https://abcd.com/q?abc.mp4" },
  { fileName: "xyz", url: "https://abcd.com/q?xyz.mp4" }
];

(async () => {
  while (urls.length) {
    await Promise.all(
      urls.splice(0, promiseAllLimit).map(elem => {
        return new Promise((resolve, reject) => {
          const fileName = elem.fileName;
          const url = elem.url;
          if (url) {
            https.get(url, data => {
              archive.append(data, { name: fileName });
              data.on("close", () => {
                console.log("Closed : ", fileName);
                resolve("done");
              });
            });
          } else {
            resolve("skipped"); // avoid hanging Promise.all on a missing url
          }
        });
      })
    );
  }
  archive.finalize();
})();

This code works fine for small files, but whenever I try to zip large files it does not seem to work. Say I archive three files of 500 MB each: after archiving completes, only one of the three is intact and the other two are a few KB each. It seems like finalize() finishes before the read streams have been completely read.

Also, if I use a Buffer instead of a stream when appending to archiver, it works fine. But the buffer takes memory, and if the file is very large it will take a lot of memory, which is not the best option. Please help!
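For what it’s worth, here is a minimal sketch of the PassThrough workaround from the top of this thread applied to HTTP downloads (URLs and file names are placeholders); whether it holds up for very large files is exactly what the zip-stream discussion above is about, so manual queuing may be safer in that case.

const fs = require('fs');
const https = require('https');
const { PassThrough } = require('stream');
const archiver = require('archiver');

const archive = archiver('zip');
archive.pipe(fs.createWriteStream('test.zip'));

const urls = [
  { fileName: 'abc.mp4', url: 'https://abcd.com/q?abc.mp4' },
  { fileName: 'xyz.mp4', url: 'https://abcd.com/q?xyz.mp4' }
];

// Start all requests, pipe each flowing response through a PassThrough
// so archiver's queue can consume it later without losing chunks, and
// finalize only after every response has been appended.
Promise.all(
  urls.map(({ fileName, url }) => new Promise((resolve, reject) => {
    https.get(url, res => {
      archive.append(res.pipe(new PassThrough()), { name: fileName });
      resolve();
    }).on('error', reject);
  }))
)
  .then(() => archive.finalize())
  .catch(console.error);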