node-archiver: unexpected behavior using append()
In the following example:
const fs = require('fs');
const { spawn } = require('child_process');
const { Readable } = require('stream');
const archiver = require('archiver');

const archive = archiver('zip');
archive.pipe(fs.createWriteStream('test.zip'));

for (let i = 0; i < 10; ++i) {
  let child = spawn('dir', [], {
    shell: true,
    stdio: ['ignore', 'pipe', 'ignore'],
  });
  archive.append(child.stdout, { name: 'test' + i + '.txt' });
}

archive.finalize();
The file test.zip contains 10 files, but only the first one has content (the others are 0 bytes).
archiver v3.0.0, Node.js v11.3.0, OS: Windows 7
For anyone having the same problem and having trouble finding the workaround, the short answer is below. I tested this with network requests and child processes, and it seems to work every time.
Internally, archiver uses an async queue to queue up files to be added to the archive. Due to zip's streaming nature, files must be encoded one at a time, which means queued files will not be processed until the previous files have finished encoding. The async queue does that for you, but for Node streams there's a catch: if your streams are already flowing when added to the queue, backpressure must be managed, which in the case of long-running tasks like these means pausing the stream for potentially minutes while a file downloads. A PassThrough() stream (and all native Node streams) is quite smart and will do that automatically for you, but it can only signal to the producer that it needs to pause; it can't force the producer to stop producing chunks, and its internal in-memory buffer is very limited. If the HTTP request keeps sending chunks after the buffer is full, those bytes are lost.
Given this scenario, archiver's internal queue becomes inappropriate, and the best solution I've found so far is implementing my own queuing for these jobs. It's quite simple, though. Internally, archiver uses a lower-level module called zip-stream. It does exactly what archiver does, except for the queuing part. You can manually queue your HTTP requests to be started only after the previous file finishes encoding, this way creating the producer streams only when the consumer stream is ready to start consuming them. I usually queue with a sequence of chained promises, but any mechanism will do.
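A minimal sketch of that idea (not the actual code from the repository mentioned below): the URLs and file names are placeholders, and each request is only started after the previous entry has finished encoding.

const fs = require('fs');
const https = require('https');
const ZipStream = require('zip-stream');

const archive = new ZipStream();
archive.pipe(fs.createWriteStream('out.zip'));

// Placeholder list of downloads; each one is only requested once the
// previous entry has been fully written to the archive.
const files = [
  { name: 'a.bin', url: 'https://example.com/a.bin' },
  { name: 'b.bin', url: 'https://example.com/b.bin' },
];

// Wrap zip-stream's callback-style entry() in a promise so the jobs
// can be chained sequentially.
function addEntry(source, name) {
  return new Promise((resolve, reject) => {
    archive.entry(source, { name }, (err) => (err ? reject(err) : resolve()));
  });
}

function download(url) {
  return new Promise((resolve, reject) => {
    https.get(url, resolve).on('error', reject);
  });
}

// Chain the jobs: the next request starts only after the previous
// entry has finished encoding.
files
  .reduce(
    (chain, file) =>
      chain.then(() => download(file.url)).then((res) => addEntry(res, file.name)),
    Promise.resolve()
  )
  .then(() => archive.finish())
  .catch(console.error);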
I've just set up a repository with my experiments in breaking archiver. There you'll find a test file with a scenario using zip-stream to download gigabytes of files in a sequence of promises. This is the only test that passes with a list of such big files.
Sorry for the overly long text. I hope this helps anyone with the same problem.
I noticed that when files get truncated, more than one entry event has fired between append calls. I tried waiting on the entry event before appending the next file. This fixed most of the files, but in some cases the first file now gets truncated instead. I then tried buffering the stdout stream through a PassThrough stream, which seems to work around the issue.
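For reference, a minimal sketch of that PassThrough-buffering workaround applied to the child-process example from the top of this issue; note that PassThrough's internal buffer is limited, as discussed above, so this may not hold for very large outputs.

const fs = require('fs');
const { spawn } = require('child_process');
const { PassThrough } = require('stream');
const archiver = require('archiver');

const archive = archiver('zip');
archive.pipe(fs.createWriteStream('test.zip'));

for (let i = 0; i < 10; ++i) {
  const child = spawn('dir', [], {
    shell: true,
    stdio: ['ignore', 'pipe', 'ignore'],
  });

  // Buffer the child's stdout in a PassThrough stream so the data
  // produced before archiver starts reading this entry is retained.
  const buffered = new PassThrough();
  child.stdout.pipe(buffered);

  archive.append(buffered, { name: 'test' + i + '.txt' });
}

archive.finalize();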
Same trouble here, on Linux (Fedora 29), Node.js v10, archiver v3, when using an HTTP stream (IncomingMessage).
The content is randomly truncated in a totally non-deterministic way, i.e. the very same streams in the exact same order may succeed 10% of the time and fail 90% of the time with half of the file truncated, or even with a length of zero bytes.
I have tried appending a stream only when the previous one has finished; it doesn't seem to improve the result.
Any news on this?
@zhujun24 This issue does not apply to your example, since you are appending a Buffer, and the problem only happens when appending a Stream.
To anyone interested, the fix for this issue was already accepted upstream in archiver-utils with the merged PR 17. Hopefully it'll be in the next release of archiver.
@raghuchahar007 That part is actually about the transparent gzipping that the HTTP protocol supports when you send an Accept-Encoding: gzip header, and the web server gzips any payload before sending it over the wire. It's usually used for text files, so you probably don't need that.
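For illustration only (the URL and entry name are placeholders), this is roughly where zlib.createGunzip() sits when a server honors the Accept-Encoding header and returns a gzipped body:

const fs = require('fs');
const https = require('https');
const zlib = require('zlib');
const archiver = require('archiver');

const archive = archiver('zip');
archive.pipe(fs.createWriteStream('out.zip'));

https.get(
  'https://example.com/report.txt',            // placeholder URL
  { headers: { 'Accept-Encoding': 'gzip' } },   // ask the server for gzip
  (res) => {
    // If the server compressed the body, undo that before archiving;
    // otherwise pass the response through unchanged.
    const source =
      res.headers['content-encoding'] === 'gzip'
        ? res.pipe(zlib.createGunzip())
        : res;

    archive.append(source, { name: 'report.txt' });
    archive.finalize();
  }
);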
Thanks, @jntesteves. Yes, I was also searching for an end/finish/close event signalling the completion of archiver's append() (not finalize()) so that I could queue my files accordingly, but didn't find any. It would be good if archiver itself provided some flexibility over its append method so that we could wait for the previous file before adding others.
As per your suggestion about using zip-stream directly, I checked your code, modified it to my requirements, and it now seems to be working fine.
Here is the new sample:
Please correct me if anything is wrong here. Also, why do we need zlib.createGunzip() before piping?
I am having a similar issue, but with bigger files. Sample code:
This code works fine for small files, but whenever I try to zip large files it does not seem to work. Say I try archiving three files of 500 MB each: after archiving completes, only one of the three files is complete and the other two are in KBs. It seems like finalize() is finishing before the read streams have been fully consumed.
Also, if I use a Buffer instead of a stream when appending to archiver, it works fine. But a Buffer takes memory, and if the file is very large it will take a lot of memory, which is not the best option. Please help!
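Following the zip-stream suggestion earlier in this thread, a minimal sketch of how large local files could be appended sequentially, so that each read stream is opened only when the previous entry has finished encoding; the file paths are placeholders.

const fs = require('fs');
const ZipStream = require('zip-stream');

const archive = new ZipStream();
archive.pipe(fs.createWriteStream('big.zip'));

// Placeholder paths for the large files.
const paths = ['file1.bin', 'file2.bin', 'file3.bin'];

// Promisified wrapper around zip-stream's entry() callback.
function addEntry(source, name) {
  return new Promise((resolve, reject) => {
    archive.entry(source, { name }, (err) => (err ? reject(err) : resolve()));
  });
}

(async () => {
  for (const path of paths) {
    // The read stream is created only when the archive is ready for it,
    // so no data is produced while a previous entry is still encoding.
    await addEntry(fs.createReadStream(path), path);
  }
  archive.finish();
})().catch(console.error);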