node-glob: Performance order of magnitude slower than native

For patterns like **/*.txt, node-glob seems to be an order of magnitude slower than native globbing.

Here is a script to demonstrate this (save it as node-glob/benchmark.sh): https://gist.github.com/joliss/7469304

Here are the timings from running this on my system:

  • Bash timing: 0m0.696s
  • Node statSync and readdirSync timing: 0m0.977s
  • Node glob.sync timing: 0m9.676s
  • Node glob async timing: 0m22.130s (fluctuating somewhat)

Based on the recursive statSync + readdirSync performance, it seems that it should be possible to make node-glob nearly as fast as native, at least for simple patterns.
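
For reference, here's a minimal sketch of the kind of recursive statSync + readdirSync crawl the benchmark uses as its baseline (the hard-coded .txt check stands in for real pattern matching):

```js
const fs = require('fs');
const path = require('path');

// Naive crawl: one readdir per directory, one stat per entry.
function walk(dir, matches) {
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    if (fs.statSync(full).isDirectory()) {
      walk(full, matches); // recurse into subdirectories
    } else if (full.endsWith('.txt')) {
      matches.push(full); // stand-in for matching **/*.txt
    }
  }
  return matches;
}

console.log(walk('.', []).length);
```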

This is an actual issue for me, even with sets of only a few hundred files: in a JavaScript build tool I’m writing, the execution time of the most performance-critical code path is currently bottlenecked on node-glob. I know I could avoid using node-glob if I wanted to, but I’d love to expose globbing syntax in my tool’s API.

I’m guessing that there are two possible sources for this slowness:

  • Duplicate stat or readdir calls (related: #76)
  • The pattern matching burns too many CPU cycles (perhaps minimatch is the culprit; see the sketch below)
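
To test the second theory in isolation, here's a rough micro-benchmark of minimatch's CPU cost with no fs work at all (iteration count and sample paths are arbitrary; this assumes the classic API where require('minimatch') returns the matcher function):

```js
const minimatch = require('minimatch');

const paths = ['a/b/c/file.txt', 'deep/er/still/notes.md', 'x.txt'];

console.time('minimatch x 100k');
for (let i = 0; i < 1e5; i++) {
  // Match each sample path against the globstar pattern from the report.
  for (const p of paths) minimatch(p, '**/*.txt');
}
console.timeEnd('minimatch x 100k');
```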

If someone wants to dig into this and figure out what’s going on, you’d be my hero!

If it’s technically hard to fix this, I’d also like to find out sooner rather than later, because it would mean that maybe I shouldn’t rely on globbing in my tool’s API.

About this issue

  • State: closed
  • Created 11 years ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

@bpasero Even if you do the file actions asynchronously, there is usually a limit to just how async the fs can be, and there is a hard limit to how many threads node will use to perform async fs operations. (Once upon a time that was 4, but I recall it being increased; maybe there’s a cleverer heuristic involved since last I saw that code, which was a few years ago. It might not be as “hard”, but there is still definitely a limit.)
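
As an aside, the pool being described here is libuv's threadpool, which services async fs calls and defaults to 4 threads; UV_THREADPOOL_SIZE can widen it. A minimal sketch (since the pool is created lazily, setting the variable at the very top of the entry script generally takes effect):

```js
// Must run before the first async fs call initializes the pool.
process.env.UV_THREADPOOL_SIZE = '16';

const fs = require('fs');

// These readdirs can now be spread across up to 16 pool threads, but
// they still contend for the same disk underneath.
['.', '..'].forEach((dir) => {
  fs.readdir(dir, (err, names) => {
    if (err) throw err;
    console.log(dir, names.length, 'entries');
  });
});
```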

So, if you are parallelizing more fs ops than node allocates threads, it’s going to back up, and be effectively serialized anyway. (Of course, “serialized” with multiple parallel tracks, as each thread finishes what it’s doing and then takes on a new job.)

However, all of this is ultimately limited by the file system itself! Spinning disks can only be in one position, so if you’re trying to read 2 files in parallel, the fs will either thrash (making both operations slower than they would’ve been), or make one operation wait for the other.

Stat and readdir calls are less thrashy and more cacheable by the fs, since they’re only dealing with the file link table (at least, on most Unixy systems). In fact, if every bit of data is already available in the operating system’s fs cache, they’ll be very fast, and the overhead of async thread management becomes significant by comparison.

“Parallelizing” operations on a single machine (whether that’s a single cpu or a single disk) is more like “multitasking” than “teamwork”. When you parallelize things across a network, of course, it’s a whole different story, which is why things like Hadoop are so great for massively big data problems, compared with writing a simpler program for a single machine. But when you parallelize a bunch of operations that are effectively throttled by a single bit of hardware, as a general rule, you’re almost certainly not going to make things faster.

Lots of qualifiers there. It’s highly dependent on a lot of factors, like what specific type of disk is involved, the operating system, the file system type, as well as just whatever else the machine happens to be doing at the time.

If you are doing many things in parallel (for example, it’s a web server, where you don’t want to pause in replying to one user’s HTTP request while you read a file for another user), then async is absolutely the way to go. Node makes this the “default” path because it’s designed for building network programs, where this usually matters.

If you’re a web server, then providing users with a consistently fast connection pick-up, and a predictable first-byte time, is all that really matters. The “average file read time” is 100% worthless as a metric, if you lost business because some unlucky user had to wait 2s for a web page to load, and decided to hit the back button.

But if you’re a cli utility, especially one that doesn’t make network requests, then the situation is completely reversed. The better approach is to avoid thrashing the disk, and since you have exactly one user who is going to wait for every single task to be completed, reducing the average task time is the way to go, so sync isn’t just easier, it’s also better.
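
To make the contrast concrete, here's a sketch of the two shapes (hypothetical file list; the async variant assumes a Node new enough to have fs.promises):

```js
const fs = require('fs');

// CLI-style: one user waits for the whole job, so total time is what
// counts. Reading serially never thrashes the disk.
function readAllSync(files) {
  return files.map((f) => fs.readFileSync(f));
}

// Server-style: everything queued at once. Reads still funnel through
// the libuv threadpool, but no caller is blocked while waiting.
function readAllAsync(files) {
  return Promise.all(files.map((f) => fs.promises.readFile(f)));
}
```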

Ok! Initial investigation suggests that the biggest source of the delay is reliance on node’s fs.readdir to determine both directory-ness and contents. Since this uses a file descriptor, it’s getting wrapped by graceful-fs, which is explicitly designed to trade speed in favor of managed resource utilization. So that sucks.
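
For anyone unfamiliar, that trade looks roughly like this; this is not graceful-fs's actual code, just the queue-and-retry-on-EMFILE pattern its docs describe:

```js
const fs = require('fs');
const queue = [];

// Park the job and retry later instead of failing when fds run out.
function managedReaddir(dir, cb) {
  fs.readdir(dir, (er, entries) => {
    if (er && er.code === 'EMFILE') queue.push([dir, cb]);
    else cb(er, entries);
  });
}

// Called whenever a descriptor is released elsewhere.
function retryOne() {
  const job = queue.shift();
  if (job) managedReaddir(job[0], job[1]);
}
```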

Additionally, there’s some just plain shoddy JavaScript that’s accreted over time in the readdir callback handling that V8 is having trouble optimizing. I think that whole area needs a thorough refactoring to remove a few steps, and especially since glob matching is being used in some very time-sensitive applications, all of the .bind() and .call() usage needs to be eliminated. Wrapping callbacks to do stuff at the end, tons of anonymous closures being defined inside of methods, it’s kind of a giant mess. I have enlisted the help of The Dark Pony Zalgo to have a single code path that is used for both sync and async behavior. I thought that I could preserve sanity if His Dark Tendrils were buried deep enough below the API layer. I was wrong.
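
As a tiny illustration of the .bind() cost in question (a micro-benchmark sketch, not glob's actual code): each .bind() allocates a fresh function object, so doing it per call churns memory and defeats call-site optimization, while a named function is allocated once and reused:

```js
function onEntry(name) {
  return name.length;
}

const entries = ['a', 'bb', 'ccc'];

console.time('reused named function');
for (let i = 0; i < 1e6; i++) entries.map(onEntry);
console.timeEnd('reused named function');

console.time('fresh .bind() per call');
for (let i = 0; i < 1e6; i++) entries.map(onEntry.bind(null));
console.timeEnd('fresh .bind() per call');
```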

Lastly, in the globstar case specifically, both bash 4.3 and the next version of node-glob have to do an lstat on all directories encountered in the crawl, since symlinks to directories should not be crawled. So, the Node statSync and readdirSync timing benchmark test is not quite an apples-to-apples comparison. It should probably lstatSync each directory as well as readdiring it. It’s still a LOT faster than the node-glob tests, but falls a bit more behind the bash/zsh test cases, as one would expect.
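
In code, that globstar rule looks roughly like this (a sketch, not the actual implementation): lstat reports a symlink as a symlink instead of following it, so a symlinked directory is listed as a match but never descended into:

```js
const fs = require('fs');
const path = require('path');

function globstarWalk(dir, out) {
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    const st = fs.lstatSync(full); // lstat: symlinks are NOT followed
    if (st.isDirectory()) {
      globstarWalk(full, out); // only real directories are crawled
    } else {
      out.push(full); // files AND symlinks (even to dirs) land here
    }
  }
  return out;
}
```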

I’ve got the bash 4.3-style globstar behavior implemented properly on a branch now, but it’s even slower. I’d like to refactor and optimize it to get the speed up before merging into master, which will make this much better.