fd: fd is much slower when run with multiple threads
When not using -j1, fd takes thousands of times longer.
```
$ git clone https://github.com/sharkdp/fd.git
$ cd fd/
$ hyperfine -w 1 "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):      3.601 s ±  1.014 s    [User: 0.008 s, System: 0.001 s]
  Range (min … max):    3.280 s …  6.487 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: fd -j1
  Time (mean ± σ):       3.0 ms ±   0.4 ms    [User: 2.3 ms, System: 0.0 ms]
  Range (min … max):     2.4 ms …   8.3 ms    792 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  'fd -j1' ran
 1212.48 ± 372.27 times faster than 'fd'

$ uname -a
Linux Ashtabula 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```
About this issue
- State: closed
- Created 2 years ago
- Comments: 27 (2 by maintainers)
I would appreciate it if we could calm down a bit 😄. The current default was not chosen without reason. It's based on benchmarks on my machine (8 cores; see the disclaimer concerning benchmarks in the README: one particular benchmark on one particular machine). You can see some past benchmark results here or here. Or I can run one right now, on a different machine (12 cores):
As you can tell, using `N_threads = N_cores = 12` is not a bad heuristic in this case. I think we even used `N_threads = 3 × N_cores` in the past (no, that was in a different, but similar, project: https://github.com/sharkdp/diskus/issues/38#issuecomment-612772867), because that resulted in even better performance for either warm-cache or cold-cache searches (I don't remember which). But then we settled on the current strategy as a good tradeoff between the two scenarios.

But I admit: startup time is a different story. In an empty directory, it looks like this:

But if I have to choose, I would definitely lean towards making long(er) searches faster instead of optimizing startup time, which is completely negligible unless you're running hundreds of searches inside tiny directories. But in that case you're probably using a script (where you can easily tune `--threads`).

Now, all that being said: if the current strategy shows unfavorable benchmark results on machines with N_cores ≫ 8, I'd be happy to implement something like `min(N_cores, 12)` as a default.

Also, we digress. As @tavianator pointed out, this ticket is about WSL. So let's get back to that topic and open a new ticket to discuss a better default `--threads` strategy (with actual benchmark results).

The current heuristic is just to use the number of CPU cores, as returned by `num_cpus::get`. Maybe it would make more sense to use `get_physical`? (That would return half the number of logical cores for your Threadripper with hyper-threading.) I think it would make sense to have a maximum on the default number of threads, although I'm not sure what the best value would be.
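A capped default along these lines could be sketched as follows. This is only a sketch of the `min(N_cores, 12)` idea floated above: fd itself uses the `num_cpus` crate, while this version uses the standard library's `available_parallelism()` as a stand-in, and the cap of 12 is just the suggested value, not a tuned constant.

```rust
use std::cmp::min;
use std::num::NonZeroUsize;
use std::thread;

// Hypothetical cap, taken from the min(N_cores, 12) suggestion above.
const MAX_DEFAULT_THREADS: usize = 12;

/// Default thread count: number of available cores, capped at a maximum.
/// fd queries num_cpus::get(); std::thread::available_parallelism() is a
/// close stand-in that needs no external crate.
fn default_thread_count() -> usize {
    let cores = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    min(cores, MAX_DEFAULT_THREADS)
}

fn main() {
    println!("default thread count: {}", default_thread_count());
}
```

On a 128-thread Threadripper this would pick 12 threads instead of 128, while leaving small machines unaffected; `--threads` would still override it.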
@tavianator @sharkdp I was tracking down why mold was running slower than lld and the default linker in WSL2 and found this. Nonetheless:

It seems improved. It's a hassle to optimize for, given the weird threading behavior, but it's very much appreciated. Thank you both for all the hard work.
and just for reference, on v8.7.1
@matu3ba This script works over here:
Output for me (after `chmod +x ~/Documents/fdtest.sh`):

It's about a 43x slowdown, at least with this number of detected CPUs. (I believe it's actually, technically, 64 CPUs and 128 threads, but anyway.)
That’s why I asked what filesystem was being used. Accessing Windows files over 9p is slow in WSL2, but the OP is accessing Linux files in an ext4 filesystem. Since this is just the regular Linux ext4 implementation, it should be just about as fast as native Linux (except for the actual I/O).
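One way to confirm which filesystem a given path lives on is to scan `/proc/mounts` and take the longest matching mount point. The sketch below shows only the parsing logic; the sample string and the `fs_type_for` function name are illustrative, standing in for real `/proc/mounts` contents.

```rust
/// Return the filesystem type for `path` given the contents of /proc/mounts.
/// Fields per line are: device, mount point, fstype, options, dump, pass.
fn fs_type_for<'a>(mounts: &'a str, path: &str) -> Option<&'a str> {
    mounts
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let _device = fields.next()?;
            let mount_point = fields.next()?;
            let fstype = fields.next()?;
            path.starts_with(mount_point).then_some((mount_point, fstype))
        })
        // The most specific (longest) matching mount point wins.
        .max_by_key(|(mount_point, _)| mount_point.len())
        .map(|(_, fstype)| fstype)
}

fn main() {
    // Made-up sample resembling a WSL2 /proc/mounts; in real use, read the
    // actual file with std::fs::read_to_string("/proc/mounts").
    let sample = "/dev/sdc / ext4 rw 0 0\nC:\\ /mnt/c 9p rw 0 0";
    println!("{:?}", fs_type_for(sample, "/mnt/c/Users")); // 9p => slow path
    println!("{:?}", fs_type_for(sample, "/home/user")); // ext4 => fast path
}
```

Anything under `/mnt/c` resolving to `9p` would explain a slowdown; a plain `ext4` mount, as in the OP's case, would not.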
That looks potentially relevant. Thread-local storage performs poorly on WSL2 for some reason.
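To check whether TLS access itself is the bottleneck, one could time repeated thread-local accesses in isolation and compare the numbers inside and outside WSL2. A minimal sketch, where the counter, iteration count, and function name are illustrative and not taken from fd's code:

```rust
use std::cell::Cell;
use std::time::Instant;

thread_local! {
    // Illustrative thread-local; each .with() call below goes through a
    // TLS lookup, which is the operation suspected of being slow on WSL2.
    static COUNTER: Cell<u64> = Cell::new(0);
}

/// Perform `iters` thread-local updates and return the final counter value.
fn run_bench(iters: u64) -> u64 {
    COUNTER.with(|c| c.set(0)); // reset so repeated runs are deterministic
    for i in 0..iters {
        COUNTER.with(|c| c.set(c.get().wrapping_add(i)));
    }
    COUNTER.with(|c| c.get())
}

fn main() {
    let start = Instant::now();
    let checksum = run_bench(10_000_000);
    // The checksum keeps the loop from being optimized away entirely.
    println!("10M TLS updates: {:?} (checksum {})", start.elapsed(), checksum);
}
```

A large gap between this timing under WSL2 and on native Linux would support the TLS theory; comparable timings would point the investigation elsewhere.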
WSL2 uses the “Virtual Machine Platform”, a subset of Hyper-V which is “something like KVM”. It should be close to native performance for things that don’t need to cross the hypervisor boundary often.
Did you build `fd` yourself, or did you install a pre-built version?