fd: fd is much slower when run with multiple threads

Unless run with -j1, fd takes over a thousand times longer.

$ git clone https://github.com/sharkdp/fd.git
$ cd fd/
$ hyperfine -w 1 "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):      3.601 s ±  1.014 s    [User: 0.008 s, System: 0.001 s]
  Range (min … max):    3.280 s …  6.487 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: fd -j1
  Time (mean ± σ):       3.0 ms ±   0.4 ms    [User: 2.3 ms, System: 0.0 ms]
  Range (min … max):     2.4 ms …   8.3 ms    792 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  'fd -j1' ran
    1212.48 ± 372.27 times faster than 'fd'
$ uname -a
Linux Ashtabula 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 27 (2 by maintainers)

Most upvoted comments

Right, but isn’t there something wrong with the “heuristics” mentioned in the man page for not specifying the number of jobs, if simply not specifying how many threads you want results in worse, or even worst-case, performance on a given CPU architecture? I don’t even know if it’s a configuration option; I’d hate to have to specify it for literally every search. (In my case it’s not terribly inconvenient, since I already use a wrapper function around it, but still.) Whatever “heuristics” are used should be torn out and just default to something like “half the number of cores with a max of 6/8”, because beyond that you’re probably bottlenecked on either disk I/O or the setup/teardown of the threads anyway…

I would appreciate it if we could calm down a bit 😄. The current default was not chosen without reason. It’s based on benchmarks on my machine (8 cores; see the disclaimer concerning benchmarks in the README: one particular benchmark on one particular machine). You can see some past benchmark results here or here. Or I can run one right now, on a different machine (12 cores):

hyperfine \
    --parameter-scan threads 1 16 \
    --warmup 3 \
    --export-json results.json \
    "fd -j {threads}"

[benchmark plot: mean search time vs. thread count]
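As a side note, the results.json exported by the parameter scan above can be mined for the best-performing thread count directly, assuming a recent hyperfine whose JSON export includes a per-result "parameters" field (and jq installed):

```shell
# Pick the thread count with the lowest mean time from hyperfine's
# JSON export. Assumes the --parameter-scan/--export-json invocation
# shown above has already produced results.json.
jq -r '.results | min_by(.mean) | .parameters.threads' results.json
```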

As you can tell, using N_threads = N_cores = 12 is not a bad heuristic in this case. I think we even used N_threads = 3 × N_cores in the past, because that resulted in even better performance for either warm-cache or cold-cache searches (I don’t remember which), but then we settled on the current strategy as a good tradeoff between the two scenarios. (Correction: that was in a different, but similar, project: https://github.com/sharkdp/diskus/issues/38#issuecomment-612772867)

But I admit: startup time is a different story. In an empty directory, it looks like this:

[benchmark plot: startup time in an empty directory vs. thread count]

But if I have to choose, I would definitely lean towards making long(er) searches faster instead of optimizing startup time… which is completely negligible unless you’re running hundreds of searches inside tiny directories. But then you’re probably using a script (where you can easily tune the number of --threads).

Now all that being said: if the current strategy shows unfavorable benchmark results on machines with N_cores ≫ 8, I’d be happy to implement something like min(N_cores, 12) as a default.

Also, we digress. As @tavianator pointed out, this ticket is about WSL. So maybe let’s get back to that topic and open a new ticket to discuss a better default --threads strategy (with actual benchmark results).

Whatever “heuristics” are used should be torn out and just default to something like “half the number of cores with a max of 6/8”

The current “heuristic” is just to use the number of CPU cores, as returned by num_cpus::get. Maybe it would make more sense to use get_physical? (That would return half the number of logical cores on your Threadripper with hyper-threading.)
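For reference, both figures are easy to inspect from a shell. This is a sketch: lscpu is from util-linux, and counting unique CORE,SOCKET pairs is only an approximation of what num_cpus::get_physical computes:

```shell
# Logical CPUs: what nproc and num_cpus::get report (includes SMT threads).
nproc

# Physical cores: unique (core id, socket) pairs, approximating
# num_cpus::get_physical. On an SMT machine this is typically half
# the logical count.
lscpu --parse=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
```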

I think it would make sense to have a maximum on that for the default number of threads, although I’m not sure what the best value of that would be.
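In the meantime, a min(N_cores, 12) default can be approximated with a small wrapper function. This is only a sketch: the name fdj and the cap of 12 are illustrative choices, not part of fd itself; nproc reports logical CPUs, the same figure num_cpus::get returns:

```shell
# Wrapper sketch: cap fd's thread count at 12 without typing -j every
# time. "fdj" and the cap value are illustrative, not part of fd.
fdj() {
  local cores jobs
  cores=$(nproc)
  jobs=$(( cores < 12 ? cores : 12 ))
  fd -j "$jobs" "$@"
}
```

Drop it into ~/.bashrc and use fdj wherever you would use fd.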

@tavianator @sharkdp I was tracking down why mold was running slower than lld and the default linker in WSL2 and found this. Nonetheless:

~
❯ fd --version
fdfind 9.0.0

~
❯ hyperfine -w 50 "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):       5.8 ms ±   0.8 ms    [User: 12.0 ms, System: 3.3 ms]
  Range (min … max):     4.1 ms …   9.2 ms    480 runs

Benchmark 2: fd -j1
  Time (mean ± σ):       8.0 ms ±   1.6 ms    [User: 6.7 ms, System: 2.7 ms]
  Range (min … max):     4.8 ms …  16.0 ms    526 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  fd ran
    1.38 ± 0.33 times faster than fd -j1

It seems improved. The weird threading behavior is still a hassle to optimize around, but the improvement is very much appreciated; thank you for all the hard work.

and just for reference, on v8.7.1
~
❯ hyperfine -w 50 "./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd" "./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1" -N
Benchmark 1: ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd
  Time (mean ± σ):      24.4 ms ±   3.2 ms    [User: 6.7 ms, System: 29.4 ms]
  Range (min … max):    17.3 ms …  49.7 ms    114 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1
  Time (mean ± σ):       8.1 ms ±   1.5 ms    [User: 4.3 ms, System: 2.8 ms]
  Range (min … max):     5.3 ms …  13.0 ms    487 runs

Summary
  ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1 ran
    3.03 ± 0.68 times faster than ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd

@matu3ba This script works over here:

#!/usr/bin/env bash

_fdtest() {
  local -  # bash 4.4+: makes shell options local, so errexit is reset when the function exits
  set -o errexit
  local _testlocname _testloc cpu_count
  _testlocname=$(echo "$RANDOM" | md5sum | cut -c1-8)
  _testloc="/tmp/$_testlocname"
  cpu_count=$(awk '/^processor/{n+=1}END{print n}' /proc/cpuinfo)
  echo "Testing $_testloc with $cpu_count CPUs"
  # "false" fails the compound command and triggers errexit IF errexit is set
  mkdir -p "$_testloc" >/dev/null 2>&1 || { echo "Cannot create test directory '$_testloc' in _fdtest: ${BASH_SOURCE[0]}:${BASH_LINENO[0]}"; false; }
  touch "$_testloc/$_testlocname"
  pushd "$_testloc" >/dev/null
  echo
  echo -n "Without -j1 argument:"
  time for ((n = 0; n < 10; n++)); do fd "$_testlocname" >/dev/null; done
  echo
  echo -n "With -j1 argument:"
  time for ((n = 0; n < 10; n++)); do fd -j1 "$_testlocname" >/dev/null; done
  popd >/dev/null
  rm "$_testloc/$_testlocname"
  rm -d "$_testloc"
}

_fdtest

_fdtest

Output for me (after chmod +x ~/Documents/fdtest.sh):

❯ ~/Documents/fdtest.sh
Testing /tmp/5ee7987d with 128 CPUs

Without -j1 argument:
real    0m1.665s
user    0m0.164s
sys     0m1.560s

With -j1 argument:
real    0m0.038s
user    0m0.007s
sys     0m0.033s

It’s about a 43x slowdown, at least with this number of detected CPUs. (I believe it’s actually, technically, 64 cores and 128 threads, but anyway.)

WSL2 has severe and well-known performance issues; this one, for example, is specific to the filesystem: microsoft/WSL#4197

That’s why I asked what filesystem was being used. Accessing Windows files over 9p is slow in WSL2, but the OP is accessing Linux files in an ext4 filesystem. Since this is just the regular Linux ext4 implementation, it should be just about as fast as native Linux (except for the actual I/O).
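A quick way to check which case applies is to print the filesystem type backing the directory you are searching (df -T is standard coreutils; the 9p/ext4 distinction is the one described above):

```shell
# Print the filesystem type of the current directory. Under WSL2, a
# Windows mount like /mnt/c typically reports 9p, while the Linux root
# reports ext4; only the latter is expected to be near-native.
df -T . | awk 'NR == 2 { print $2 }'
```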

Threading alone has up to 5x performance penalties: dotnet/runtime#42994

That looks potentially relevant. Thread-local storage performs poorly on WSL2 for some reason.

Moreover, WSL2 is a full VM, so you will never get performance close to a native Linux kernel: https://learn.microsoft.com/en-us/windows/wsl/compare-versions. For that, something like KVM would be needed, but such things are only available as proprietary products on Windows.

WSL2 uses the “Virtual Machine Platform”, a subset of Hyper-V which is “something like KVM”. It should be close to native performance for things that don’t need to cross the hypervisor boundary often.

Did you build fd yourself or install a pre-built version?