mold: Performance regression in 1.8.0 (+ `lld` sometimes faster than mold)

While benchmarking mold on an in-house, proprietary application, I noticed that v1.8.0 is sometimes slower than v1.7.1.

In both cases, mold was built with GCC 10.3.0, strictly following the build instructions in the README (including selecting the Release build type).

The application is a large, statically linked C++20 Linux x86_64 binary, built with GCC 12.2.0. The linker flags used are `-Wl,--no-keep-memory -Wl,--gdb-index -Wl,--threads=X` (correct me if I am wrong, but it looks like `--no-keep-memory` is a no-op for both lld and mold).

Specs of the machine used to run the benchmark:

OS: Ubuntu 20.04 with kernel 5.4. To make the results more reliable, I disabled Turbo Boost and set the CPU governor to performance (but saw the same pattern without doing so)
CPU: Dual-socket Intel Xeon Gold 5218 (2x 16 cores, with hyperthreading enabled)
RAM: 128GB
Disk: NVMe SSD

The benchmark deletes the binary (built in debug mode) in a fresh build folder and calls `ninja MyBinary`. I made sure that nothing of importance other than the linker was invoked. For each parameter value in the sweep, one warmup run is done, followed by 3 timed runs. The binary is deleted before each run.
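For readers who want to reproduce the procedure described above, here is a minimal sketch in Python. It is not the script used for the numbers in this report: the function names `time_link` and `summarize` are made up for illustration, and it assumes the thread count has already been set in the build's linker flags before each sweep.

```python
import statistics
import subprocess
import time
from pathlib import Path


def time_link(build_cmd, binary, warmups=1, runs=3):
    """Delete `binary`, run `build_cmd`, and return the wall-clock times
    (in seconds) of the timed runs. Warmup runs are executed but discarded."""
    timings = []
    for i in range(warmups + runs):
        # Remove the output so the build tool has to relink on every run.
        Path(binary).unlink(missing_ok=True)
        start = time.perf_counter()
        subprocess.run(build_cmd, check=True, stdout=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        if i >= warmups:
            timings.append(elapsed)
    return timings


def summarize(timings):
    """Return (mean, sample standard deviation, min, max) of the timings."""
    return (
        statistics.mean(timings),
        statistics.stdev(timings),
        min(timings),
        max(timings),
    )
```

A call such as `summarize(time_link(["ninja", "MyBinary"], "MyBinary"))` would then give the same four figures that each benchmark entry reports.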

Here are the results. Notice how mold 1.8.0 becomes slower than 1.7.1 once 8 or more threads are used:

Benchmark 1: mold180 (1 threads)
  Time (mean ± σ):     35.016 s ±  0.455 s
  Range (min … max):   34.697 s … 35.536 s

Benchmark 2: mold180 (2 threads)
  Time (mean ± σ):     21.217 s ±  0.345 s
  Range (min … max):   20.822 s … 21.456 s

Benchmark 3: mold180 (4 threads)
  Time (mean ± σ):     11.208 s ±  0.044 s
  Range (min … max):   11.162 s … 11.249 s

Benchmark 4: mold180 (8 threads)
  Time (mean ± σ):      6.851 s ±  0.102 s
  Range (min … max):    6.733 s …  6.915 s

Benchmark 5: mold180 (10 threads)
  Time (mean ± σ):      5.842 s ±  0.034 s
  Range (min … max):    5.813 s …  5.879 s

Benchmark 6: mold180 (12 threads)
  Time (mean ± σ):      4.936 s ±  0.020 s
  Range (min … max):    4.913 s …  4.948 s

Benchmark 7: mold180 (14 threads)
  Time (mean ± σ):      4.333 s ±  0.224 s
  Range (min … max):    4.192 s …  4.592 s

Benchmark 8: mold180 (16 threads)
  Time (mean ± σ):      3.979 s ±  0.117 s
  Range (min … max):    3.864 s …  4.098 s

Benchmark 9: mold180 (32 threads)
  Time (mean ± σ):      3.475 s ±  0.039 s
  Range (min … max):    3.441 s …  3.517 s

Benchmark 10: mold171 (1 threads)
  Time (mean ± σ):     35.475 s ±  0.128 s
  Range (min … max):   35.328 s … 35.557 s

Benchmark 11: mold171 (2 threads)
  Time (mean ± σ):     20.661 s ±  0.109 s
  Range (min … max):   20.547 s … 20.764 s

Benchmark 12: mold171 (4 threads)
  Time (mean ± σ):     11.314 s ±  0.076 s
  Range (min … max):   11.236 s … 11.387 s

Benchmark 13: mold171 (8 threads)
  Time (mean ± σ):      6.280 s ±  0.182 s
  Range (min … max):    6.166 s …  6.490 s

Benchmark 14: mold171 (10 threads)
  Time (mean ± σ):      5.368 s ±  0.107 s
  Range (min … max):    5.245 s …  5.436 s

Benchmark 15: mold171 (12 threads)
  Time (mean ± σ):      4.560 s ±  0.131 s
  Range (min … max):    4.474 s …  4.711 s

Benchmark 16: mold171 (14 threads)
  Time (mean ± σ):      4.083 s ±  0.111 s
  Range (min … max):    3.976 s …  4.197 s

Benchmark 17: mold171 (16 threads)
  Time (mean ± σ):      3.780 s ±  0.046 s
  Range (min … max):    3.727 s …  3.807 s

Benchmark 18: mold171 (32 threads)
  Time (mean ± σ):      3.358 s ±  0.022 s
  Range (min … max):    3.332 s …  3.372 s
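To put a number on the regression, the mean times above can be turned into a relative slowdown of 1.8.0 versus 1.7.1 per thread count. The snippet below is only a sketch: the means are copied from the benchmark output, and the helper name `slowdown_percent` is made up for illustration.

```python
# Mean link times in seconds, copied from the benchmark output above,
# keyed by thread count.
mold_180 = {1: 35.016, 2: 21.217, 4: 11.208, 8: 6.851, 10: 5.842,
            12: 4.936, 14: 4.333, 16: 3.979, 32: 3.475}
mold_171 = {1: 35.475, 2: 20.661, 4: 11.314, 8: 6.280, 10: 5.368,
            12: 4.560, 14: 4.083, 16: 3.780, 32: 3.358}


def slowdown_percent(new, old):
    """Positive means `new` is slower than `old`, in percent."""
    return (new / old - 1.0) * 100.0


slowdowns = {t: slowdown_percent(mold_180[t], mold_171[t]) for t in mold_180}
for threads, pct in slowdowns.items():
    print(f"{threads:>2} threads: {pct:+.1f}%")
```

With these figures, 1.8.0 is roughly 9% slower at 8 threads, and the gap narrows to about 3.5% at 32 threads.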

Side note: Not sure if this is expected, but in this benchmark lld is actually significantly faster than mold when using 1 thread, and runs at roughly the same speed as mold when using 2 threads. mold only becomes faster when using at least 4 threads:

Benchmark 1: lld (1 threads)
  Time (mean ± σ):     28.303 s ±  0.174 s    [User: 24.492 s, System: 3.966 s]
  Range (min … max):   28.103 s … 28.417 s    3 runs

Benchmark 2: lld (2 threads)
  Time (mean ± σ):     20.584 s ±  0.813 s    [User: 26.643 s, System: 7.346 s]
  Range (min … max):   19.681 s … 21.258 s    3 runs

Benchmark 3: lld (4 threads)
  Time (mean ± σ):     13.111 s ±  0.013 s    [User: 26.480 s, System: 5.690 s]
  Range (min … max):   13.098 s … 13.125 s    3 runs

Benchmark 4: lld (8 threads)
  Time (mean ± σ):     10.060 s ±  0.111 s    [User: 27.491 s, System: 6.177 s]
  Range (min … max):    9.964 s … 10.180 s    3 runs
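One way to read the side note is in terms of scaling: lld starts from a faster single-threaded time but gains less from extra threads. The sketch below computes the speedup over the 1-thread run for both linkers, using the means reported above; the helper name `speedup` is made up for illustration.

```python
# Mean link times in seconds, copied from the benchmark output above,
# keyed by thread count.
lld = {1: 28.303, 2: 20.584, 4: 13.111, 8: 10.060}
mold_180 = {1: 35.016, 2: 21.217, 4: 11.208, 8: 6.851}


def speedup(times):
    """Speedup of each run relative to the single-threaded run."""
    base = times[1]
    return {threads: base / t for threads, t in times.items()}


lld_speedup = speedup(lld)
mold_speedup = speedup(mold_180)
for threads in lld:
    print(f"{threads} threads: lld {lld_speedup[threads]:.2f}x, "
          f"mold {mold_speedup[threads]:.2f}x")
```

At 8 threads this gives about 2.8x for lld and about 5.1x for mold 1.8.0, which is consistent with mold overtaking lld from 4 threads onward.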

About this issue

State: closed
Created 2 years ago
Comments: 19 (9 by maintainers)

Commits related to this issue

Mitigate lock contention. This change speeds up the `create_output_sections` pass. https://github.com/rui314/mold/issues/937 (committed to rui314/mold by rui314 a year ago)
Implement another algorithm to optimize `create_output_sections`. https://github.com/rui314/mold/issues/937 (committed to rui314/mold by rui314 a year ago)

Most upvoted comments

Thank you for investigating! Let me investigate it on my machine as well. I’ll probably create a patch to try to fix the perf regression and share it with you before submitting, so that you can try to see if it’ll actually fix your problem.

rui314 on Jan 4, 2023

Thank you for testing! So the results show that we not only recovered the lost speed but actually gained a little bit. I like the second algorithm better than the first one, so I’ll keep the code as-is. Thanks again for reporting the issue and testing.

rui314 on Jan 11, 2023

With NUMA affinity (ca98fe843dc9076ef95d9b70afd64df9d2b8e963 is not benchmarked, for brevity):

8614fbbaff2d0c378fb7421f13452be195931cae:

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_mold180
  Time (mean ± σ):      4.522 s ±  0.078 s
  Range (min … max):    4.468 s …  4.705 s

ecd61a4fed8ab4bcc9737efd662631d06b07bbdf:

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_mold180
  Time (mean ± σ):      4.808 s ±  0.072 s
  Range (min … max):    4.752 s …  4.962 s

75e059af5b8ba9ce6d6f9c84f8346e71c37c1788:

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_mold180
  Time (mean ± σ):      4.635 s ±  0.038 s
  Range (min … max):    4.603 s …  4.729 s

7132822cc7c5b1aaa16b64e66b78d5fcc8f02563:

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_mold180
  Time (mean ± σ):      4.480 s ±  0.016 s
  Range (min … max):    4.457 s …  4.510 s

d2cdce44107f72736079d7fee53da4144b55febc:

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_mold180
  Time (mean ± σ):      4.479 s ±  0.037 s
  Range (min … max):    4.442 s …  4.556 s

Again, both 7132822cc7c5b1aaa16b64e66b78d5fcc8f02563 and d2cdce44107f72736079d7fee53da4144b55febc help, but this time there is no clear winner between the two. Given that d2cdce44107f72736079d7fee53da4144b55febc is the winner when NUMA affinity is not set (which is probably how most users will run mold), it would make sense to me to keep d2cdce44107f72736079d7fee53da4144b55febc.

Thank you for all your work.

moncefmechri on Jan 10, 2023
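The "no clear winner" reading of the NUMA-bound results above can be illustrated with a rough check of whether two means differ by more than their combined standard deviations. This is a crude heuristic chosen for illustration, not a proper statistical test and not something the thread itself applied. The commit hashes are abbreviated to 7 characters here, and the helper name `clearly_different` is made up.

```python
# (mean, sigma) in seconds from the NUMA-bound runs above. Commit hashes
# are abbreviated to 7 characters for readability.
numa_results = {
    "8614fbb": (4.522, 0.078),
    "ecd61a4": (4.808, 0.072),
    "75e059a": (4.635, 0.038),
    "7132822": (4.480, 0.016),
    "d2cdce4": (4.479, 0.037),
}


def clearly_different(a, b):
    """Crude heuristic: the means differ by more than the sum of both sigmas."""
    (mean_a, sigma_a), (mean_b, sigma_b) = a, b
    return abs(mean_a - mean_b) > sigma_a + sigma_b


# The two candidate fixes against each other.
print(clearly_different(numa_results["7132822"], numa_results["d2cdce4"]))
# The slowest measured commit against one of the fixes.
print(clearly_different(numa_results["ecd61a4"], numa_results["d2cdce4"]))
```

Under this heuristic the two candidate fixes are indistinguishable (their means differ by 0.001 s), while the slowest measured commit stands clearly apart from either of them.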