coreutils: uutils' factor is much slower than GNU's coreutils factor

After building uutils using cargo build --release on Fedora 32 x86-64:

[shlomif@localhost coreutils]$ time (seq 2 "$(( 10 ** 6 ))" | ./target/release/uutils factor | md5sum)
4cfd4f52505c4e3852c373b8b2e8a628  -
( seq 2 "$(( 10 ** 6 ))" | ./target/release/uutils factor | md5sum; )  48.35s user 4.08s system 108% cpu 48.240 total
[shlomif@localhost coreutils]$ time (seq 2 "$(( 10 ** 6 ))" | /usr/bin/factor | md5sum)
4cfd4f52505c4e3852c373b8b2e8a628  -
( seq 2 "$(( 10 ** 6 ))" | /usr/bin/factor | md5sum; )  0.25s user 0.10s system 160% cpu 0.221 total

A ~200× slowdown is very bad.

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 25 (11 by maintainers)

Most upvoted comments

@shlomif You are welcome 😃

I left the bug open through my performance-related PRs to factor, as it’s not yet as fast as the GNU implementation, but I’m hoping to get there… and beyond! 😈

I can confirm @shlomif’s results. For numbers less than 10⁴, it is about 2.4× slower; for 10⁵, about 3.5× slower; for 10⁶, about 4.5× slower; for 10⁷, about 4.3× slower; and for 10⁸, about 3.2× slower. For example, here is a benchmark of the system GNU factor (factor), a locally compiled GNU factor (./factor) and uutils’ factor (./uu-factor):

Benchmark #1: seq 0 1000000 | factor
  Time (x̅ mean ± σ std dev):      0.3190s ±  0.0051s          [User: 0.2982s, System: 0.0504s]
  Range (min … x̃ median … max):   0.311s …  0.322s …  0.324s   CPU: 109.3%, 5 runs

Benchmark #2: seq 0 1000000 | ./factor
  Time (x̅ mean ± σ std dev):      0.3080s ±  0.0088s          [User: 0.2640s, System: 0.0706s]
  Range (min … x̃ median … max):   0.295s …  0.308s …  0.318s   CPU: 108.6%, 5 runs

Benchmark #3: seq 0 1000000 | ./uu-factor
  Time (x̅ mean ± σ std dev):      1.3884s ±  0.0332s          [User: 1.3698s, System: 0.0276s]
  Range (min … x̃ median … max):   1.347s …  1.383s …  1.444s   CPU: 100.6%, 5 runs

Summary
  #2 ‘seq 0 1000000 | ./factor’ ran
    1.036 ± 0.034 times (103.6%) faster than #1 ‘seq 0 1000000 | factor’
    4.508 ± 0.168 times (450.8%) faster than #3 ‘seq 0 1000000 | ./uu-factor’

(This is similar to @rivy’s benchmarks, but using my Bash port of hyperfine, which outputs more info.)

However, if you go backwards 10⁶ from 2⁶⁴, it is 5.8× slower:

Benchmark #1: seq 18446744073708551615 18446744073709551615 | factor
  Time (x̅ mean ± σ std dev):     58.2510s ±  0.2072s          [User: 58.1622s, System: 2.4806s]
  Range (min … x̃ median … max):  58.040s … 58.163s … 58.525s   CPU: 104.1%, 5 runs

Benchmark #2: seq 18446744073708551615 18446744073709551615 | ./factor
  Time (x̅ mean ± σ std dev):     58.0552s ±  0.1607s          [User: 58.0212s, System: 2.5270s]
  Range (min … x̃ median … max):  57.800s … 58.063s … 58.307s   CPU: 104.3%, 5 runs

Benchmark #3: seq 18446744073708551615 18446744073709551615 | ./uu-factor
  Time (x̅ mean ± σ std dev):     336.7276s ±  0.4459s          [User: 336.7042s, System: 0.3726s]
  Range (min … x̃ median … max):  335.909s … 336.824s … 337.240s   CPU: 100.1%, 5 runs

Summary
  #2 ‘seq 18446744073708551615 18446744073709551615 | ./factor’ ran
    1.003 ± 0.005 times (100.3%) faster than #1 ‘seq 18446744073708551615 18446744073709551615 | factor’
    5.800 ± 0.018 times (580.0%) faster than #3 ‘seq 18446744073708551615 18446744073709551615 | ./uu-factor’

Since the gap widens for larger numbers, the ratio will likely be even worse once #1559 is fixed. While the performance has definitely improved a lot, I think this issue should be left open at least until uutils is no more than about two times slower in all cases, particularly once arbitrary-precision numbers are supported.

FYI, if you want to close the ~4× performance gap, doing so is fairly easy here, as https://en.algorithmica.org/hpc/algorithms/factorization/ gives a good overview of fast factoring of relatively small numbers in Rust. The key thing missing here is the Pollard-Brent algorithm, which lets you compute the GCD much less often and should yield roughly a 5× speedup. (You could also improve the GCD computation itself, but that is a smaller detail.)
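For reference, here is a minimal self-contained sketch of what that looks like: Pollard's rho with Brent's batched-GCD improvement, paired with a binary (Stein) GCD. All names (`mul_mod`, `pollard_brent`, `BATCH`) are illustrative, not from the uutils codebase, and a real implementation would first strip small factors by trial division and check primality:

```rust
/// (a * b) % m without overflow, via a 128-bit intermediate.
fn mul_mod(a: u64, b: u64, m: u64) -> u64 {
    ((a as u128 * b as u128) % m as u128) as u64
}

/// Binary GCD (Stein's algorithm): the "improved gcd" mentioned above;
/// replaces divisions with shifts and subtractions.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    if a == 0 { return b; }
    if b == 0 { return a; }
    let shift = (a | b).trailing_zeros();
    a >>= a.trailing_zeros();
    loop {
        b >>= b.trailing_zeros();
        if a > b { std::mem::swap(&mut a, &mut b); }
        b -= a;
        if b == 0 { return a << shift; }
    }
}

/// Find a non-trivial factor of an odd composite `n` with Pollard's rho
/// plus Brent's improvement: |x - y| products are accumulated in `q` and a
/// single GCD is taken per batch, so `gcd` runs far less often than in the
/// classic algorithm. May return `n` for an unlucky `c`; callers then
/// retry with a different `c`.
fn pollard_brent(n: u64, c: u64) -> u64 {
    let f = |x: u64| ((mul_mod(x, x, n) as u128 + c as u128) % n as u128) as u64;
    const BATCH: usize = 128;
    let (mut y, mut q, mut g) = (2u64, 1u64, 1u64);
    let (mut x, mut ys) = (y, y);
    let mut r = 1usize;
    while g == 1 {
        x = y;
        for _ in 0..r {
            y = f(y);
        }
        let mut k = 0;
        while k < r && g == 1 {
            ys = y;
            for _ in 0..BATCH.min(r - k) {
                y = f(y);
                q = mul_mod(q, x.abs_diff(y), n);
            }
            g = gcd(q, n);
            k += BATCH;
        }
        r *= 2;
    }
    if g == n {
        // The batch jumped past the factor; back up and step one at a time.
        loop {
            ys = f(ys);
            g = gcd(x.abs_diff(ys), n);
            if g > 1 {
                break;
            }
        }
    }
    g
}

fn main() {
    let n: u64 = 1_000_003 * 1_000_033; // odd semiprime
    let mut d = pollard_brent(n, 1);
    let mut c = 2u64;
    while d == n {
        // Unlucky polynomial; retry with a different constant.
        d = pollard_brent(n, c);
        c += 1;
    }
    assert!(d > 1 && d < n && n % d == 0);
    println!("{} = {} * {}", n, d, n / d);
}
```

The point of the batching constant is that each `gcd` call is amortized over `BATCH` cheap modular multiplications, which is where the "compute the GCD much less often" speedup comes from.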

For performance on numbers larger than 64 bits, you would probably need ECM (the elliptic-curve method) for top-of-the-line performance, but that gets complicated fairly quickly.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I’m happy to leave this as an open issue to encourage more refinement, with a plan to revisit next year.

PRs are welcome!