numba: Numba Sort is Significantly Slower Than NumPy Sort

We are leveraging Numba in the STUMPY package, and @mexxexx and I are seeing a significant performance difference when sorting a NumPy array. The original profiling can be found here, but I'll paste the results below:

I did some profiling of the sorting step (Jupyter notebook in Firefox, Intel® Core™ i5-6500 CPU @ 3.20GHz, 4 threads):

import numpy as np
from numba import njit, prange

def sort_numpy(T, d, k):
    return np.sort(T, axis=0)

@njit(parallel=True, fastmath=True)
def sort_numba(T, d, k):
    for i in prange(k):
        # T[:, i] = np.sort(T[:, i])  # no assignment, as we don't want to actually sort T
        np.sort(T[:, i])

# d and k are set per test case below, e.g. d, k = 5, 10000
T = np.random.uniform(-100, 100, (d, k))

I precompiled sort_numba and got the following results:
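
(By "precompiled" I mean calling the function once before timing, so the JIT compilation cost is excluded from the measurements:)

# warm-up call: triggers JIT compilation so %timeit measures execution only
sort_numba(T, d, k)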

d=5, k=10000:

>>> %timeit -n 1000 sort_numpy(T, d, k)
295 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -n 100 sort_numba(T, d, k)
6.69 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

d=5, k=100000:

>>> %timeit -n 1000 sort_numpy(T, d, k)
3.08 ms ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -n 100 sort_numba(T, d, k)
66.2 ms ± 82.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

d=50, k=10000:

>>> %timeit -n 1000 sort_numpy(T, d, k)
11.2 ms ± 81.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -n 1000 sort_numba(T, d, k)
6.58 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So sorting with NumPy is over an order of magnitude faster for a small number of dimensions and long time series, the case that we usually have.

Is this expected or are we doing something wrong here?
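
One thing we haven't benchmarked yet, but that might help narrow it down (our assumption: T[:, i] is a strided view of the C-contiguous T, and Numba may handle strided slices differently than NumPy does), is sorting a contiguous copy of each column:

@njit(parallel=True, fastmath=True)
def sort_numba_contig(T, d, k):
    for i in prange(k):
        # copy the strided column into a contiguous buffer before sorting,
        # to separate slice overhead from the sort itself (sketch, untested)
        col = np.ascontiguousarray(T[:, i])
        np.sort(col)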

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 17 (13 by maintainers)

Most upvoted comments

I can repro this too:

 💣 zsh» python tmp.py
The mean time for sort_numpy was 488.8987541198731 μs and the standard deviation was 43.920950070242135 μs.
OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
The mean time for sort_numba was 652.3394584655762 μs and the standard deviation was 44.03392231405237 μs.
The mean time for sort_numpy was 5185.270309448242 μs and the standard deviation was 628.184829390657 μs.
The mean time for sort_numba was 7743.899822235107 μs and the standard deviation was 457.2864641283301 μs.
The mean time for sort_numpy was 9848.854541778564 μs and the standard deviation was 871.6013331967821 μs.
The mean time for sort_numba was 4342.429637908936 μs and the standard deviation was 781.7597993442256 μs.
python tmp.py  24.66s user 0.57s system 382% cpu 6.601 total

Machine details:

[screenshot of machine details, 2022-05-16]

TL;DR I can reproduce the results

Script: tmp.py.zip

Results:

The mean time for sort_numpy was 493.91984939575195 μs and the standard deviation was 11.842128855966518 μs.
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
The mean time for sort_numba was 577.0373344421387 μs and the standard deviation was 92.17707428796716 μs.
The mean time for sort_numpy was 5011.153221130371 μs and the standard deviation was 135.17833508354994 μs.
The mean time for sort_numba was 7179.431915283203 μs and the standard deviation was 421.8190407178981 μs.
The mean time for sort_numpy was 10645.644664764404 μs and the standard deviation was 242.4166365658219 μs.
The mean time for sort_numba was 5054.254531860352 μs and the standard deviation was 707.8448591388241 μs.

This is on an 8-core MacBook Pro 14", 2021.
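
The attached script isn't reproduced inline; a rough reconstruction consistent with the printed output (the timing harness and repeat counts are guesses, not the actual tmp.py) would be:

import time
import statistics

import numpy as np
from numba import njit, prange

def sort_numpy(T, d, k):
    return np.sort(T, axis=0)

@njit(parallel=True, fastmath=True)
def sort_numba(T, d, k):
    for i in prange(k):
        np.sort(T[:, i])

def bench(name, func, T, d, k, repeats=100):
    # time `repeats` calls and report mean/stddev in microseconds
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(T, d, k)
        times.append((time.perf_counter() - start) * 1e6)
    print(f"The mean time for {name} was {statistics.mean(times)} μs "
          f"and the standard deviation was {statistics.stdev(times)} μs.")

for d, k in [(5, 10000), (5, 100000), (50, 10000)]:
    T = np.random.uniform(-100, 100, (d, k))
    sort_numba(T, d, k)  # warm up: compile outside the timed region
    bench("sort_numpy", sort_numpy, T, d, k)
    bench("sort_numba", sort_numba, T, d, k)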

I can look into it, but the difference certainly shouldn't be an order of magnitude. My proposal was to replace merge sort with radix sort, as in NumPy, which can be faster in some cases but not all. That shouldn't affect your case, though, since you don't use merge sort (it's only used if you pass a kwarg).
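
(For reference, the merge sort path referred to above is presumably only reached via the kind keyword, e.g. on argsort; this is my assumption about which call is meant:)

from numba import njit
import numpy as np

@njit
def argsort_merge(a):
    # kind="mergesort" requests Numba's merge sort implementation;
    # the plain np.sort calls above don't pass it, so they take the default path
    return np.argsort(a, kind="mergesort")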

So I’m not entirely sure what is happening here.