numba: Numba Sort is Significantly Slower Than NumPy Sort
We are leveraging Numba in the STUMPY package and @mexxexx and I are seeing a significant difference in performance when sorting a NumPy array. The original profiling can be found here but I’ll paste the results below:
I did some profiling regarding the sorting aspect (jupyter notebook in firefox, Intel® Core™ i5-6500 CPU @ 3.20GHz, 4 threads):
def sort_numpy(T, d, k):
return np.sort(T, axis=0)
@njit(parallel=True, fastmath=True)
def sort_numba(T, d, k):
for i in prange(k):
# T[:,i] = np.sort(T[:, i]) ## no assignment as we don't want to actually sort T
np.sort(T[:, i])
T = np.random.uniform(-100, 100, (d, k))
I precompiled
sort_numba. and got the following results:
d=5, k=10000:
>>> %timeit -n 1000 sort_numpy(T, d, k)
295 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -n 100 sort_numba(T, d, k)
6.69 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
d=5, k=100000:
>>> %timeit -n 1000 sort_numpy(T, d, k)
3.08 ms ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -n 100 sort_numba(T, d, k)
66.2 ms ± 82.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
d=50, k=10000:
>>> %timeit -n 1000 sort_numpy(T, d, k)
11.2 ms ± 81.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -n 1000 sort_numba(T, d, k)
6.58 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So sorting with numpy is orders of magnitude faster for a small number of dimensions and long time serieses, the case that we usually have.
Is this expected or are we doing something wrong here?
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 17 (13 by maintainers)
@hameerabbasi is hoping to work on this for 0.53, xref: https://github.com/numba/numba/wiki/Minutes_2020_10_20
I can repro this too:
Machine details:
TL;DR I can reproduce the results
Script: tmp.py.zip Results:
This is on an 8-core MacBook Pro 14", 2021.
I can look into it, but the difference certainly shouldn’t be an order of magnitude. My proposal was to replace merge sort by radix sort like in NumPy, which can be faster in some cases but not all. It shouldn’t have an effect on your case as you don’t use merge sort (it’s only used if you pass a kwarg).
So I’m not entirely sure what is happening here.