numba: Numba compiled functions releasing the GIL often fail to scale in parallel when Numpy and/or lists are involved

Reporting a bug

  • [x] I have tried using the latest released version of Numba (the most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
  • [x] I have included a self-contained code sample to reproduce the problem, i.e. it's possible to run it as `python bug.py`.

General description

I have now spent the better part of a week gaining insight into issues that I eventually classified as bugs. I was new to Numba a week ago, but that has changed quite a bit 😉

Unfortunately, the issue is convoluted, so I include a program that is best explored interactively. I could shorten it, but that risks a symptom being fixed rather than the root cause.

To summarize:

  • I successfully Numba-compiled Numpy code with a >50x performance improvement, which is nice. This was necessary because neither Numpy nor Scikit-learn fully supports my problem. However, 50x is not enough; I need to use all 8 cores.
  • Almost all attempts to parallelize failed. The sample code bug.py, with the default parameters given at the start of the file, demonstrates this: it takes 2x the elapsed time despite 100% CPU load on all 8 cores, i.e., a 16x overhead!
  • Some parameter settings in my sample code demonstrate that 1/8x elapsed time is achievable. Varying the parameters shows that this is not a problem with the code I provide but with the interplay of Numba, Numpy, and lists.
  • There are no shared-variable problems or the like: the problem is embarrassingly parallel, and I provide a variant without prange() to disentangle matters.
  • The code is 172 LOC and self-contained. Sorry it takes that much to make the point.
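Since bug.py is only attached, not inlined, here is a minimal sketch of the no-prange setup described above. The kernel `row_sums` and all names are hypothetical stand-ins, not the author's code; the point is only the mechanism: a `nogil=True` compiled function driven by plain Python threads over independent chunks.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

try:
    from numba import njit
except ImportError:  # fall back to plain Python if Numba is absent
    def njit(*a, **k): return a[0] if a and callable(a[0]) else (lambda f: f)

# Hypothetical stand-in kernel: with nogil=True the compiled function
# releases the GIL, so ordinary threads can in principle run it
# concurrently on disjoint chunks of the input.
@njit(nogil=True)
def row_sums(x):
    out = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        out[i] = x[i].sum()
    return out

x = np.random.rand(8000, 64)
chunks = np.array_split(x, 8)          # one chunk per worker thread
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(row_sums, chunks))
result = np.concatenate(parts)
assert np.allclose(result, x.sum(axis=1))
```

Because the GIL is released inside the kernel, the thread-pool workers should be able to execute concurrently; the report here is that kernels which build dynamically sized results do not actually scale this way.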

A more detailed description follows; it assumes you have familiarized yourself with the code.

Detailed description and runtime results:

Base line

nthreads = 8
cpu_load = 1
test_nojit = True
threshold_method = 0
missing_method = 0

[bug.py.txt](https://github.com/numba/numba/files/10325071/bug.py.txt)
[bug.py.txt](https://github.com/numba/numba/files/10325076/bug.py.txt)

python bug.py prints:

nojit-serial time elapsed:       4,771.639 ms
jit-serial time elapsed:          54.956 ms
jit-parallel time elapsed:         117.336 ms

My machine has an Apple M1 Max CPU, Python 3.10.2, Numba 0.56.4.

Best true performance

nthreads = 3
cpu_load = 1
test_nojit = True
threshold_method = 4
missing_method = 1

nojit-serial time elapsed:      15,526.656 ms
jit-serial time elapsed:          34.193 ms
jit-parallel time elapsed:          20.556 ms

I’ll drop the nojit results from here on because they are so slow (test_nojit = False). 20 ms is the best I have achieved so far (200x over Numpy). But the parallel runtime is still poor and worsens if I increase nthreads.

Best fake performance

nthreads = 8
cpu_load = 100
test_nojit = False
threshold_method = -1
missing_method = 1

jit-serial time elapsed:         863.231 ms
jit-parallel time elapsed:         118.428 ms

The fake-performance test (threshold_method = -1, cpu_load = 100) factors part of the work out of the iteration loop and demonstrates that the remainder can scale almost perfectly (7.3x). I have run other tests to make sure that this is not due to limited memory bandwidth.
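As an illustration of the case that does scale (again a hypothetical kernel, not code from bug.py): a prange loop whose per-iteration output has a known, fixed size never has to materialize a dynamically sized list inside the parallel region, which is the regime the fake-performance test isolates.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fall back to plain Python if Numba is absent
    prange = range
    def njit(*a, **k): return a[0] if a and callable(a[0]) else (lambda f: f)

# Hypothetical kernel: each prange iteration writes exactly one scalar
# into a preallocated output, so no dynamic allocation happens inside
# the parallel region.
@njit(parallel=True)
def count_above(x, t):
    out = np.zeros(x.shape[0], dtype=np.int64)
    for i in prange(x.shape[0]):
        c = 0
        for j in range(x.shape[1]):
            if x[i, j] > t:
                c += 1
        out[i] = c
    return out

x = np.random.rand(256, 512)
assert count_above(x, 0.5).sum() == int((x > 0.5).sum())
```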

Plain performance

nthreads = 8
cpu_load = 100
test_nojit = False
threshold_method = 2
missing_method = 0

jit-serial time elapsed:       2,629.056 ms
jit-parallel time elapsed:       3,595.593 ms

and:

cpu_load = 1

jit-serial time elapsed:          42.678 ms
jit-parallel time elapsed:          38.864 ms

The last result is similar to the best result and demonstrates that lists are slower. But both lists and ndarrays – while fast in serial mode – can produce large parallel overhead, preventing code from benefiting from parallel execution.

Conclusion

This is a bug. A cautious choice of parameters can prevent it, sometimes. But whenever a list of dynamic length must be constructed, whether via Numpy or via [].append(), the generated code cannot run efficiently in parallel: it is still fast serially, but it maxes out all processors while wall time may even increase.
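The contrast described here can be sketched as follows. The helpers `select_append` and `select_prealloc` are hypothetical, not code from bug.py: the first builds a dynamically sized list with append(), the pattern reported to kill parallel scaling, while the second shows a common workaround of overallocating an ndarray and trimming it to size.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fall back to plain Python if Numba is absent
    def njit(*a, **k): return a[0] if a and callable(a[0]) else (lambda f: f)

# The problematic pattern: a result of unknown length grown by append().
@njit
def select_append(x, t):
    out = []
    for v in x:
        if v > t:
            out.append(v)  # dynamic growth inside compiled code
    return out

# A common workaround: overallocate a fixed-size array, count, then trim.
@njit
def select_prealloc(x, t):
    out = np.empty(x.size, dtype=np.float64)
    n = 0
    for v in x:
        if v > t:
            out[n] = v
            n += 1
    return out[:n]

x = np.array([0.1, 0.9, 0.5, 0.7])
assert list(select_append(x, 0.6)) == [0.9, 0.7]
assert select_prealloc(x, 0.6).tolist() == [0.9, 0.7]
```

Whether the preallocation workaround restores parallel scaling in the full bug.py setting is exactly what the benchmark numbers above probe.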

This seriously limits the usefulness of Numba over, e.g., another LLVM-backed language.

Please, comment and possibly fix.

Related bugs

A few bugs are related:

  • #2408 (open but different issue)
  • #2699 (closed but maybe a symptom of something bigger)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 22 (7 by maintainers)

Most upvoted comments

Yes of course, apologies; I didn’t want to open duplicate issues unnecessarily. I’ll put it there now.

This issue shall remain open until the test program provided above runs with no performance issues. I’ll rerun it as soon as the next stable Numba release becomes available.