numba: Structured arrays 10 times slower than two dimensional arrays in nopython mode

Hello,

Using structured arrays instead of class properties, as suggested by Graham Markall ( https://groups.google.com/a/continuum.io/forum/#!topic/numba-users/Mqi9SXzqdTg ), is about 10 times slower than using two-dimensional arrays.

Implemented code example with two dimensional arrays: https://gist.github.com/ufechner7/d4fd1b75af0f1e5cd48c

Benchmark code: https://gist.github.com/ufechner7/95db14f734edd51dcd9b

Benchmark results:

time for numba sub_plain     [µs]:     0.33
time for numba sub_rec_array [µs]:     1.97
time for numba sub_array     [µs]:     0.19

This prevents me from using record arrays, even though the code would be shorter and cleaner if they could be used.

Code using record arrays:

import numpy as np

x_dt = np.dtype([('v_wind', np.float64),
                 ('v_wind_gnd', np.float64),
                 ('result', np.float64)])
buf = np.zeros((3, 3))                      # zeroed buffer backing the record array
vec3 = np.recarray(3, dtype=x_dt, buf=buf)  # three records of three float64 fields

sub(vec3.v_wind, vec3.v_wind_gnd, vec3.result)

Code using two dimensional arrays:

V_wind = 0 # (westwind, downwind direction to the east)
V_wind_gnd = 1
Result = 2 
vec3 = np.zeros((3, 3))

sub(vec3[V_wind], vec3[V_wind_gnd], vec3[Result]) 
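The issue does not show the body of `sub` itself; a minimal sketch of what such a jitted function might look like (the element-wise subtraction is an assumption, and the fallback decorator is only there so the snippet runs without Numba installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:          # fall back to plain Python if Numba is unavailable
    def njit(f):
        return f

# Hypothetical body for the sub() function used in the snippets above:
# an element-wise subtraction writing into a preallocated output array.
@njit
def sub(v_wind, v_wind_gnd, result):
    for i in range(result.shape[0]):
        result[i] = v_wind[i] - v_wind_gnd[i]

vec3 = np.zeros((3, 3))
vec3[0] = 10.0   # v_wind
vec3[1] = 4.0    # v_wind_gnd
sub(vec3[0], vec3[1], vec3[2])
print(vec3[2])   # [6. 6. 6.]
```

The same function can be called with the record-array columns (`vec3.v_wind`, etc.) or with rows of a plain 2-D array, which is what the benchmark compares.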

It would be nice if this could be fixed.

About this issue

  • Original URL
  • State: closed
  • Created 9 years ago
  • Comments: 21 (11 by maintainers)

Most upvoted comments

It’s been suggested to me that there is a lack of documentation on what Numba actually does when it invokes a jitted function, so I hope I can explain a bit more here to clarify things:

When you make a call to a function that has been decorated with the jit decorator, Numba has to figure out what the types of all the arguments are, in order to either compile the jitted function for those arguments (because every time a new set of argument types is used, Numba must compile a specialised version of the function), or to retrieve a cached compiled version of the function for those specific argument types.

In the case of scalar types and arrays of scalar types, this is fairly simple - each dtype has a unique integer that identifies it, so comparing argument types is just a case of comparing integers, and this is very fast.

For structured types, it is more complicated, since every field needs to be compared. In order to do this, we have to call a function which computes the “data type descriptor” of the structured type (i.e. http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#c.PyArray_Descr). This takes much longer than just looking up the integer corresponding to a type, and the result also takes longer to compare with the data type descriptors of the already-compiled versions of the function.
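The difference can be illustrated with plain NumPy (a sketch, not Numba internals): scalar dtypes carry a unique type number, while all structured dtypes share the generic "void" type number, so distinguishing them requires a full field-by-field comparison.

```python
import numpy as np

# Scalar dtypes carry a unique small integer (dtype.num), so two scalar
# dtypes can be compared by comparing those integers.
f8 = np.dtype(np.float64)
print(f8.num)

# Structured dtypes all share the generic "void" type number, so a cheap
# integer comparison cannot tell them apart ...
x_dt = np.dtype([('v_wind', np.float64),
                 ('v_wind_gnd', np.float64),
                 ('result', np.float64)])
y_dt = np.dtype([('a', np.float64)])
print(x_dt.num == y_dt.num)   # True, although the dtypes differ

# ... so a full comparison must walk the field descriptors instead.
print(x_dt == y_dt)           # False
```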

This lookup of argument types and comparisons happens within the Numba dispatcher, which is also responsible for marshalling any arguments into their native types before passing them to the compiled code.

What you are seeing with the benchmark above is that the time taken to call the function increases by about 2 µs when a structured type is involved, because of all this extra work going on in the dispatcher. The generated code itself has not slowed down, but its runtime is a very small part of the total, because it is much shorter than the time spent in the dispatcher.

In summary:

  • The time taken to call an empty jitted function is greater than the time taken to call an empty Python function, due to the work that the dispatcher does - this is the dispatch overhead.
  • The dispatch overhead is larger when structured types are involved.
  • The execution of the benchmark above is dominated by the dispatch overhead, because the sub function’s body has a very short execution time.
  • Your complete example would be expected to be much less dominated by the dispatch overhead, because there is more computation inside the jitted functions; I’d therefore not expect you to observe the same ratio between array and structured-type execution times in that code.
  • To put it another way: the benchmark provided is unlikely to characterise the performance of your full example, because it does much less work inside the jitted function than the full example appears to.