arrow: [C++][Compute] Checked arithmetic functions are slow-ish

Describe the enhancement requested

Users should generally prefer using the checked versions of arithmetic functions such as “add”, because unchecked arithmetic is inherently unsafe and can silently corrupt data (MySQL is commonly hated for this). However, our checked arithmetic functions are quite slower than unchecked:

For example, “add” vs. “add_checked”:

$ python -m timeit -s "import pyarrow as pa, pyarrow.compute as pc; a = pa.array([42]*1_000_000, type='int32')" "pc.add(a, a)"
2000 loops, best of 5: 128 usec per loop
$ python -m timeit -s "import pyarrow as pa, pyarrow.compute as pc; a = pa.array([42]*1_000_000, type='int32')" "pc.add_checked(a, a)"
200 loops, best of 5: 1.92 msec per loop

… that’s 7.8 Gitems/sec vs. 0.5 Gitems/sec., or a 15x difference.

“sum” vs. “sum_checked” shows a similar problem in PR #37536.

Implementation-wise, “add_checked” is currently simple and naïve: it adds and detects overflow on each pair of values, one by one. This adds many conditional branches and probably stifles parallelism on the CPU, as well as auto-vectorization on the compiler.

A potentially better implementation would be to operate on K-sized batches (with a K such as 256), in two steps:

  1. detect potential overflow over all K pairs of values
  2. if no overflow was detected, add all K pairs or values

Each step is then unconditional and a reasonable candidate for auto-vectorization.

Of course, we can also consider that 0.5 Gitems/sec is good enough and unlikely to be a bottleneck for non-trivial workloads.

Component(s)

C++

About this issue

  • Original URL
  • State: open
  • Created 10 months ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

Add doesn’t check nulls and simply computes all pairs, null or not, while AddChecked only computes the valid pairs. (Code is here, it uses MakeArithmeticFunctionNotNull while Add uses MakeArithmeticFunction). I guess it was assumed that __builtin_*_overflow is an expensive operation and should be avoided as much as possible. However, I ran some benchmarks and the results don’t agree with this assumption. With null checking as it currently is, AddChecked is 15~40x slower and the benefits of null checks only happen when >95% values are null, which I’d say is a rare case. image Without null checking, i.e. also using MakeArithmeticFunction for AddChecked, it is only 2.5 times slower, no matter the null proportion of the input: image

This explains the difference between @pitrou and @cyb70289’s results. This is run on my Mac M1 so it maybe different on x86 machines, but the difference is clear enough.

I wrote a simple kernel exec that checks the nullity only when overflow occurs:

This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.