arrow: [C++][Compute] Checked arithmetic functions are slow-ish
Describe the enhancement requested
Users should generally prefer using the checked versions of arithmetic functions such as “add”, because unchecked arithmetic is inherently unsafe and can silently corrupt data (MySQL is commonly hated for this). However, our checked arithmetic functions are quite slower than unchecked:
For example, “add” vs. “add_checked”:
$ python -m timeit -s "import pyarrow as pa, pyarrow.compute as pc; a = pa.array([42]*1_000_000, type='int32')" "pc.add(a, a)"
2000 loops, best of 5: 128 usec per loop
$ python -m timeit -s "import pyarrow as pa, pyarrow.compute as pc; a = pa.array([42]*1_000_000, type='int32')" "pc.add_checked(a, a)"
200 loops, best of 5: 1.92 msec per loop
… that’s 7.8 Gitems/sec vs. 0.5 Gitems/sec., or a 15x difference.
“sum” vs. “sum_checked” shows a similar problem in PR #37536.
Implementation-wise, “add_checked” is currently simple and naïve: it adds and detects overflow on each pair of values, one by one. This adds many conditional branches and probably stifles parallelism on the CPU, as well as auto-vectorization on the compiler.
A potentially better implementation would be to operate on K-sized batches (with a K such as 256), in two steps:
- detect potential overflow over all K pairs of values
- if no overflow was detected, add all K pairs or values
Each step is then unconditional and a reasonable candidate for auto-vectorization.
Of course, we can also consider that 0.5 Gitems/sec is good enough and unlikely to be a bottleneck for non-trivial workloads.
Component(s)
C++
About this issue
- Original URL
- State: open
- Created 10 months ago
- Comments: 15 (11 by maintainers)
Add
doesn’t check nulls and simply computes all pairs, null or not, whileAddChecked
only computes the valid pairs. (Code is here, it uses MakeArithmeticFunctionNotNull whileAdd
uses MakeArithmeticFunction). I guess it was assumed that__builtin_*_overflow
is an expensive operation and should be avoided as much as possible. However, I ran some benchmarks and the results don’t agree with this assumption. With null checking as it currently is,AddChecked
is 15~40x slower and the benefits of null checks only happen when >95% values are null, which I’d say is a rare case.MakeArithmeticFunction
forAddChecked
, it is only 2.5 times slower, no matter the null proportion of the input:This explains the difference between @pitrou and @cyb70289’s results. This is run on my Mac M1 so it maybe different on x86 machines, but the difference is clear enough.
This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.