datafusion: Optimize Accumulator `size` function performance (fix regression on clickbench)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
During regression benchmarks it was found that DISTINCT queries has a performance drop. The analysis showed the size
function implementation for DistinctCountAccumulator
is inefficient.
Describe the solution you’d like Need to improve size function or the way how number of bytes is collected
Describe alternatives you’ve considered None
Additional context Analysis details https://github.com/apache/arrow-datafusion/issues/5313 Original benchmarks ticket https://github.com/apache/arrow-datafusion/issues/5276
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (16 by maintainers)
Hi @yjshen
Here was my understanding of what you were proposing, which shows the diamond I am referring to.
I may be misunderstanding your proposal;
@alamb I’ll take this ticket if not other volunteers, also I want to play a bit if we need super accurate size, probably we can do approx size which will serve to get the structure size with minor inaccuracy but will be faster