datafusion: Optimize Accumulator `size` function performance (fix regression on clickbench)

Is your feature request related to a problem or challenge? Please describe what you are trying to do. During regression benchmarks it was found that DISTINCT queries has a performance drop. The analysis showed the size function implementation for DistinctCountAccumulator is inefficient.

Describe the solution you’d like Need to improve size function or the way how number of bytes is collected

Describe alternatives you’ve considered None

Additional context Analysis details https://github.com/apache/arrow-datafusion/issues/5313 Original benchmarks ticket https://github.com/apache/arrow-datafusion/issues/5276

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Hi @yjshen

Here was my understanding of what you were proposing, which shows the diamond I am referring to.

I may be misunderstanding your proposal;

SELECT date, 
  COUNT(DISTINCT x), 
  COUNT(DISITNCT y)
FROM
  t;
                      ┌─────────────────┐                         
                      │ Combine somehow │   Maybe Join? Could     
                      │    (on date)    │   also be some more     
                      │                 │   optimized version     
                      └─────────────────┘                         
                               ▲                                  
                               │                                  
                               │                                  
            ┌──────────────────┴─────────────────────┐            
            │                                        │            
            │                                        │            
            │                                        │            
┌───────────────────────┐                ┌───────────────────────┐
│     HashAggregate     │                │     HashAggregate     │
│       gby: date       │                │       gby: date       │
│     agg: COUNT(x)     │                │     agg: COUNT(y)     │
└───────────────────────┘                └───────────────────────┘
            ▲                                        ▲            
            │                                        │            
            │                                        │            
┌───────────────────────┐                ┌───────────────────────┐
│     HashAggregate     │                │     HashAggregate     │
│     gby: date, x      │                │     gby: date, y      │
│      agg: <NONE>      │                │      agg: <NONE>      │
└───────────────────────┘                └───────────────────────┘
            ▲                                        ▲            
            │                                        │            
            │                                        │            
            └───────────────────┬────────────────────┘            
                                │                                 
                                │                                 
                                │                                 
                    ┌───────────────────────┐                     
                    │         Scan          │                     
                    └───────────────────────┘                     

@alamb I’ll take this ticket if not other volunteers, also I want to play a bit if we need super accurate size, probably we can do approx size which will serve to get the structure size with minor inaccuracy but will be faster