duckdb: Window functions are very slow

Window functions are very slow: about 5 seconds for 1M rows and a simple query.

import time
import random
import duckdb

# Build a 1M-row table of random integers in an in-memory database.
db = duckdb.connect(':memory:')
db.execute('create table data(id integer, value integer, primary key(id))')

values = []
for i in range(1_000_000):
    values.append((i, random.randint(0, 100000)))
db.executemany('insert into data(id, value) VALUES (?, ?)', values)
db.commit()

# Time a trivial window query over the table.
start = time.time()
db.execute('select value, row_number() over(order by id) from data limit 1').fetchall()
duration = time.time() - start
print(f'{duration:.3f} s')

5.367 s

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (20 by maintainers)

Most upvoted comments

I’d be interested in looking at this. I’ve worked on the original Hyper Window operator and reimplemented it for the Tableau Data Engine. Do you have a list of tasks?

This brings up a larger issue around the current design of the physical window operator. If you look at Cao et al., it really suggests that logical window operator ordering is determined by the pair (WPK, WOK) (window partitioning key, window ordering key), and that multiple “functions” sharing this pair should be optimised as a single unit. If the physical window operator had a single pair like this, then there are multiple opportunities for performance improvement orthogonal to the Value-based sorting improvements proposed. In particular, a large amount of pre-processing can be performed on the input threads:

  • The sort keys, payloads, frame boundaries and lead/lag parameters can be computed in parallel in Sink
  • The partitioning keys can then be used immediately to hash these chunks into separate bins on each thread
  • Combine then merges/appends the chunks in the bins
  • Finalize can then spawn threads to independently perform the current operations, including sorting as in Leis et al. Work stealing can also be implemented using a task queue.

This approach also implies that parallelising the sort code may not be the most effective way to improve sorting performance unless there is only a single bin. At the very least, the sort operation might take a flag indicating whether parallel sorting is appropriate.
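
For illustration, here is a rough Python sketch of that Sink/Combine/Finalize flow for a single (WPK, WOK) pair. The row layout, the function names, the bin and thread counts, and the plain-Python hashing and sorting are all illustrative stand-ins for DuckDB's vectorised C++ operator interface, not its actual API:

import queue
import threading
from collections import defaultdict

N_BINS = 8      # hypothetical number of partition hash bins
N_THREADS = 4   # hypothetical Finalize worker count

def sink(rows):
    # Sink (per input thread): compute the keys for each row and hash it
    # into a bin by its partitioning key.
    bins = defaultdict(list)
    for row in rows:
        wpk, wok = row['part'], row['order']  # precomputed sort keys
        bins[hash(wpk) % N_BINS].append((wpk, wok, row))
    return bins

def combine(per_thread_bins):
    # Combine: append each thread's chunks into the matching global bin.
    global_bins = defaultdict(list)
    for bins in per_thread_bins:
        for b, chunk in bins.items():
            global_bins[b].extend(chunk)
    return global_bins

def finalize(global_bins):
    # Finalize: a task queue of bins; idle workers steal the next bin,
    # sort it by (WPK, WOK), and evaluate the window functions over it.
    tasks = queue.Queue()
    for chunk in global_bins.values():
        tasks.put(chunk)

    def worker():
        while True:
            try:
                chunk = tasks.get_nowait()
            except queue.Empty:
                return
            chunk.sort(key=lambda t: (t[0], t[1]))
            # ... evaluate the window functions over the sorted bin ...

    workers = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

In this sketch each bin is sorted by a single worker, so parallelism comes from having many bins rather than from parallelising any individual sort.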

@stas-sl I believe this has been mostly mitigated thanks to @hawkfish. On my machine your example now runs in 0.3 s. Closing this.

Masks improved the OP test by 5x on my MacBook Pro.

At a higher level, there is no attempt made to compare OVER specifications that I can see. So if the user has 12 window functions with identical OVER specifications (say from a WINDOW clause), this slow sorting will be repeated 12 times(!) A simple fix would be to pull the SortCollectionForWindow call up to the loop and bin the window functions by OVER specification.
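
To make the shared-OVER case concrete (assuming the WINDOW clause, which DuckDB supports), both functions below share the specification w, so ideally the underlying sort would run once rather than once per function. This reuses the db connection and data table from the report above:

# Two window functions sharing one OVER specification via a WINDOW clause.
db.execute('''
    select
        row_number() over w as rn,
        lag(value) over w as prev_value
    from data
    window w as (order by id)
''').fetchall()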

I ran into this today because I had 6 window queries with two functions (with identical OVERs) and thought that if I made them into a single query it would get faster. But since all the time goes into sorting and the same sort is redone every time, it made little difference.

I was thinking that we could replace both calls to EqualsSubset and one of the two calls to BinarySearchRightmost by just building two bitmaps (one for partitioning and one for ordering) that are 1 when a row is equal to the previous row. The partition bitmap starts as all 1s except for the first value. The order bitmap starts as a copy of the partition bitmap. The two calls to EqualsSubset are then reduced to bit tests and the first BinarySearchRightmost to a bit scan.

Constructing the bitmaps is column-wise, so it should be easy to templatise the equality checks using the physical types.

The second BinarySearchRightmost seems to do too much work since it is guaranteed to be within the partition already, so there is no need to compare the partitioning columns? The ordering bitmap could then be used to determine the next peer boundary.
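
Here is a minimal Python sketch of the bitmap idea, using lists of 0/1 flags in place of real bitmasks. part_cols and order_cols are hypothetical column arrays for the already-sorted input; DuckDB's actual implementation would build the masks column-wise over vectors with physical-type templates:

def build_equality_mask(cols, n_rows):
    # Bit i is 1 when row i equals row i-1 on every column in cols;
    # row 0 is always a boundary, so its bit starts at 0.
    mask = [0] + [1] * (n_rows - 1)
    for col in cols:  # column-wise construction
        for i in range(1, n_rows):
            if col[i] != col[i - 1]:
                mask[i] = 0
    return mask

def window_boundaries(part_cols, order_cols, n_rows):
    part_mask = build_equality_mask(part_cols, n_rows)
    # The order mask starts as a copy of the partition mask and is then
    # refined by the ordering columns.
    order_mask = list(part_mask)
    for col in order_cols:
        for i in range(1, n_rows):
            if col[i] != col[i - 1]:
                order_mask[i] = 0
    # A 0 bit marks a partition start (part_mask) or a peer-group start
    # (order_mask): the per-row EqualsSubset calls become bit tests and
    # finding the next boundary becomes a bit scan.
    partition_starts = [i for i in range(n_rows) if not part_mask[i]]
    peer_starts = [i for i in range(n_rows) if not order_mask[i]]
    return partition_starts, peer_starts

# e.g. one partition column and one ordering column over 5 rows:
# window_boundaries([[1, 1, 1, 2, 2]], [[10, 10, 20, 5, 5]], 5)
# returns ([0, 3], [0, 2, 3])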