duckdb: Window functions are very slow

Window functions are very slow: about 5 seconds for 1M rows and a simple query.

import time
import random
import duckdb

# Build a 1M-row table of random integers in an in-memory database.
db = duckdb.connect(':memory:')
db.execute('create table data(id integer, value integer, primary key(id))')

values = []
for i in range(1_000_000):
    values.append((i, random.randint(0, 100000)))
db.executemany('insert into data(id, value) VALUES (?, ?)', values)
db.commit()

# Time a trivial window query over the table.
start = time.time()
db.execute('select value, row_number() over(order by id) from data limit 1').fetchall()
duration = time.time() - start
print(f'{duration:.3f} s')

5.367 s

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (20 by maintainers)

Most upvoted comments

I’d be interested in looking at this. I’ve worked on the original Hyper Window operator and reimplemented it for the Tableau Data Engine. Do you have a list of tasks?

This brings up a larger issue around the current design of the physical window operator. If you look at Cao et al., it really suggests that logical window operator ordering is determined by the pair (WPK, WOK) (window partitioning key, window ordering key), and that multiple “functions” sharing this pair should be optimised as a single unit. If the physical window operator had a single pair like this, then there are multiple opportunities for performance improvement orthogonal to the Value-based sorting improvements proposed. In particular, a large amount of pre-processing can be performed on the input threads:

  • The sort keys, payloads, frame boundaries and lead/lag parameters can be computed in parallel in Sink
  • The partitioning keys can then be used immediately to hash these chunks into separate bins on each thread
  • Combine then merges/appends the chunks in the bins
  • Finalize can then spawn threads to independently perform the current operations, including sorting as in Leis et al. Work stealing can also be implemented using a task queue.

This approach also implies that parallelising the sort code may not be the most effective way to improve sorting performance unless there is only a single bin. At the very least, the sort operation might take a flag indicating whether parallel sorting is appropriate.
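
For illustration, here is a rough Python sketch of that Sink/Combine/Finalize flow for a single (WPK, WOK) pair. The row layout, the function names, the bin and thread counts, and the plain-Python hashing and sorting are all illustrative stand-ins for DuckDB's vectorised C++ operator interface, not its actual API:

import queue
import threading
from collections import defaultdict

N_BINS = 8      # hypothetical number of partition hash bins
N_THREADS = 4   # hypothetical Finalize worker count

def sink(rows):
    # Sink (per input thread): compute the keys for each row and hash it
    # into a bin by its partitioning key.
    bins = defaultdict(list)
    for row in rows:
        wpk, wok = row['part'], row['order']  # precomputed sort keys
        bins[hash(wpk) % N_BINS].append((wpk, wok, row))
    return bins

def combine(per_thread_bins):
    # Combine: append each thread's chunks into the matching global bin.
    global_bins = defaultdict(list)
    for bins in per_thread_bins:
        for b, chunk in bins.items():
            global_bins[b].extend(chunk)
    return global_bins

def finalize(global_bins):
    # Finalize: a task queue of bins; idle workers steal the next bin,
    # sort it by (WPK, WOK), and evaluate the window functions over it.
    tasks = queue.Queue()
    for chunk in global_bins.values():
        tasks.put(chunk)

    def worker():
        while True:
            try:
                chunk = tasks.get_nowait()
            except queue.Empty:
                return
            chunk.sort(key=lambda t: (t[0], t[1]))
            # ... evaluate the window functions over the sorted bin ...

    workers = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

In this sketch each bin is sorted by a single worker, so parallelism comes from having many bins rather than from parallelising any individual sort.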

@stas-sl I believe this has been mostly mitigated thanks to @hawkfish. On my machine your example now runs in 0.3 s. Closing this.

Masks improved the OP test by 5x on my MacBook Pro.

At a higher level, there is no attempt made to compare OVER specifications that I can see. So if the user has 12 window functions with identical OVER specifications (say from a WINDOW clause), this slow sorting will be repeated 12 times(!) A simple fix would be to pull the SortCollectionForWindow call up to the loop and bin the window functions by OVER specification.
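
To make the shared-OVER case concrete (assuming the WINDOW clause, which DuckDB supports), both functions below share the specification w, so ideally the underlying sort would run once rather than once per function. This reuses the db connection and data table from the report above:

# Two window functions sharing one OVER specification via a WINDOW clause.
db.execute('''
    select
        row_number() over w as rn,
        lag(value) over w as prev_value
    from data
    window w as (order by id)
''').fetchall()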

I ran into this today because I had 6 window queries with two functions (with identical OVERs) and thought that if I made them into a single query it would get faster. But since all the time goes into sorting and the same sort is redone every time, it made little difference.

I was thinking that we could replace both calls to EqualsSubset and one of the two calls to BinarySearchRightmost by just building two bitmaps (one for partitioning and one for ordering) that are 1 when a row is equal to the previous row. The partition bitmap starts as all 1s except for the first value. The order bitmap starts as a copy of the partition bitmap. The two calls to EqualsSubset are then reduced to bit tests and the first BinarySearchRightmost to a bit scan.

Constructing the bitmaps is column-wise, so it should be easy to templatise the equality checks using the physical types.

The second BinarySearchRightmost seems to do too much work since it is guaranteed to be within the partition already, so there is no need to compare the partitioning columns? The ordering bitmap could then be used to determine the next peer boundary.
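
Here is a minimal Python sketch of the bitmap idea, using lists of 0/1 flags in place of real bitmasks. part_cols and order_cols are hypothetical column arrays for the already-sorted input; DuckDB's actual implementation would build the masks column-wise over vectors with physical-type templates:

def build_equality_mask(cols, n_rows):
    # Bit i is 1 when row i equals row i-1 on every column in cols;
    # row 0 is always a boundary, so its bit starts at 0.
    mask = [0] + [1] * (n_rows - 1)
    for col in cols:  # column-wise construction
        for i in range(1, n_rows):
            if col[i] != col[i - 1]:
                mask[i] = 0
    return mask

def window_boundaries(part_cols, order_cols, n_rows):
    part_mask = build_equality_mask(part_cols, n_rows)
    # The order mask starts as a copy of the partition mask and is then
    # refined by the ordering columns.
    order_mask = list(part_mask)
    for col in order_cols:
        for i in range(1, n_rows):
            if col[i] != col[i - 1]:
                order_mask[i] = 0
    # A 0 bit marks a partition start (part_mask) or a peer-group start
    # (order_mask): the per-row EqualsSubset calls become bit tests and
    # finding the next boundary becomes a bit scan.
    partition_starts = [i for i in range(n_rows) if not part_mask[i]]
    peer_starts = [i for i in range(n_rows) if not order_mask[i]]
    return partition_starts, peer_starts

# e.g. one partition column and one ordering column over 5 rows:
# window_boundaries([[1, 1, 1, 2, 2]], [[10, 10, 20, 5, 5]], 5)
# returns ([0, 3], [0, 2, 3])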