duckdb: Window functions are very slow
Window functions are very slow - 5 seconds for 1M rows and a simple query
```python
import time
import random

import duckdb

db = duckdb.connect(':memory:')
db.execute('create table data(id integer, value integer, primary key(id))')
values = []
for i in range(1_000_000):
    values.append((i, random.randint(0, 100000)))
db.executemany('insert into data(id, value) values (?, ?)', values)
db.commit()

start = time.time()
db.execute('select value, row_number() over(order by id) from data limit 1').fetchall()
duration = time.time() - start
print(f'{duration:.3f} s')
```

Output:

```
5.367 s
```
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 25 (20 by maintainers)
Commits related to this issue
- Issue #1367: Window masks Convert the partition and peer locating code to use bitmasks that mark the boundaries instead of recomputing them with slow Value comparisons each time. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window masks Fix tidy issues. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window masks size > 0 => !empty — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window masks id_t => idx_t — committed to hawkfish/duckdb by deleted user 3 years ago
- Merge pull request #1496 from hawkfish/hawkfish-window-mask Issue #1367: Window masks — committed to duckdb/duckdb by Mytherin 3 years ago
- Issue #1367: Window over sharing Collect all the matching over clauses before sorting. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window mask templates Convert the MaskColumn code to templatised versions. While it doesn't seem to improve performance at the moment (sorting is the bottleneck), it also fixes two bugs, ... — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window mask templates Fix iota include(?) and untidy test. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window mask templates Remove const. Require 512 block size for window test. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window mask templates Update chunk more frequently for small window sizes. — committed to hawkfish/duckdb by deleted user 3 years ago
- Merge pull request #1500 from hawkfish/hawkfish-window-over Issue #1367: Window mask templates — committed to duckdb/duckdb by hannes 3 years ago
- Issue #1367: Window hash partition Generalise Sort and Reorder as SortSlice and ReorderSlice. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window hash partition Use hash partitioning to break an OVER clause up into 1024 disjoint partitions. Sort these partitions independently. The code is currently single-threaded, but the... — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window hash partition Fix formatting. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window hash partition Another non-deterministic test. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window hash partition More non-deterministic tests. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window partitions Refactor utility functions to take row Slices instead of working on all partitions at once. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window partitions Compute window functions on partitions instead of on the entire table. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window hash partition Fix fencepost error in segment tree creation. Offset segment tree ranges for call. Suppress linting for canonical container inner class name. — committed to hawkfish/duckdb by deleted user 3 years ago
- Issue #1367: Window hash partitioning Use larger ranges for computing window functions. Pre-populate the partition bitmap with all the boundaries to avoid redundant Value comparisons. — committed to hawkfish/duckdb by deleted user 3 years ago
I’d be interested in looking at this. I’ve worked on the original Hyper Window operator and reimplemented it for the Tableau Data Engine. Do you have a list of tasks?
This brings up a larger issue around the current design of the physical window operator. If you look at Cao et al., it really suggests that logical window operator ordering is determined by the pair `(WPK, WOK)`, and multiple "functions" sharing this pair should be optimised as a single unit. If the physical window operator had a single pair like this, then there are multiple opportunities for performance improvement orthogonal to the `Value`-based sorting improvements proposed. In particular, a large amount of pre-processing can be performed on the input threads' `Sink`; `Combine` then merges/appends the chunks in the bins; `Finalize` can then spawn threads to independently perform the current operations, including sorting as in Leis et al. Work stealing can also be implemented using a task queue.

This approach also implies that parallelising the sort code may not be the most effective way to improve sorting performance unless there is only a single bin. At the very least, the sort operation might take a flag indicating whether parallel sorting is appropriate.
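The binning idea above can be sketched in a few lines. This is a minimal illustration, not DuckDB internals: the dict-based representation of a window function and the name `plan_window_functions` are assumptions made for the example; the point is only that grouping by the (partition keys, order keys) pair lets each distinct `OVER` specification be sorted once.

```python
from collections import defaultdict

def plan_window_functions(functions):
    """Group window functions by their (partition keys, order keys) pair,
    i.e. the (WPK, WOK) pair from Cao et al., so that each distinct OVER
    specification is sorted only once."""
    groups = defaultdict(list)
    for fn in functions:
        key = (tuple(fn["partition_by"]), tuple(fn["order_by"]))
        groups[key].append(fn)
    return groups

# Three window functions, but only two distinct OVER specifications.
funcs = [
    {"name": "row_number", "partition_by": [],      "order_by": ["id"]},
    {"name": "rank",       "partition_by": [],      "order_by": ["id"]},
    {"name": "sum_value",  "partition_by": ["grp"], "order_by": ["id"]},
]
groups = plan_window_functions(funcs)
print(len(groups))  # 2 sorts instead of 3
```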
@stas-sl I believe this has been mostly mitigated thanks to @hawkfish. On my machine your example now runs in 0.3 s. Closing this.
Masks improved the OP test by 5x on my MacBook Pro.
At a higher level, there is no attempt made to compare `OVER` specifications that I can see. So if the user has 12 window functions with identical `OVER` specifications (say from a `WINDOW` clause), this slow sorting will be repeated 12 times(!) A simple fix would be to pull the `SortCollectionForWindow` call out of the loop and bin the window functions by `OVER` specification.

I ran into this today because I had 6 window queries with two functions (with identical `OVER`s) and thought that if I made them into a single query it would get faster. But since all the time goes into sorting and the same sort is redone every time, it made little difference.

I was thinking that we could replace both calls to `EqualsSubset` and one of the two `BinarySearchRightmost` calls by just building two bitmaps (one for partitioning and one for ordering) that are 1 when a row is equal to the previous row. The partition bitmap starts as all 1s except for the first value. The order bitmap starts as a copy of the partition bitmap. The two calls to `EqualsSubset` are reduced to bit tests, and the first `BinarySearchRightmost` is a bit scan. Constructing the bitmaps is column-wise, so it should be easy to templatise using the physical types to check for equality.
The second `BinarySearchRightmost` seems to do too much work, since it is guaranteed to be within the partition already, so there is no need to compare the partitioning columns? The ordering bitmap could then be used to determine the next peer boundary.
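Under that same 1-means-equal convention, finding the next peer boundary reduces to scanning the ordering bitmap for the next 0 bit. A minimal sketch (`next_peer_boundary` is a hypothetical name, and a list of booleans again stands in for a real bitmask with hardware bit-scan support):

```python
def next_peer_boundary(order_bitmap, row):
    """Return the exclusive end of the peer group containing `row`:
    the first later row whose bit is 0 (i.e. not equal to its
    predecessor on the ordering columns)."""
    end = row + 1
    while end < len(order_bitmap) and order_bitmap[end]:
        end += 1
    return end

# Ordering bitmap for 5 sorted rows with peer groups {0,1}, {2}, {3,4}.
o = [False, True, False, False, True]
print(next_peer_boundary(o, 0))  # 2 -> rows 0..1 are peers
print(next_peer_boundary(o, 3))  # 5 -> rows 3..4 are peers
```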