duckdb: Cumulative sum (window function) is much slower than stock pandas
What happens?
A window function computing a cumulative sum is much slower than the equivalent pandas code.
There might be a more efficient way to get the equivalent result out of DuckDB, but I’m using a sum over a window to obtain the result, which doesn’t seem to be very efficient.
I’m excited about the possibilities DuckDB would give us, and we were thinking about replacing some of our pandas code with more declarative SQL, but it seems that these “cumulative” queries are still a bottleneck for us (simple group-bys and aggregations, by contrast, are super quick in DuckDB).
The following example produces the same result with DuckDB and pandas:
To Reproduce
In [1]: import pandas as pd; import numpy as np; import duckdb; c = duckdb.connect()
In [2]: n = 1000_000; m=10_000; df = pd.DataFrame({"a": np.random.random(n), "b": np.random.random(n), "c": np.random.randint(0, m, n)}).reset_index()
In [3]: %timeit c.execute("SELECT index, (SUM(a) over (PARTITION BY c ORDER BY index ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS a FROM df ORDER BY index")
554 ms ± 4.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit df.groupby("c")["a"].cumsum().reset_index()
24 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The result looks like this (the values differ between runs because of the random numbers):
index a
0 0 0.401120
1 1 0.139281
2 2 0.167592
3 3 0.688402
4 4 0.118918
... ... ...
999995 999995 56.911257
999996 999996 44.170181
999997 999997 47.532669
999998 999998 50.699650
999999 999999 48.280013
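To confirm that both paths really agree numerically, the DuckDB result can be materialized as a pandas DataFrame and compared value by value. A minimal sketch, reusing c, df, and np from the session above (fetchdf() converts the query result into a DataFrame):

    duck = c.execute(
        "SELECT index, SUM(a) OVER (PARTITION BY c ORDER BY index "
        "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a "
        "FROM df ORDER BY index"
    ).fetchdf()
    # groupby().cumsum() preserves the original row order, so values align by position
    assert np.allclose(duck["a"].to_numpy(), df.groupby("c")["a"].cumsum().to_numpy())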
Environment (please complete the following information):
- OS: Ubuntu 20.04.4 LTS
- DuckDB Version: duckdb python package version: 0.3.3
- DuckDB Client: Python
About this issue
- State: closed
- Created 2 years ago
- Comments: 19 (13 by maintainers)
Commits related to this issue
- Issue #3453: Window Partition Collections Creating and scanning a ChunkCollection just wastes time. Removing it and building sort blocks as needed improves performance by 30%. — committed to hawkfish/duckdb by deleted user 2 years ago
- Issue #3453: Window Partition Benchmark Add the benchmark for the 30% perf improvement. — committed to hawkfish/duckdb by deleted user 2 years ago
- Merge pull request #3514 from hawkfish/hawkfish-window-partition Issue #3453: Window Partition Collections — committed to duckdb/duckdb by Mytherin 2 years ago
- Issue #3453: Early Window Partitions Move partitioning to the Sink phase so we only have to scan for it once. Partition into global sorts, to remove an unnecessary copy. Checkpoint commit of compilin... — committed to hawkfish/duckdb by deleted user 2 years ago
- Issue #3453: Early Window Partitions Fix some of the failures, starting with building the partition selections. Checkpoint commit of partially tested code. — committed to hawkfish/duckdb by deleted user 2 years ago
- Issue #3453: Early Window Partitions Checkpoint commit. Fix all the crash failures. Fix some sorting. Still some incorrect results at scale. — committed to hawkfish/duckdb by deleted user 2 years ago
- Issue #3453: Early Window Partitions Fix Combine and GetData so they don't lose data. All tests now pass. — committed to hawkfish/duckdb by deleted user 2 years ago
- Issue #3453: Early Window Partitions Rework hash groups into separate objects. Only use thread-local local sorts for large groups. Perf regression for window_partition.benchmark down to 1.8x, gated b... — committed to hawkfish/duckdb by deleted user 2 years ago
- Issue #3453: Early Window Partitions Release memory more aggressively. — committed to hawkfish/duckdb by deleted user 2 years ago
- Issue #3453: Early Window Partitions Expand buffer pool allocation error message. — committed to hawkfish/duckdb by deleted user 2 years ago
In my use case above, I was able to confirm that duckdb 0.5.0 is very fast. Thank you for the great library!
Hello!
I have validated your timings. As a general rule, DuckDB tends to be faster than Pandas on more complex queries, since the optimizer can make wise choices on your behalf. Pandas has a few advantages in this case that help certain applications but hurt others. The main one is that Pandas knows the pre-existing order of the dataframe and maintains that order. This is also part of why Pandas can only run on a single CPU core: it can’t easily parallelize without upsetting that order. So this result is somewhat expected.
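One way to make the comparison more apples-to-apples is to pin DuckDB to a single thread, since by default it uses all available cores. A minimal sketch, reusing the connection c from the reproduction above:

    # Restrict DuckDB to a single thread, matching single-core pandas,
    # so rerunning the timings isolates the algorithmic difference.
    c.execute("PRAGMA threads=1")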
I’ll also add a small note in case it is helpful! DuckDB is able to reuse Window definitions across multiple calculations. So this DuckDB query takes almost the same amount of time as the one you provided, but includes 3x the aggregates:
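The query itself is missing from this copy of the thread. As a rough reconstruction of what such a query could look like, a named WINDOW clause (which DuckDB supports) lets all three aggregates share one partition/sort definition; the particular aggregates below are illustrative, not the originals:

    c.execute("""
        SELECT index,
               SUM(a) OVER w AS a_sum,
               MIN(a) OVER w AS a_min,
               MAX(a) OVER w AS a_max
        FROM df
        WINDOW w AS (PARTITION BY c ORDER BY index
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        ORDER BY index
    """)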
Very cool - thanks! More coming later this month I hope.
In the above, duckdb is several times faster than dplyr alone. For reference, dtplyr (data.table) is faster still, but we should consider that the data source is an R data frame, so the conditions favor data.table.

This is the same issue, so no worries. It looks like we are spending 70% of our time partitioning and sorting the large number of partitions.
I’ll have a look to make sure there isn’t something dumb, but I suspect that @Alex-Monahan is right. The current operator even uses our fast code and hash partitioning, but doing nothing is always faster than doing something!
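For anyone who wants to reproduce this kind of breakdown, DuckDB has a built-in profiler that prints a per-operator timing tree after each query. A minimal sketch, reusing the connection c from the reproduction:

    c.execute("PRAGMA enable_profiling")  # print per-operator timings after each query
    c.execute(
        "SELECT index, SUM(a) OVER (PARTITION BY c ORDER BY index "
        "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a "
        "FROM df ORDER BY index"
    ).fetchall()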