duckdb: Cumulative sum via window functions is much slower than stock pandas

What happens?

Window function to compute cumulative sum is much slower than equivalent pandas code.

There might be a more efficient way to get the equivalent result out of DuckDB, but I’m using SUM with a window clause to obtain the result, which doesn’t seem to be very efficient.

I’m excited about the possibilities DuckDB would give us, and we were thinking about replacing some of our pandas code with more declarative SQL, but it seems that these “cumulative” queries are still a bottleneck for us (simple group-bys and aggregations, by contrast, are super quick in DuckDB).

The following example produces the same result with DuckDB and pandas:

To Reproduce

In [1]: import pandas as pd; import numpy as np; import duckdb; c = duckdb.connect()

In [2]: n = 1000_000; m=10_000; df = pd.DataFrame({"a": np.random.random(n), "b": np.random.random(n), "c": np.random.randint(0, m, n)}).reset_index()

In [3]: %timeit c.execute("SELECT index, (SUM(a) over (PARTITION BY c  ORDER BY index ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS a FROM df ORDER BY index")
554 ms ± 4.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.groupby("c")["a"].cumsum().reset_index()
24 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The result looks like this (the values differ between runs because of the random numbers):

         index          a
0            0   0.401120
1            1   0.139281
2            2   0.167592
3            3   0.688402
4            4   0.118918
...        ...        ...
999995  999995  56.911257
999996  999996  44.170181
999997  999997  47.532669
999998  999998  50.699650
999999  999999  48.280013
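
As a sanity check, here is a minimal sketch (assuming the df, c, and np objects from the repro above) to confirm that both approaches produce the same cumulative sums:

duck = c.execute(
    "SELECT index, SUM(a) OVER (PARTITION BY c ORDER BY index "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a "
    "FROM df ORDER BY index"
).fetchdf()                                     # materialize the DuckDB result as a DataFrame
pandas_cumsum = df.groupby("c")["a"].cumsum()   # same per-group cumulative sum in pandas
assert np.allclose(duck["a"].to_numpy(), pandas_cumsum.to_numpy())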

Environment (please complete the following information):

  • OS: Ubuntu 20.04.4 LTS
  • DuckDB Version: duckdb python package version: 0.3.3
  • DuckDB Client: Python

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19 (13 by maintainers)


Most upvoted comments

In my use case above, I was able to confirm that duckdb 0.5.0 is very fast. Thank you for the great library!

Hello!

I have validated your timings. As a general rule, DuckDB tends to be faster than Pandas on more complex queries, since the optimizer can make wise choices on your behalf. Pandas has a few advantages in this case that help certain applications but hurt others. The main one is that Pandas knows the pre-existing order of the dataframe and maintains that order. This is also part of why Pandas only runs on a single CPU core: it can’t easily parallelize without upsetting that order. So this result is somewhat expected in this case.
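
To make the comparison more like-for-like, one option (a sketch, assuming the connection c from the repro above) is to pin DuckDB to a single thread, since the pandas cumsum runs on one core:

c.execute("PRAGMA threads=1")   # restrict DuckDB to one core, like pandas
c.execute("SELECT index, SUM(a) OVER (PARTITION BY c ORDER BY index "
          "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a "
          "FROM df ORDER BY index")
c.execute("PRAGMA threads=4")   # restore parallelism afterwards (pick your core count)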

I’ll also add a small note in case it is helpful! DuckDB is able to reuse Window definitions across multiple calculations. So this DuckDB query takes almost the same amount of time as the one you provided, but includes 3x the aggregates:

%%timeit 
c.execute("""
    SELECT 
        index, 
        SUM(a) over cumsum_per_c AS sum_a, 
        AVG(a) over cumsum_per_c AS avg_a, 
        COUNT(a) over cumsum_per_c AS count_a 
    FROM df 
    WINDOW
        cumsum_per_c as (PARTITION BY c  ORDER BY index ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    ORDER BY index""")

Very cool - thanks! More coming later this month I hope.

If you happen to have some microbenchmark results to share that would be awesome!

In the benchmark above, duckdb is several times faster than plain dplyr. For reference, dtplyr (data.table) is faster still, but note that the data source is an R data frame, so the conditions favor data.table.

[benchmark results plot from bench::mark]

set.seed(42)

df <- tidyr::expand_grid(
  date = seq_len(10^6) |> lubridate::as_date(),
  item = 1:4,
  shop = LETTERS[1:4]
) |>
  dplyr::mutate(
    sale = (stats::rnorm(n = dplyr::n()) * 100) |>
      abs() |>
      round(0) |>
      as.integer()
  )

res <- bench::mark(
  check = FALSE, # Only duckdb returns dbl, others return int.
  min_iterations = 5,
  dplyr = df |>
    dplyr::group_by(date, item) |>
    dplyr::summarise(sale = sum(sale, na.rm = TRUE), .groups = "drop_last") |>
    dplyr::mutate(total = sum(sale, na.rm = TRUE)) |>
    dplyr::ungroup(),
  dtplyr = df |>
    dtplyr::lazy_dt() |>
    dplyr::group_by(date, item) |>
    dplyr::summarise(sale = sum(sale, na.rm = TRUE), .groups = "drop_last") |>
    dplyr::mutate(total = sum(sale, na.rm = TRUE)) |>
    dplyr::ungroup() |>
    dplyr::collect(),
  duckdb = df |>
    arrow::to_duckdb() |>
    dplyr::group_by(date, item) |>
    dplyr::summarise(sale = sum(sale, na.rm = TRUE), .groups = "drop_last") |>
    dplyr::mutate(total = sum(sale, na.rm = TRUE)) |>
    dplyr::ungroup() |>
    dplyr::collect(),
  `duckdb -> dplyr` = df |>
    arrow::to_duckdb() |>
    dplyr::group_by(date, item) |>
    dplyr::summarise(sale = sum(sale, na.rm = TRUE), .groups = "drop_last") |>
    dplyr::collect() |>
    dplyr::mutate(total = sum(sale, na.rm = TRUE)) |>
    dplyr::ungroup()
)

res |> ggplot2::autoplot("violin")

I thought about starting a new issue, but I posted it here because it seemed relevant to this one.

This is the same issue so no worries. It looks like we are spending 70% of our time partitioning and sorting the large number of partitions.

I’ll have a look to make sure there isn’t something dumb, but I suspect that @Alex-Monahan is right. The current operator even uses our fast code and hash partitioning, but doing nothing is always faster than doing something!
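
For anyone who wants to reproduce that breakdown themselves, here is a rough sketch using DuckDB’s built-in profiler; the exact output format differs between versions:

c.execute("PRAGMA enable_profiling")   # print an operator-level timing tree after each query
c.execute("SELECT index, SUM(a) OVER (PARTITION BY c ORDER BY index "
          "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a "
          "FROM df ORDER BY index").fetchall()
c.execute("PRAGMA disable_profiling")  # turn profiling back off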