duckdb: Cumulative sum via window functions is much slower than stock pandas

What happens?

Window function to compute cumulative sum is much slower than equivalent pandas code.

There might be a more efficient way to get the equivalent result out of DuckDB, but I’m using SUM with a window clause to obtain the result, which doesn’t seem to be very efficient.

I’m excited about the possibilities DuckDB would give us, and we were thinking about replacing some of our pandas code with more declarative SQL, but it seems that these “cumulative” queries are still a bottleneck for us (simple group-bys and aggregations, by contrast, are super quick in DuckDB).

The following example produces the same result with DuckDB and pandas:

To Reproduce

In [1]: import pandas as pd; import numpy as np; import duckdb; c = duckdb.connect()

In [2]: n = 1000_000; m=10_000; df = pd.DataFrame({"a": np.random.random(n), "b": np.random.random(n), "c": np.random.randint(0, m, n)}).reset_index()

In [3]: %timeit c.execute("SELECT index, (SUM(a) over (PARTITION BY c  ORDER BY index ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS a FROM df ORDER BY index")
554 ms ± 4.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.groupby("c")["a"].cumsum().reset_index()
24 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The result looks like this (the values differ between runs because of the random numbers):

         index          a
0            0   0.401120
1            1   0.139281
2            2   0.167592
3            3   0.688402
4            4   0.118918
...        ...        ...
999995  999995  56.911257
999996  999996  44.170181
999997  999997  47.532669
999998  999998  50.699650
999999  999999  48.280013
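
As a sanity check, here is a minimal sketch (assuming the df, c, and np objects from the repro above) to confirm that both approaches produce the same cumulative sums:

duck = c.execute(
    "SELECT index, SUM(a) OVER (PARTITION BY c ORDER BY index "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a "
    "FROM df ORDER BY index"
).fetchdf()                                     # materialize the DuckDB result as a DataFrame
pandas_cumsum = df.groupby("c")["a"].cumsum()   # same per-group cumulative sum in pandas
assert np.allclose(duck["a"].to_numpy(), pandas_cumsum.to_numpy())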

Environment (please complete the following information):

  • OS: Ubuntu 20.04.4 LTS
  • DuckDB Version: duckdb python package version: 0.3.3
  • DuckDB Client: Python

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19 (13 by maintainers)


Most upvoted comments

In my use case above, I was able to confirm that duckdb 0.5.0 is very fast. Thank you for the great library!

Hello!

I have validated your timings. As a general rule, DuckDB tends to be faster than Pandas on more complex queries, since the optimizer can make wise choices on your behalf. Pandas has a few advantages in this case that help certain applications but hurt others. The main one is that Pandas knows the pre-existing order of the dataframe and maintains that order. This is also part of why Pandas only runs on a single CPU core: it can’t easily parallelize without upsetting that order. So this result is somewhat expected in this case.
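
To make the comparison more like-for-like, one option (a sketch, assuming the connection c from the repro above) is to pin DuckDB to a single thread, since the pandas cumsum runs on one core:

c.execute("PRAGMA threads=1")   # restrict DuckDB to one core, like pandas
c.execute("SELECT index, SUM(a) OVER (PARTITION BY c ORDER BY index "
          "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a "
          "FROM df ORDER BY index")
c.execute("PRAGMA threads=4")   # restore parallelism afterwards (pick your core count)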

I’ll also add a small note in case it is helpful! DuckDB is able to reuse Window definitions across multiple calculations. So this DuckDB query takes almost the same amount of time as the one you provided, but includes 3x the aggregates:

%%timeit 
c.execute("""
    SELECT 
        index, 
        SUM(a) over cumsum_per_c AS sum_a, 
        AVG(a) over cumsum_per_c AS avg_a, 
        COUNT(a) over cumsum_per_c AS count_a 
    FROM df 
    WINDOW
        cumsum_per_c as (PARTITION BY c  ORDER BY index ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    ORDER BY index""")

Very cool - thanks! More coming later this month I hope.

If you happen to have some microbenchmark results to share that would be awesome!

In the benchmark above, duckdb is several times faster than plain dplyr. For reference, dtplyr (data.table) is faster still, but note that the data source is an R data frame, so the conditions favor data.table.

[benchmark results plot from bench::mark]

set.seed(42)

df <- tidyr::expand_grid(
  date = seq_len(10^6) |> lubridate::as_date(),
  item = 1:4,
  shop = LETTERS[1:4]
) |>
  dplyr::mutate(
    sale = (stats::rnorm(n = dplyr::n()) * 100) |>
      abs() |>
      round(0) |>
      as.integer()
  )

res <- bench::mark(
  check = FALSE, # Only duckdb returns dbl, others return int.
  min_iterations = 5,
  dplyr = df |>
    dplyr::group_by(date, item) |>
    dplyr::summarise(sale = sum(sale, na.rm = TRUE), .groups = "drop_last") |>
    dplyr::mutate(total = sum(sale, na.rm = TRUE)) |>
    dplyr::ungroup(),
  dtplyr = df |>
    dtplyr::lazy_dt() |>
    dplyr::group_by(date, item) |>
    dplyr::summarise(sale = sum(sale, na.rm = TRUE), .groups = "drop_last") |>
    dplyr::mutate(total = sum(sale, na.rm = TRUE)) |>
    dplyr::ungroup() |>
    dplyr::collect(),
  duckdb = df |>
    arrow::to_duckdb() |>
    dplyr::group_by(date, item) |>
    dplyr::summarise(sale = sum(sale, na.rm = TRUE), .groups = "drop_last") |>
    dplyr::mutate(total = sum(sale, na.rm = TRUE)) |>
    dplyr::ungroup() |>
    dplyr::collect(),
  `duckdb -> dplyr` = df |>
    arrow::to_duckdb() |>
    dplyr::group_by(date, item) |>
    dplyr::summarise(sale = sum(sale, na.rm = TRUE), .groups = "drop_last") |>
    dplyr::collect() |>
    dplyr::mutate(total = sum(sale, na.rm = TRUE)) |>
    dplyr::ungroup()
)

res |> ggplot2::autoplot("violin")

I thought about starting a new issue, but I posted it here because it seemed relevant to this one.

This is the same issue so no worries. It looks like we are spending 70% of our time partitioning and sorting the large number of partitions.

I’ll have a look to make sure there isn’t something dumb, but I suspect that @Alex-Monahan is right. The current operator even uses our fast code and hash partitioning, but doing nothing is always faster than doing something!
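
For anyone who wants to reproduce that breakdown themselves, here is a rough sketch using DuckDB’s built-in profiler; the exact output format differs between versions:

c.execute("PRAGMA enable_profiling")   # print an operator-level timing tree after each query
c.execute("SELECT index, SUM(a) OVER (PARTITION BY c ORDER BY index "
          "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS a "
          "FROM df ORDER BY index").fetchall()
c.execute("PRAGMA disable_profiling")  # turn profiling back off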