cudf: [BUG] Rolling window aggregations are very slow with large windows

With large windows, the .rolling() function in cuDF can be pathologically slow:

In [6]: dt = cudf.date_range("2001-01-01", "2002-01-01", freq="1s")
In [7]: df = cudf.DataFrame({"x": np.random.rand(len(dt))}, index=dt)
In [8]: %time df.rolling("1D").sum()
CPU times: user 10.3 s, sys: 57.1 ms, total: 10.3 s
Wall time: 10.4 s
Out[8]:
                                x
2001-01-01 00:00:00      0.815418
2001-01-01 00:00:01      1.238151
2001-01-01 00:00:02      1.811390
2001-01-01 00:00:03      2.065794
2001-01-01 00:00:04      2.195230
...                           ...
2001-12-31 23:59:55  43308.909704
2001-12-31 23:59:56  43309.098228
2001-12-31 23:59:57  43308.658888
2001-12-31 23:59:58  43308.790256
2001-12-31 23:59:59  43308.915838

[31536000 rows x 1 columns]

Why is it slow?

Of the ~10 s of execution time above, about 8 s is spent computing the window sizes, which is done in a hand-rolled numba CUDA kernel: https://github.com/rapidsai/cudf/blob/6f6e521257dce5732eea7b6b9d56243f8b0a69cc/python/cudf/cudf/utils/cudautils.py#L17. Note that running the code through a profiler will show the execution time being spent in the next CUDA kernel (column.full), but I think that's a red herring, because there is no synchronization after the numba kernel call.
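
For reference, that kernel is roughly of the following shape (a simplified sketch for illustration, not the actual code at the link above; the names and the exact window-boundary condition are my assumptions):

from numba import cuda

@cuda.jit
def window_sizes_from_offset(times, offset, out):
    # One thread per row: walk backwards until the time difference reaches
    # the offset, so each thread does O(window length) work.
    i = cuda.grid(1)
    if i < times.size:
        j = i
        while j > 0 and times[i] - times[j - 1] < offset:
            j -= 1
        # Number of rows whose timestamp lies in (times[i] - offset, times[i]].
        out[i] = i - j + 1

With a 1-day window over 1-second data, every thread scans on the order of 86,400 elements, which is consistent with the ~8 s spent here.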

What can we do about it?

I see a couple of options here:

  1. I wonder if there’s a better way to write that kernel. Currently, it naively launches one thread per element and does a linear search for the next element that would exceed the window bounds (one possible rewrite is sketched after this list).
  2. We could make it libcudf’s responsibility to compute the window sizes. I believe it already does window-size computation in the context of grouped rolling window aggregations: see grouped_range_rolling_window().
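
On option 1, one possible rewrite (an assumption on my part, not something cuDF does today) is to drop the per-row linear scan and compute every window start with a single batched binary search, e.g. with CuPy. This assumes a monotonically increasing index represented as int64 (e.g. nanoseconds) and a backward-looking window of the form (t - offset, t]:

import cupy as cp

def window_sizes_from_offset(times, offset):
    # times: sorted int64 timestamps; offset: int64 in the same unit.
    # The start of row i's window is the first index whose timestamp is
    # strictly greater than times[i] - offset, found for all rows at once
    # with one batched binary search.
    starts = cp.searchsorted(times, times - offset, side="right")
    return cp.arange(len(times)) - starts + 1

That turns the per-row O(window) search into O(log n) per row, at the cost of an extra temporary array for the shifted timestamps.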

About this issue

  • Original URL
  • State: open
  • Created 4 months ago
  • Reactions: 1
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Oh sorry, I meant a host profiler (in this case the Python profiler cProfile).

In your benchmark you’re constructing the inputs to the libcudf rolling function by hand. But going through the public API takes you down a code path that uses a numba kernel to do that.

Ah sorry, yes, now I see it. I was tricked by the lack of synchronisation in the numba kernel launch.

Yes, that kernel has exactly the same problem as the rolling window kernel: each row linearly searches backwards in the column until the difference between the current entry and a preceding one exceeds the requested offset.
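
For what it's worth, one way to take the linear scan out of such a kernel (just a sketch under the assumption of a sorted column and a (t - offset, t] window, not a proposal for the actual cuDF code) is to have each thread binary-search for its window start:

from numba import cuda

@cuda.jit
def window_sizes_binary_search(times, offset, out):
    i = cuda.grid(1)
    if i < times.size:
        # Find the smallest index lo in [0, i] that is still inside the
        # window, i.e. the first index with times[i] - times[lo] < offset.
        lo = 0
        hi = i
        while lo < hi:
            mid = (lo + hi) // 2
            if times[i] - times[mid] >= offset:
                lo = mid + 1   # mid is outside the window; move right
            else:
                hi = mid
        out[i] = i - lo + 1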

I think you can do this by doing a reverse prefix scan of the differences between the entries in the to-be-windowed column, and then … (brain out of gear again)