dask: Var calculation fails with non-pandas Dataframe

When using dask-cudf, the new parallel variance calculation will fail (PR https://github.com/dask/dask/pull/4865/) . This is probably due to two main causes

  • __array_interface__ isn’t fully implemented: np.X (var/nansum) may not be supported but can be added see https://github.com/rapidsai/cudf/issues/1728
  • Tightly using pandas signatures: pd.Series -> Series(), pd.concat -> Concat

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I believe we can close this now as a lot of work has gone into cudf to support __array_function__ calls:

In [1]: import cudf

In [2]: import dask.datasets

In [3]: ddf = dask.datasets.timeseries()

In [4]: cdf = ddf.map_partitions(cudf.from_pandas)

In [5]: cdf.head()
Out[5]:
                       id     name         x         y
timestamp
2000-01-01 00:00:00   983   George  0.355031 -0.200610
2000-01-01 00:00:01  1011    Kevin -0.784447 -0.212326
2000-01-01 00:00:02   991   George -0.295418  0.403791
2000-01-01 00:00:03   989  Norbert  0.474926 -0.494320
2000-01-01 00:00:04   963   Ingrid -0.845560 -0.586977

In [6]: cdf.x.var().compute()
Out[6]: 0.33295403416731695

In [7]: ddf.x.var().compute()
Out[7]: 0.33295403416731695

https://github.com/cupy/cupy/pull/2252 has just been merged, let me know if you encounter any issues with that @quasiben .