cudf: [FEA] dask-cudf doesn't support "corr"/correlation function like Pandas and cuDF

When attempting to compute a correlation such as sales_corr = sales['pr_review_rating', 'count'].corr(sales['pr_review_rating', 'mean']), dask-cudf fails with the following error:

TypeError: cannot concatenate object of type "<class 'cudf.core.series.Series'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

It seems that we may currently be limiting this in the dask-cudf backend. However, dask-cudf should behave exactly like cuDF and Pandas here. https://github.com/rapidsai/cudf/blob/4613ba821e4ed03a2db744f2c0bb0959fd450191/python/dask_cudf/dask_cudf/backends.py#L30-L33
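For reference, here is a minimal sketch of the Pandas behavior that dask-cudf is expected to match. The data and the pr_item_sk grouping column are made up for illustration (they are assumptions, not taken from the original report); only the final corr call mirrors the expression above.

```python
import pandas as pd

# Hypothetical review data; values and the grouping column are illustrative only.
reviews = pd.DataFrame(
    {
        "pr_item_sk": [1, 1, 2, 2, 2, 3],
        "pr_review_rating": [1.0, 3.0, 2.0, 4.0, 5.0, 5.0],
    }
)

# Aggregating with several functions yields MultiIndex columns such as
# ('pr_review_rating', 'count') and ('pr_review_rating', 'mean').
sales = reviews.groupby("pr_item_sk").agg({"pr_review_rating": ["count", "mean"]})

# Pearson correlation between the per-item review count and mean rating.
sales_corr = sales["pr_review_rating", "count"].corr(
    sales["pr_review_rating", "mean"]
)
print(sales_corr)
```

In Pandas this returns a single float between -1 and 1; the request in this issue is for the same call to work on a dask-cudf DataFrame.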

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 28 (16 by maintainers)

Most upvoted comments

I introduced the shape= argument in NumPy and CuPy to address exactly that shortcoming. It’s the only special case for array creation with __array_function__. The *_like functions dispatch via __array_function__ according to the first argument (if it is a NumPy array, dispatch to NumPy itself; if it is a CuPy array, dispatch to CuPy; and so on), and the new shape= argument allows us to create an arbitrarily-shaped array of the correct array type, which wasn’t possible before.

Confirmed this will work as expected, so no need for a Dask dispatch, sorry for false alarm 😅

In [1]: import numpy as np

In [2]: import cupy

In [3]: test = cupy.zeros(10)

In [4]: type(np.zeros_like(test, shape=5))
Out[4]: cupy.core.core.ndarray
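The same dispatch can be reproduced without a GPU. Below is a rough sketch that uses a hypothetical DuckArray class as a stand-in for cupy.ndarray; the class and its simplified argument handling are assumptions for illustration, not how CuPy implements it. It assumes NumPy >= 1.17, where zeros_like accepts shape= and __array_function__ is enabled by default.

```python
import numpy as np


class DuckArray:
    """Hypothetical minimal array type standing in for cupy.ndarray."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Only np.zeros_like is handled in this sketch.
        if func is np.zeros_like:
            proto = args[0] if args else kwargs["a"]
            # Simplification: shape= and dtype= are only read from keywords.
            shape = kwargs.get("shape")
            if shape is None:
                shape = proto.data.shape
            dtype = kwargs.get("dtype")
            if dtype is None:
                dtype = proto.data.dtype
            return DuckArray(np.zeros(shape, dtype=dtype))
        return NotImplemented


proto = DuckArray(np.ones(10))

# NumPy dispatches on the type of the first argument, so the result is a
# DuckArray with the requested shape rather than a numpy.ndarray.
out = np.zeros_like(proto, shape=5)
print(type(out).__name__, out.data.shape)
```

The point is that the caller only ever writes np.zeros_like(..., shape=...), and the type of the prototype decides which library builds the result.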

It’s important to note that this failure was the product of an issue with cuDF and upstream libraries. IIRC @rjzamora included fixes to both cuDF and Dask, each of which includes tests. @pentschev also implemented nansum in CuPy, which has its own test. So I think this is covered pretty well. That said, if there is another test you would like to add, I think it would be happily accepted 🙂

This should be resolved as of now with CuPy >= 7.