cudf: [BUG] Dask-Cudf losing index name in group-by on string column

Describe the bug

Dask-cuDF seems to lose the index name after a groupby on a string column: the computed result has no index name even though the lazy collection still reports one. This causes bugs downstream.

Steps/Code to reproduce bug

import dask_cudf
import cudf

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

None
1

Expected behavior

The behavior should match that of non-string columns, where the index name is preserved after compute().

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':[1,2]*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

1
1

Environment overview (please complete the following information)

  • Method of cuDF install: conda
cudf                      0.11.0a191119         py37_3035    rapidsai-nightly
dask-cudf                 0.11.0a191119         py37_3035    rapidsai-nightly
libcudf                   0.11.0a191119     cuda10.1_2997    rapidsai-nightly

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

@rjzamora since this only occurs when there are multiple partitions, do you think it could be related to serialization / deserialization?
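One rough way to probe the serialization hypothesis is to round-trip a single partition result and compare the index name before and after. The sketch below uses plain `pickle`, which is an assumption: the distributed scheduler may use its own serializers for cuDF objects, so passing this check would not rule the hypothesis out. The helper name `index_name_survives_pickle` and the stand-in object are hypothetical and only there so the snippet runs without a GPU.

```python
import pickle
from types import SimpleNamespace


def index_name_survives_pickle(frame):
    """Return True if frame.index.name is unchanged after a pickle round trip."""
    restored = pickle.loads(pickle.dumps(frame))
    return restored.index.name == frame.index.name


# Stand-in object (hypothetical) so the sketch runs without cuDF installed.
# With cuDF available, pass a real cudf.DataFrame with a named index instead.
stand_in = SimpleNamespace(index=SimpleNamespace(name="1"))
print(index_name_survives_pickle(stand_in))
```

If the check fails on a real cuDF groupby result with a string index, the name is being dropped during serialization rather than in the aggregation itself.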

This does not appear to be solved yet for the distributed scheduler, which may be related to the behavior in #3443.

import dask_cudf
import cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

None
1

The same code without the distributed cluster:

import dask_cudf
import cudf

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

1
1
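
Where the name is still lost, one possible workaround is to copy the index name from the lazy collection's metadata (which stays correct in the outputs above) onto the computed result. This is only a sketch and not the maintainers' fix: the helper `restore_index_name` is hypothetical, and it assumes the computed object allows assigning to `index.name`. Stand-in objects are used so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def restore_index_name(computed, lazy):
    """Copy lazy.index.name onto computed if the computed result lost it."""
    if computed.index.name is None and lazy.index.name is not None:
        computed.index.name = lazy.index.name
    return computed


# Stand-ins (hypothetical) mirroring the reported state:
# the lazy collection reports '1', the computed result reports None.
lazy = SimpleNamespace(index=SimpleNamespace(name="1"))
computed = SimpleNamespace(index=SimpleNamespace(name=None))

fixed = restore_index_name(computed, lazy)
print(fixed.index.name)
```

With the reproducer above, the intended call would be `restore_index_name(gdf.compute(), gdf)`.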