cudf: [BUG] Dask-Cudf losing index name in group-by on string column

Describe the bug

Dask-cuDF seems to lose the index name after a groupby on a string column. This causes bugs downstream.

Steps/Code to reproduce bug

import dask_cudf
import cudf

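# Two partitions; the group-by key '1' is a string column.
df = dask_cudf.from_cudf(cudf.DataFrame(data={'1': ['s1', 's2'] * 4, '2': [0, 1] * 4}), npartitions=2)
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.compute().index.name)  # name on the computed result
print(gdf.index.name)            # name on the lazy collection's metadata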

Output

None
1

Expected behavior

The index name should be preserved after compute(), as it already is for non-string columns:

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':[1,2]*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

1
1
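The comparison made by the two print calls above can be wrapped in a small helper for use as a regression check. This is only a sketch: `check_index_name` is a hypothetical name, not part of dask_cudf or cudf, and it assumes nothing beyond the `.index.name` and `.compute()` attributes already used in the reproducer.

```python
def check_index_name(lazy_frame):
    """Compare the index name on a lazy frame with the name after compute().

    Hypothetical helper (not a dask_cudf API). It relies only on the
    `.index.name` and `.compute()` attributes used in the reproducer.
    Returns (meta_name, computed_name, names_match).
    """
    meta_name = lazy_frame.index.name
    computed_name = lazy_frame.compute().index.name
    return meta_name, computed_name, meta_name == computed_name
```

Called on the string-keyed `gdf` from the reproducer, this would be expected to report a mismatch (`'1'` versus `None`) for as long as the bug is present, and a match for the integer-keyed case.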

Environment overview

Method of cuDF install: conda

cudf                      0.11.0a191119         py37_3035    rapidsai-nightly
dask-cudf                 0.11.0a191119         py37_3035    rapidsai-nightly
libcudf                   0.11.0a191119     cuda10.1_2997    rapidsai-nightly

About this issue

State: closed
Created 5 years ago
Comments: 19 (12 by maintainers)

Commits related to this issue

fix #3420 by preserving index name in StringColumn concatination — committed to rjzamora/cudf by rjzamora 5 years ago
possible fix for #3420 in distributed — committed to rjzamora/cudf by rjzamora 5 years ago

Most upvoted comments

@rjzamora since this only occurs when there are multiple partitions, do you think it could be serialization/deserialization related?

kkraus14 on Nov 22, 2019
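One rough way to probe the serialization hypothesis is to round-trip a single object and compare index names before and after. The sketch below uses the standard-library `pickle` module purely as a stand-in; the serializer that a Dask scheduler actually uses may differ, so a passing check here would not rule the hypothesis out. `index_name_survives_pickle` is a hypothetical helper name.

```python
import pickle


def index_name_survives_pickle(frame):
    """Return True if frame.index.name is unchanged by a pickle round trip.

    Hypothetical helper. pickle is an assumption standing in for whatever
    serializer the scheduler uses; `frame` is any picklable object that
    exposes `.index.name`.
    """
    restored = pickle.loads(pickle.dumps(frame))
    return frame.index.name == restored.index.name
```

The idea would be to pass in a single-partition result with a string index, so that the round trip is tested in isolation from the multi-partition concatenation step.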

This does not appear to be solved yet for the distributed scheduler, which may be related to the behavior in #3443.

import dask_cudf
import cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

None
1

Without the distributed client, the index name is preserved:

import dask_cudf
import cudf

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

1
1

beckernick on Nov 22, 2019