cudf: [BUG] Dask-Cudf losing index name in group-by on string column
Describe the bug
Dask seems to be losing the index name after a groupby on string columns. This causes bugs downstream.
Steps/Code to reproduce bug
import dask_cudf
import cudf
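# Two partitions; the group-by key '1' is a string column
df = dask_cudf.from_cudf(cudf.DataFrame(data={'1': ['s1', 's2'] * 4, '2': [0, 1] * 4}), npartitions=2)
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.compute().index.name)  # index name on the computed result
print(gdf.index.name)  # index name on the lazy (uncomputed) collection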
Output
None
1
Expected behavior
The expected behavior is the same as for non-string columns, where the index name is preserved after compute:
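# Same frame, but the group-by key '1' is an integer column
df = dask_cudf.from_cudf(cudf.DataFrame(data={'1': [1, 2] * 4, '2': [0, 1] * 4}), npartitions=2)
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.compute().index.name)  # index name on the computed result
print(gdf.index.name)  # index name on the lazy (uncomputed) collection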
Output
1
1
Environment overview
- Method of cuDF install: conda
cudf 0.11.0a191119 py37_3035 rapidsai-nightly
dask-cudf 0.11.0a191119 py37_3035 rapidsai-nightly
libcudf 0.11.0a191119 cuda10.1_2997 rapidsai-nightly
About this issue
- State: closed
- Created 5 years ago
- Comments: 19 (12 by maintainers)
Commits related to this issue
- fix #3420 by preserving index name in StringColumn concatination — committed to rjzamora/cudf by rjzamora 5 years ago
- possible fix for #3420 in distributed — committed to rjzamora/cudf by rjzamora 5 years ago
@rjzamora since this only occurs if there’s multiple partitions, think it could be serialization / deserialization related?
This appears not to be solved yet for the distributed scheduler, which may be related to the behavior in #3443
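Until the name survives on every scheduler, one possible stopgap is to copy the name from the lazy collection back onto the computed result. The sketch below is only an illustration: `restore_index_name` is a hypothetical helper, not part of cudf or dask_cudf, and it assumes the computed result exposes a writable `.index.name` the way pandas does. A plain stand-in object replaces `gdf.compute()` so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def restore_index_name(computed, expected_name):
    """Reattach the index name if it was dropped during compute().

    Hypothetical helper: `computed` is any object with a writable
    `.index.name` attribute (assumed here, as in pandas).
    """
    if computed.index.name is None and expected_name is not None:
        computed.index.name = expected_name
    return computed


# Stand-in for a computed result whose index name was lost.
fake_result = SimpleNamespace(index=SimpleNamespace(name=None))
fixed = restore_index_name(fake_result, "1")
print(fixed.index.name)

# A result that kept its name is left untouched.
named_result = SimpleNamespace(index=SimpleNamespace(name="key"))
kept = restore_index_name(named_result, "1")
print(kept.index.name)
```

With the reproducer above, the call would presumably look like `restore_index_name(gdf.compute(), gdf.index.name)`, since `gdf.index.name` still reports the correct name before compute.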