cudf: [FEA] Support drop_duplicates on Series containing list objects

Hi!

Is your feature request related to a problem? Please describe.

It would be useful to be able to drop duplicates from a Series that contains list objects.

Describe the solution you’d like

To be able to call drop_duplicates on a Series containing list objects.

Describe alternatives you’ve considered

We are using the code below, which is very CPU intensive.

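# Remove duplicate rows, ignoring the order of values within each row:
# build a canonical string key per row on the CPU, then drop duplicates on it.
compliant_indices["tmp"] = ["_".join(str(z) for z in sorted(row)) for row in compliant_indices.values.tolist()]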
compliant_indices.drop_duplicates(subset="tmp", inplace=True)
compliant_indices.drop(columns=['tmp'], inplace=True)
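
For reference, here is a minimal CPU-only sketch of the same idea on plain Python lists rather than a cuDF DataFrame. Each row is reduced to an order-insensitive key (a tuple of its sorted values) and only the first row per key is kept. The function name `drop_duplicate_rows` is hypothetical and is not part of cuDF or pandas.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, ignoring element order within a row."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row))  # canonical, hashable key for this row
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [[3, 1, 2], [2, 8, 4], [1, 2, 3]]
unique_rows = drop_duplicate_rows(rows)
print(unique_rows)
```

Tuple keys avoid building a string per row, but the work still happens on the CPU, so this does not address the request for GPU support.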

Additional context

Thanks!!!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 21 (18 by maintainers)

Most upvoted comments

This feature is now available in 23.06.

>>> df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
>>> df['a'].drop_duplicates()
0    [1, 3, 5, 7]
1       [2, 8, 4]
Name: a, dtype: list

Sorting within lists is supported on GPU as well, so I think you’ll be good once list->string concat is merged.

In 22.06, drop_duplicates uses a sort-based algorithm and relies on the lexicographic comparator, which does not yet support LIST columns. We expect this issue to be closed by #11129.

import cudf
df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
df['a'].drop_duplicates()
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/table/row_operators.cu:267: Cannot lexicographic compare a table with a LIST column
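
To show why a sort-based drop_duplicates depends on a lexicographic comparator, here is a small pure-Python sketch of the general technique. It is not libcudf's implementation, and `drop_duplicates_sorted` is a hypothetical name. Rows are ordered lexicographically, duplicates are detected by comparing each row with its neighbor in sorted order, and the surviving rows are returned in their original order.

```python
def drop_duplicates_sorted(rows):
    """Sort-based duplicate removal that keeps the first occurrence of each row."""
    # Sort row indices by row content; Python compares lists lexicographically.
    # sorted() is stable, so equal rows stay in original index order.
    order = sorted(range(len(rows)), key=lambda i: rows[i])
    keep = []
    for pos, idx in enumerate(order):
        # A row is a duplicate if it equals its predecessor in sorted order.
        if pos == 0 or rows[idx] != rows[order[pos - 1]]:
            keep.append(idx)
    keep.sort()  # restore the original row order
    return [rows[i] for i in keep]


rows = [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]
result = drop_duplicates_sorted(rows)
print(result)
```

Without a way to order two list rows, the sort step cannot run, which is the failure reported in the error above.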

@miguelusque - a related issue to watch for the workaround

@miguelusque it likely will not be. Operations that compare lists require changes to our row comparator and have the potential to slow down all of libcudf if not done carefully. We’re trying to figure out a path forward here, but it’s likely going to take some time.
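
As a rough illustration of what comparing list rows involves, the sketch below assumes an Arrow-style layout in which a list column is stored as an offsets array plus a flat array of child values. It is written in Python for readability and is not libcudf's row comparator; `compare_list_rows` and its parameters are hypothetical names.

```python
def compare_list_rows(offsets, values, i, j):
    """Return -1, 0, or 1 comparing list rows i and j lexicographically."""
    a_start, a_end = offsets[i], offsets[i + 1]
    b_start, b_end = offsets[j], offsets[j + 1]
    # Walk both variable-length slices element by element.
    for a, b in zip(range(a_start, a_end), range(b_start, b_end)):
        if values[a] != values[b]:
            return -1 if values[a] < values[b] else 1
    # The common prefix is equal, so the shorter row sorts first.
    len_a, len_b = a_end - a_start, b_end - b_start
    return (len_a > len_b) - (len_a < len_b)


# Rows: [1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]
offsets = [0, 4, 7, 11]
values = [1, 3, 5, 7, 2, 8, 4, 1, 3, 5, 7]
print(compare_list_rows(offsets, values, 0, 2))
```

Unlike a fixed-width column, each comparison here takes a data-dependent number of steps, which gives a sense of why adding list support to a shared comparator needs care.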