cudf: [FEA] Support drop_duplicates on Series containing list objects

Hi!

Is your feature request related to a problem? Please describe.

It would be useful to be able to drop duplicates over a series which contains list objects.

Describe the solution you’d like

To be able to drop duplicates on series containing list objects.

Describe alternatives you’ve considered

We are using the code below, which is very CPU intensive.

# Remove duplicate rows, ignoring the order of values within each row.
# A temporary string key is built from each row's sorted values.
rows = compliant_indices.values.tolist()
compliant_indices["tmp"] = ["_".join(str(v) for v in sorted(row)) for row in rows]
compliant_indices.drop_duplicates(subset="tmp", inplace=True)
compliant_indices.drop(columns=["tmp"], inplace=True)
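
For readers who want to see what the workaround computes, here is a minimal plain-Python sketch of the same idea: derive an order-insensitive key for each row and keep only the first row seen for each key. It is an illustration only and does not use cuDF. The function name `drop_duplicate_rows` and the sample data are hypothetical, and a tuple of sorted values stands in for the joined string key.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, ignoring element order within a row."""
    seen = set()
    kept = []
    for row in rows:
        # A tuple of the sorted values is hashable and order-insensitive,
        # playing the role of the temporary "tmp" string key.
        key = tuple(sorted(row))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample data: the third row is a reordering of the first.
rows = [[1, 3, 5, 7], [2, 8, 4], [7, 5, 3, 1]]
unique_rows = drop_duplicate_rows(rows)
print(unique_rows)
```

Like the original workaround, this runs row by row on the CPU, which is the cost the request is trying to avoid.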

Additional context

Thanks!!!

About this issue

State: closed
Created 4 years ago
Comments: 21 (18 by maintainers)

Most upvoted comments

This feature is now available in 23.06.

>>> df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
>>> df['a'].drop_duplicates()
0    [1, 3, 5, 7]
1       [2, 8, 4]
Name: a, dtype: list

GregoryKimball on May 31, 2023

Sorting within lists is supported on GPU as well, so I think you’ll be good once list->string concat is merged.

randerzander on Apr 21, 2021

In 22.06, drop_duplicates uses a sort-based algorithm and relies on the lexicographic comparator, which does not support LIST columns, so the call below fails. We expect this issue to be closed by #11129.

import cudf
df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
df['a'].drop_duplicates()

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/table/row_operators.cu:267: Cannot lexicographic compare a table with a LIST column
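
To illustrate why a sort-based approach depends on a lexicographic comparison, here is a rough plain-Python sketch. It is not libcudf's implementation: the function name `sort_based_drop_duplicates` is hypothetical, and Python's built-in list ordering stands in for the row comparator. Rows are ordered so that equal values become adjacent, then only the first row of each run is kept.

```python
def sort_based_drop_duplicates(values):
    """Drop duplicate lists by sorting, keeping the first occurrence of each."""
    # Order the row indices by value. Python compares lists lexicographically,
    # and sorted() is stable, so equal rows stay in their original order.
    order = sorted(range(len(values)), key=lambda i: values[i])

    keep = []
    prev = None
    for i in order:
        # Equal rows are adjacent after sorting, so one comparison
        # against the previous row is enough to detect a duplicate.
        if prev is None or values[i] != values[prev]:
            keep.append(i)
        prev = i

    keep.sort()  # restore the original row order
    return [values[i] for i in keep]


data = [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]
result = sort_based_drop_duplicates(data)
print(result)
```

Without an ordering defined for list values, the sort step has nothing to work with, which matches the error above.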

GregoryKimball on Aug 1, 2022

@miguelusque - a related issue to watch for the workaround

randerzander on Apr 27, 2021

@miguelusque it likely will not be. Operations that require comparing lists require changes to our row comparator and have the potential to slow down all of libcudf if not done carefully. We’re trying to figure out a path forward here, but it’s likely going to take some time.

kkraus14 on Apr 21, 2021