cudf: [FEA] Support drop_duplicates on Series containing list objects
Hi!
Is your feature request related to a problem? Please describe.
It would be useful to be able to drop duplicates over a series which contains list objects.
Describe the solution you’d like
To be able to drop duplicates on series containing list objects.
Describe alternatives you’ve considered
We are using the code below, which is very CPU intensive.
# Remove duplicates: build a sorted, "_"-joined string key per row, then dedupe on it
compliant_indices["tmp"] = [
    "_".join(str(z) for z in sorted(row))
    for row in compliant_indices.values.tolist()
]
compliant_indices.drop_duplicates(subset="tmp", inplace=True)
compliant_indices.drop(columns=["tmp"], inplace=True)
Additional context
Thanks!!!
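For readers who want to see what the workaround above computes, here is a minimal pure-Python sketch of the same idea: reduce each row to a canonical key, then keep the first row seen for each key. It is an illustration only, not cudf code and not GPU-accelerated. The function name `drop_duplicate_rows` is hypothetical, and a sorted tuple stands in for the joined string used as the key in the original snippet.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, ignoring element order within a row."""
    seen = set()
    unique = []
    for row in rows:
        # Sorting makes [3, 1, 2] and [1, 2, 3] map to the same key;
        # a tuple is used because lists are not hashable.
        key = tuple(sorted(row))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [[3, 1, 2], [1, 2, 3], [4, 5], [5, 4], [6]]
print(drop_duplicate_rows(rows))  # [[3, 1, 2], [4, 5], [6]]
```

Because the key is sorted, rows that differ only in element order count as duplicates, which matches the `sorted(...)` call in the workaround. If order should matter, `tuple(row)` would be the key instead.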
About this issue
- State: closed
- Created 4 years ago
- Comments: 21 (18 by maintainers)
This feature is now available in 23.06.
Sorting within lists is supported on GPU as well, so I think you’ll be good once list->string concat is merged.
In 22.06, drop_duplicates uses a sort-based algorithm and relies on the lexicographic comparator. We expect this will be closed by #11129.
@miguelusque - a related issue to watch for the workaround
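To illustrate why a sort-based drop_duplicates depends on a lexicographic comparator for lists, here is a small pure-Python sketch of the general technique: sort the rows, then drop every row equal to its predecessor in sorted order. This is an assumption-laden illustration, not libcudf's implementation. The name `sort_based_unique` is hypothetical, and Python's built-in element-by-element list comparison stands in for the comparator.

```python
def sort_based_unique(lists):
    """Deduplicate lists by sorting lexicographically and dropping adjacent repeats."""
    # Python compares lists element by element (lexicographically).
    # Sorting indices, with the index as a tie-breaker, keeps equal lists
    # adjacent and puts the earliest occurrence first in each run.
    order = sorted(range(len(lists)), key=lambda i: (lists[i], i))

    keep = []
    prev = None  # index of the previously visited row in sorted order
    for i in order:
        if prev is None or lists[i] != lists[prev]:
            keep.append(i)  # first member of a run of equal lists
        prev = i

    # Return the surviving rows in their original order.
    return [lists[i] for i in sorted(keep)]


data = [[1, 2], [3], [1, 2], [1], [3], []]
print(sort_based_unique(data))  # [[1, 2], [3], [1], []]
```

Every comparison in the sort and in the adjacent-equality check operates on whole lists, so the approach only works once the comparator can order and equate list values.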
@miguelusque it likely will not be. Operations that require comparing lists require changes to our row comparator and have the potential to slow down all of libcudf if not done carefully. We’re trying to figure out a path forward here but it’s likely going to take some time.