cudf: [FEA] `drop_duplicates` and `distinct_count` behavior/implementation is very inefficient

Is your feature request related to a problem? Please describe.

https://github.com/rapidsai/cudf/issues/9411 made me take a closer look at cudf::drop_duplicates and cudf::distinct_count.

Unlike std::unique, both of these APIs drop/count duplicate rows across the entire table (as opposed to considering only contiguous equivalent rows).

On the surface, this seems convenient for a user because they don’t have to worry about sorting their dataframe if they want to get only the unique rows. However, it has an insidious impact on performance.

Imagine if you were to call distinct_count and then drop_duplicates. The way these functions are currently implemented, each call performs its own sort of the input.

Instead, if distinct_count and drop_duplicates worked like std::unique and the user had to first sort the input, then only one sort would be needed. Alternatively, the data may already be sorted (as is the case with Python indexes), in which case no sort would be necessary.

The current behavior is very suboptimal for performance, as it can require two redundant multi-column sorts. Multi-column sorts are among the most expensive operations in libcudf, so this is a bad thing.

(Furthermore, if the data isn’t already sorted, using a sort-based implementation is likely to be much slower than a hash-based implementation, so we should look at refactoring these implementations).

Describe the solution you’d like

I’d like to do two things:

  1. Update drop_duplicates/distinct_count to work like std::unique, i.e., they only consider contiguous equivalent elements.
    • This would require the data to be presorted to preserve the existing behavior. Note that this also requires the user to describe how the data is sorted, e.g., column ordering and null_order. We can look at groupby for an example of how we’ve handled this there.
  2. Add unordered_drop_duplicates and unordered_distinct_count that behave like the current APIs.
    • We should look at using a hash-based implementation for the unordered_* algorithms, as it will likely be much faster.

Describe alternatives you’ve considered

Why not preserve the existing APIs and add ordered_drop_duplicates and ordered_distinct_count?

While this is certainly an option, I think that the behavior I described above would be more canonical for C++.

Why have both unordered_* and ordered versions?

If the data is already sorted, there’s a good chance a sort-based implementation could be faster than a hash-based one, but we can test that and see.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (16 by maintainers)

Most upvoted comments

As a promising anecdote, @gaohao95 has already experimented with a hash-based implementation of drop_duplicates using cuCollections and reports it’s 100x faster than the current sort-based solution.

The fact that it calls out that it uses a hash table means it’s computing total uniqueness (across the whole table, not just adjacent rows).

Let’s make sure to make our docs more clear. Understanding what our APIs do shouldn’t require inference from implementation details.

What does Pandas unique return for [4, 4, 0, 1, 2, 2, 4, 4]? std::unique would return [4, 0, 1, 2, 4].

Pandas would return [4, 0, 1, 2].