cudf: [FEA] `drop_duplicates` and `distinct_count` behavior/implementation is very inefficient
Is your feature request related to a problem? Please describe.
https://github.com/rapidsai/cudf/issues/9411 made me take a closer look at `cudf::drop_duplicates` and `cudf::distinct_count`.
Unlike `std::unique`, both of these APIs deduplicate/count distinct rows across the entire table (as opposed to only contiguous equivalent rows).
On the surface, this seems convenient for a user because they don’t have to worry about sorting their dataframe if they want to get only the unique rows. However, it has an insidious impact on performance.
Imagine if you were to call `distinct_count` and then `drop_duplicates`. The way these functions are currently implemented, each one performs its own sort of the input.
Instead, if `distinct_count` and `drop_duplicates` worked like `std::unique` and the user had to first sort the input, then only one sort would be needed. Alternatively, the data may already be sorted (as is the case with Python indexes), where no sort would be necessary.
The current behavior is very sub-optimal for performance, as it can require two redundant multi-column sorts. Multi-column sorts are among the most expensive operations in libcudf, so this is a bad thing.
(Furthermore, if the data isn’t already sorted, using a sort-based implementation is likely to be much slower than a hash-based implementation, so we should look at refactoring these implementations).
Describe the solution you’d like
I’d like to do two things:
- Update `drop_duplicates`/`distinct_count` to work like `std::unique`, i.e., only considering contiguous equivalent elements.
  - This would require the data be presorted to preserve the existing behavior. Note that this also requires the user to describe how the data is sorted, e.g., via `null_order` and such. We can look at `groupby` for an example of how we’ve handled this there.
- Add `unordered_drop_duplicates` and `unordered_distinct_count` that behave like the current APIs.
  - We should look at using a hash-based implementation for the `unordered_*` algorithms as it will likely be much faster.
Describe alternatives you’ve considered
Why not preserve the existing APIs and add `ordered_drop_duplicates` and `ordered_distinct_count`?
While this is certainly an option, I think that the behavior I described above would be more canonical for C++.
Why have both `unordered_*` and ordered versions?
If the data is already sorted, there’s a good chance a sort-based implementation could be faster than a hash-based one, but we can test that and see.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (16 by maintainers)
Commits related to this issue
- Support structs for `cudf::contains` with column/scalar input (#9929) This PR adds support for `cudf::contains` so we can check whether a structs column contains a scalar struct element. Partially... — committed to rapidsai/cudf by ttnghia 2 years ago
- Optimize compaction operations (#10030) Related to https://github.com/rapidsai/cudf/issues/9413. This PR adds `unordered_drop_duplicates`/`unordered_distinct_count` APIs by using hash-based algori... — committed to rapidsai/cudf by PointKernel 2 years ago
- Refactor stream compaction APIs (#10370) Closes https://github.com/rapidsai/cudf/issues/9413 Depending on https://github.com/rapidsai/cudf/pull/10387. There are several changes involved in this... — committed to rapidsai/cudf by PointKernel 2 years ago
As a promising anecdote, @gaohao95 has already experimented with a hash-based solution of `drop_duplicates` with cuCollections and reports it’s 100x faster than the current sort-based solution.

Let’s make sure to make our docs more clear. Understanding what our APIs do shouldn’t require inference from implementation details.
Pandas would return `[4, 0, 1, 2]`.