cudf: [FEA] `drop_duplicates` and `distinct_count` behavior/implementation is very inefficient
Is your feature request related to a problem? Please describe.
https://github.com/rapidsai/cudf/issues/9411 made me take a closer look at `cudf::drop_duplicates` and `cudf::distinct_count`.
Unlike `std::unique`, both of these APIs deduplicate/count distinct rows across the entire table (as opposed to only contiguous equivalent rows).
On the surface, this seems convenient for a user because they don’t have to worry about sorting their dataframe if they want to get only the unique rows. However, it has an insidious impact on performance.
Imagine if you were to call `distinct_count` and then `drop_duplicates`. The way these functions are currently implemented, each one performs its own sort of the input.
Instead, if `distinct_count` and `drop_duplicates` worked like `std::unique` and the user had to first sort the input, then only one sort would be needed. Alternatively, the data may already be sorted (as is the case with Python indexes), where no sort would be necessary.
The current behavior is very sub-optimal for performance, as it can require two redundant multi-column sorts. Multi-column sorts are among the most expensive operations in libcudf, so this is a bad thing.
(Furthermore, if the data isn’t already sorted, using a sort-based implementation is likely to be much slower than a hash-based implementation, so we should look at refactoring these implementations).
Describe the solution you’d like
I’d like to do two things:
- Update `drop_duplicates`/`distinct_count` to work like `std::unique`, i.e., only considering contiguous equivalent elements.
  - This would require the data be presorted to preserve the existing behavior. Note that this also requires the user to describe how the data is sorted, e.g., via `null_order` and such. We can look at `groupby` for an example of how we’ve handled this there.
- Add `unordered_drop_duplicates` and `unordered_distinct_count` that behave like the current APIs.
  - We should look at using a hash-based implementation for the `unordered_*` algorithms as it will likely be much faster.
Describe alternatives you’ve considered
Why not preserve the existing APIs and add `ordered_drop_duplicates` and `ordered_distinct_count`?
While this is certainly an option, I think that the behavior I described above would be more canonical for C++.
Why have both `unordered_*` and ordered versions?
If the data is already sorted, there’s a good chance a sort-based implementation could be faster than a hash-based one, but we can test that and see.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (16 by maintainers)
Commits related to this issue
- Support structs for `cudf::contains` with column/scalar input (#9929) This PR adds support for `cudf::contains` so we can check whether a structs column contains a scalar struct element. Partially... — committed to rapidsai/cudf by ttnghia 2 years ago
- Optimize compaction operations (#10030) Related to https://github.com/rapidsai/cudf/issues/9413. This PR adds `unordered_drop_duplicates`/`unordered_distinct_count` APIs by using hash-based algori... — committed to rapidsai/cudf by PointKernel 2 years ago
- Refactor stream compaction APIs (#10370) Closes https://github.com/rapidsai/cudf/issues/9413 Depending on https://github.com/rapidsai/cudf/pull/10387. There are several changes involved in this... — committed to rapidsai/cudf by PointKernel 2 years ago
As a promising anecdote, @gaohao95 has already experimented with a hash-based solution of `drop_duplicates` with cuCollections and reports it’s 100x faster than the current sort-based solution.

Let’s make sure to make our docs more clear. Understanding what our APIs do shouldn’t require inference from implementation details.
Pandas would return `[4, 0, 1, 2]`.