[BUG] cudf: groupby() super slow on branch-0.7
I found that groupby() is much slower on branch-0.7 than on branch-0.6. The timing output is attached below.
```python
df = DataFrame({'cat': df_v, 'cnt': cnt, 'id': id, 'cat1': df_v, 'doc_length': doc_length})
%timeit df1 = df.groupby(['cat1', 'cat', 'id', 'doc_length'], method='hash').count()
```
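The snippet above references variables (`df_v`, `cnt`, `id`, `doc_length`) that are not shown. A self-contained sketch with synthetic data (the sizes, dtypes, and cardinalities below are assumptions, not the reporter's actual data) might look like:

```python
import numpy as np
from cudf import DataFrame

# Synthetic stand-ins for the reporter's columns; sizes and dtypes are guesses.
n = 5_000_000
rng = np.random.default_rng(0)
df_v = rng.integers(0, 1_000, n)        # 'cat' / 'cat1' values
cnt = rng.integers(0, 100, n)           # 'cnt' values
ids = rng.integers(0, 10_000, n)        # 'id' values (renamed to avoid shadowing the builtin)
doc_length = rng.integers(1, 500, n)    # 'doc_length' values

df = DataFrame({'cat': df_v, 'cnt': cnt, 'id': ids, 'cat1': df_v, 'doc_length': doc_length})
df1 = df.groupby(['cat1', 'cat', 'id', 'doc_length'], method='hash').count()
```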
Timing output:
#branch-0.7
17 s ± 443 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#branch-0.6
34.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@MikeChenfu this issue should now be resolved on https://github.com/rapidsai/cudf/tree/release-0.7.2
Closing the issue. Please reopen if you’re still seeing the same performance issues.
@kovaltan is right; I had forgotten about the equivalence of signed and unsigned arithmetic under two's complement. I'm pretty sure that NVIDIA GPUs use two's complement; otherwise, compatibility of code with the CPU would be quite difficult. 😃 So this means we can fix the slow sum() aggregations for signed int64 too.
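For context, that equivalence can be illustrated with a small NumPy sketch (illustrative only, not cudf code): reinterpreting int64 data as uint64, summing with wraparound, and reinterpreting the result back yields the same bit pattern as the signed sum.

```python
import numpy as np

# Under two's complement, signed and unsigned 64-bit addition produce the same
# bit pattern, so an int64 sum can be computed with unsigned arithmetic.
s = np.array([-5, 7, -1_000_000_000_000], dtype=np.int64)

unsigned_sum = s.view(np.uint64).sum(dtype=np.uint64)  # wraps modulo 2**64
signed_sum = s.sum(dtype=np.int64)

assert int(unsigned_sum) == int(signed_sum) % 2**64
```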
After a fun game of `git bisect`, I narrowed the offending commit down to https://github.com/rapidsai/cudf/commit/542d960b2b16cbc31f51c2ef86267302e947bf65, which replaced the existing device atomic overloads previously used in the groupby implementation with the ones developed by @kovaltan. I suspect that a code path that was originally using a native `atomicAdd` is now using `atomicCAS` instead and is much slower as a result. This is the repro I've been using:

When timing with the Linux `time` command, a "good" run takes ~1.25 s of walltime and a "bad" run takes ~44 s.

CC @harrism
@MikeChenfu glad to hear the workaround worked well for you. I'm going to keep this open until we resolve the performance issues with MultiIndex.
@MikeChenfu Can you try passing `as_index=False` as another parameter to the groupby call? We're working on resolving the MultiIndex performance issues, and that should work around them for the time being.
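For reference, a sketch of the suggested call with the columns from the original report (assuming `df` is the same DataFrame as above):

```python
# as_index=False returns the grouping keys as regular columns instead of
# building a MultiIndex on the result, sidestepping the slow MultiIndex path.
df1 = df.groupby(['cat1', 'cat', 'id', 'doc_length'],
                 method='hash', as_index=False).count()
```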