[BUG] cudf: groupby() super slow on branch-0.7

I found that groupby() is much slower on branch-0.7 than on branch-0.6. The timing output is attached below.

from cudf import DataFrame  # df_v, cnt, id, doc_length: data defined elsewhere in the user's script
df = DataFrame({'cat': df_v, 'cnt': cnt, 'id': id, 'cat1': df_v, 'doc_length': doc_length})
%timeit df1 = df.groupby(['cat1', 'cat', 'id', 'doc_length'], method='hash').count()

Timing output

#branch-0.7
17 s ± 443 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#branch-0.6
34.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 35 (16 by maintainers)

Most upvoted comments

@MikeChenfu this issue should now be resolved on https://github.com/rapidsai/cudf/tree/release-0.7.2

Closing the issue. Please reopen if you’re still seeing the same performance issues.

@kovaltan is right, I had forgotten about the equivalence of signed and unsigned arithmetic with two’s complement. I’m pretty sure that NVIDIA GPUs use two’s complement. Otherwise compatibility of code with the CPU would be quite difficult. 😃 So this means we can fix the slow sum() aggregations for signed int64 too.
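
To make that equivalence concrete, here is a small NumPy illustration (mine, not from the thread; the arrays are arbitrary): under two's complement, adding the same bit patterns as signed or as unsigned values produces identical result bits, which is what allows a signed 64-bit sum to be built on the natively supported unsigned atomicAdd.

import numpy as np

a = np.array([-5, 7, -1], dtype=np.int64)
b = np.array([3, -9, -1], dtype=np.int64)

signed_sum = a + b
# Reinterpret the same 64 bits as unsigned, add (wrapping), reinterpret back
unsigned_sum = (a.view(np.uint64) + b.view(np.uint64)).view(np.int64)

assert (signed_sum == unsigned_sum).all()  # bit-for-bit identical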

After a fun game of git bisect, I narrowed down the offending commit to https://github.com/rapidsai/cudf/commit/542d960b2b16cbc31f51c2ef86267302e947bf65, which replaced the existing device atomic overloads previously used in the groupby implementation with those developed by @kovaltan. I suspect that a code path that was originally using a native atomicAdd is now using atomicCAS instead and is much slower as a result.

This is the repro I’ve been using:

import numpy as np
import cudf

size = 5373090
min_int = np.iinfo('int32').min
max_int = np.iinfo('int32').max

keith_gdf = cudf.DataFrame()
keith_gdf['col1'] = np.random.randint(0, 2, size=size, dtype='int32')
keith_gdf['col2'] = np.random.randint(max_int, size=size, dtype='int32')
keith_gdf['col3'] = np.random.randint(max_int, size=size, dtype='int32')

output = keith_gdf.groupby('col1').count()
print(output)

When timing with the Linux time command, a “good” run takes ~1.25s walltime and a “bad” run takes ~44s walltime.

CC @harrism

@MikeChenfu glad to hear the workaround worked well for you. I’m going to keep this open until we resolve the MultiIndex performance issues.

@MikeChenfu Can you try passing as_index=False as another parameter to the groupby call? We’re working on resolving the MultiIndex performance issues, and that should work around it for the time being.
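
For reference, a minimal sketch of that workaround (the data here is illustrative; only the as_index=False argument is the point):

import numpy as np
import cudf

gdf = cudf.DataFrame()
gdf['col1'] = np.random.randint(0, 2, size=1000, dtype='int32')
gdf['col2'] = np.random.randint(100, size=1000, dtype='int32')

# as_index=False keeps the grouping keys as regular columns instead of
# building a MultiIndex result, sidestepping the slow MultiIndex path
output = gdf.groupby('col1', as_index=False).count()
print(output)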