cudf: [BUG] Intermittent failures in `groupby` cumulative scans when keys contain nulls.
Describe the bug
Running

```python
from itertools import count

import cudf
import dask_cudf
import numpy as np
import rmm
from cudf.testing._utils import assert_eq

if __name__ == "__main__":
    state = np.random.get_state()
    rmm.reinitialize(pool_allocator=False)
    for i in count():
        oldstate = np.random.get_state()
        size = 10_000
        gdf_original = cudf.DataFrame(
            {
                "xx": np.random.randint(0, 5, size=size),
                "x": np.random.normal(size=size),
                "y": np.random.normal(size=size),
            },
        )
        # insert nulls into the key column at random.
        gdf_original["xx"] = gdf_original.xx.mask(
            np.random.choice([True, False], size=gdf_original.xx.shape)
        )
        pdf = gdf_original.to_pandas(nullable=False)
        gdf = gdf_original
        gdf_grouped = gdf.groupby("xx")
        cudf_result = gdf_grouped.cumsum()
        # Although we don't look at the data, this seems pretty
        # crucial to provoking the issue.
        # Notice this never touches the gdf_original data! And the
        # computation is _done_ by the time we're here, so one
        # suspicion is that there is garbage left lying around for the
        # next iteration.
        ddf = dask_cudf.from_cudf(cudf.from_pandas(pdf), npartitions=5).persist()
        ddf_grouped = ddf.groupby("xx")
        dask_cudf_result = ddf_grouped.cumsum().compute(scheduler="sync")
        pandas_result = pdf.groupby("xx").cumsum()
        print(i)
        # This occasionally fails
        assert_eq(cudf_result, pandas_result, check_dtype=False)
        # This was for checking when we removed the reindex call in
        # groupby._mimic_pandas_order:
        # mask = ~pdf.xx.isna()
        # pandas_result_no_nulls = pandas_result.loc[mask]
        # assert_eq(cudf_result.sort_index(), pandas_result_no_nulls, check_dtype=False)
```
with a recent enough cudf nightly sometimes produces assertion errors in the correctness check. The failure manifests as a few entries in the grouped dataframe columns being marked NULL when they should not be. In post-mortem debugging, re-executing the offending code tends to produce the correct result. One normally has to run the script a few times, interrupting it, to see the failure.
This has sometimes caused the nightly actions to fail; the first such failure is https://github.com/rapidsai/cudf/actions/runs/5065561229/jobs/9094279670, which is the first nightly that contained #13372.
Investigation so far with @shwina provides the following information:
- #13389 fixed this code so that it would run at all, and introduced a `dataframe.reindex` call inside `_mimic_pandas_order` (a toy model of the expansion it performs follows this list). If we remove the `reindex` call by applying the patch below and uncomment the alternate error-checking code in the bug script, we do not see failures:

```diff
diff --git a/python/cudf/cudf/core/groupby/groupby.py b/python/cudf/cudf/core/groupby/groupby.py
index b7faed1dfc..e1a84897f4 100644
--- a/python/cudf/cudf/core/groupby/groupby.py
+++ b/python/cudf/cudf/core/groupby/groupby.py
@@ -13,6 +13,7 @@
 import numpy as np
 import pandas as pd
 import cudf
+import cudf._lib as libcudf
 from cudf._lib import groupby as libgroupby
 from cudf._lib.null_mask import bitmask_or
 from cudf._lib.reshape import interleave_columns
@@ -2290,12 +2291,12 @@ class GroupBy(Serializable, Reducible, Scannable):
             ri = cudf.RangeIndex(0, len(self.obj))
             result.index = cudf.Index(ordering)
             # This reorders and expands
-            result = result.reindex(ri)
+            # result = result.reindex(ri)
         else:
             # Just reorder according to the groupings
             result = result.take(ordering.argsort())
-        # Now produce the actual index we first thought of
-        result.index = self.obj.index
+        # Now produce the actual index we first thought of
+        result.index = self.obj.index
         return result
```
- `reindex` goes through `join` and hence `gather`. However, we also tried reimplementing the reordering using `scatter`, as in the patch below, and still observe the bug:

```diff
diff --git a/python/cudf/cudf/core/groupby/groupby.py b/python/cudf/cudf/core/groupby/groupby.py
index b7faed1dfc..64d9ef2a8f 100644
--- a/python/cudf/cudf/core/groupby/groupby.py
+++ b/python/cudf/cudf/core/groupby/groupby.py
@@ -13,6 +13,7 @@
 import numpy as np
 import pandas as pd
 import cudf
+import cudf._lib as libcudf
 from cudf._lib import groupby as libgroupby
 from cudf._lib.null_mask import bitmask_or
 from cudf._lib.reshape import interleave_columns
@@ -2287,10 +2288,18 @@ class GroupBy(Serializable, Reducible, Scannable):
             # Scan aggregations with null/nan keys put nulls in the
             # corresponding output rows in pandas, to do that here
             # expand the result by reindexing.
-            ri = cudf.RangeIndex(0, len(self.obj))
-            result.index = cudf.Index(ordering)
-            # This reorders and expands
-            result = result.reindex(ri)
+            null_result_columns = [
+                cudf.core.column.column_empty_like(
+                    c, masked=True, newsize=len(self.obj)
+                )
+                for c in result._data.columns
+            ]
+            new_result_columns = libcudf.copying.scatter(
+                [*result._data.columns], ordering, null_result_columns
+            )
+            result = result._from_columns_like_self(
+                new_result_columns, column_names=result._column_names
+            )
         else:
             # Just reorder according to the groupings
             result = result.take(ordering.argsort())
```
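For intuition, both variants above implement the same expansion. Here is a toy numpy model (hypothetical values, not cudf internals) of what that expansion should produce: scan results computed over the non-null-key rows land back at those rows' original positions, and every other slot stays null.

```python
import numpy as np

# Toy setup: 4 input rows, row 1 has a null key, so the scan produced
# only 3 output rows (in grouped order).
source = np.array([1.0, 4.0, 4.0])   # scan result over non-null-key rows
ordering = np.array([0, 2, 3])       # those rows' original positions
target = np.full(4, np.nan)          # stand-in for the all-null target

# The scatter variant writes source rows at `ordering` directly; the
# reindex variant reaches the same layout via a join and a gather.
target[ordering] = source
print(target)  # [ 1. nan  4.  4.]
```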
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (16 by maintainers)
That seems to do the trick for me, @davidwendt
I can confirm this appears to fix the bug for me, too. Let’s open a PR. @davidwendt Would you do the honors?
@shwina I wasn’t able to reproduce that RuntimeError with a single null, but @wence-'s reproducer that alternates null/valid does fail consistently after a small number of iterations (usually 2 or 4) for me.
Let’s consider the calling context for `single_lane_block_sum_reduce` (`invalid_if_n_kernel`):

While the `__syncthreads` call ensures that, within a single iteration of `invalid_if_n_kernel`, there is no data race between the write on line 98 and the read on line 106, I don’t think that is sufficient to rule out data races between iterations: without further (outside) synchronisation, the previous iteration’s read on line 106 can race with the current iteration’s write on line 98.

Applying this patch:
Makes the reproducer racecheck-clean, good! But doesn’t fix the original issue, bad!
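For illustration, here is a minimal numba.cuda sketch of the same synchronisation pattern — a hypothetical toy kernel, not the actual libcudf `single_lane_block_sum_reduce`: a per-iteration write/read pair guarded by a single barrier, with nothing separating one iteration's read from the next iteration's write.

```python
import numpy as np
from numba import cuda, int64

@cuda.jit
def racy_kernel(out, n_iters):
    # One shared slot, written by lane 0 and read by every thread in
    # the block; this mimics the write (line 98) / read (line 106) pair.
    buf = cuda.shared.array(1, dtype=int64)
    tid = cuda.threadIdx.x
    for it in range(n_iters):
        if tid == 0:
            buf[0] = it       # the per-iteration write
        cuda.syncthreads()    # orders write -> read *within* an iteration
        out[tid] += buf[0]    # the per-iteration read
        # Missing barrier here: the next iteration's write by lane 0 can
        # race with this read, which may still be in flight on other
        # threads. A second cuda.syncthreads() at the end of the loop
        # body closes the race.

out = np.zeros(32, dtype=np.int64)
racy_kernel[1, 32](out, 4)  # e.g. run under: compute-sanitizer --tool racecheck
```

Running a kernel like this under `compute-sanitizer --tool racecheck` flags the read/write pair, and adding the trailing barrier silences it — matching the behaviour described above.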