cudf: [FEA] Behavior of apply_rows on dataframes with missing values
I’m not sure if this is intended behavior or not. Right now it’s a little annoying to use apply_rows on dataframes with missing values. This is the current behavior:
>>> cudf.__version__
'0.8.0+0.g8fa7bd3.dirty'
>>> df = cudf.DataFrame({'a': [1., 2., None, 3.]})
>>> print(df)
     a
0  1.0
1  2.0
2
3  3.0
>>> def kernel(a, out):
...     for i, val in enumerate(a):
...         out[i] = val
...
>>> df = df.apply_rows(kernel, ['a'], {'out': float}, {})
>>> print(df)
     a       out
0  1.0       1.0
1  2.0       2.0
2            3.0
3  3.0  1.5e-323
You can see that the loop index in the kernel skipped the rows of `a` with missing values, but the same index applied to `out` did not skip those rows, so the `out` rows are out of order. To get what I really want, I have to first mask out the missing values from `a`, apply the kernel, then join the result back onto the original dataframe with some sort of merge. That is, I would want:
     a  out
0  1.0  1.0
1  2.0  2.0
2
3  3.0  3.0
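For illustration, the mask/apply/merge workaround described above can be sketched in plain Python, with `None` standing in for a missing value. This is only a model of the idea, not cudf code: `apply_skipping_nulls` is a hypothetical helper, and lists are used in place of GPU columns so the sketch runs anywhere.

```python
def kernel(a, out):
    # Same kernel as in the report: copy each input value to the output.
    for i, val in enumerate(a):
        out[i] = val


def apply_skipping_nulls(column, kernel):
    """Hypothetical helper: run the kernel on non-null rows, then scatter back."""
    # Remember where the non-null values live in the original column.
    valid_positions = [i for i, v in enumerate(column) if v is not None]
    # Mask out the missing values before calling the kernel.
    compacted = [column[i] for i in valid_positions]
    compacted_out = [None] * len(compacted)
    kernel(compacted, compacted_out)
    # "Merge" step: write each result back to its original row, leaving
    # rows that had a missing input as missing in the output.
    out = [None] * len(column)
    for pos, result in zip(valid_positions, compacted_out):
        out[pos] = result
    return out


a = [1.0, 2.0, None, 3.0]
out = apply_skipping_nulls(a, kernel)
print(out)  # [1.0, 2.0, None, 3.0]
```

The key point is that the original row positions are saved before compacting, so the output can be realigned afterwards.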
It would be nice if
1. The automatic masking out of missing values in the kernel also applied to the output columns.
2. The output column was initialized with missing values, rather than whatever happened to be in memory, or at least missing values were applied to `out` for the indices where `a` has missing values.
I don’t know how missing values are implemented in cudf, so it’s possible 2 might be considered too inefficient to do by default.
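To make the difference between the two requests concrete, here is a small plain-Python model of the output shown in the report. It assumes, based only on that output, that the kernel receives the input with missing values dropped while `out` keeps its full length. `run_as_reported` and the `fill` argument are hypothetical names, and none of this reflects cudf internals.

```python
def kernel(a, out):
    for i, val in enumerate(a):
        out[i] = val


def run_as_reported(column, kernel, fill):
    """Hypothetical model: compacted input, full-length output pre-filled with `fill`."""
    compacted = [v for v in column if v is not None]  # nulls dropped from the input
    out = [fill] * len(column)                        # output is not compacted
    kernel(compacted, out)
    return out


a = [1.0, 2.0, None, 3.0]

# Output starts as "whatever happened to be in memory" (value taken from the report).
uninitialised = run_as_reported(a, kernel, fill=1.5e-323)
print(uninitialised)  # [1.0, 2.0, 3.0, 1.5e-323]

# Output starts as missing values instead.
null_filled = run_as_reported(a, kernel, fill=None)
print(null_filled)  # [1.0, 2.0, 3.0, None]
```

Under this model, pre-filling with missing values removes the garbage in the last row, but 3.0 still lands in row 2 rather than row 3, so the output also needs to be aligned with the input mask.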
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (5 by maintainers)
@kkraus14 I think that’s the simplest solution. I’d rather filter out outputs we aren’t 100% sure are correct. To be 100% sure, we would need to refactor the input func before sending it through Numba. Let’s enforce an “any row with a null in any column will result in null in all output columns” rule, and make sure the documentation is clear on that behavior.
Sorry, should have read more closely, didn’t realize this was for dataframes with multiple columns as opposed to just a single column. That is absolutely trickier because we don’t know which columns have ops run against them, or whether the kernel is doing something crazier like conditionally using one column or another. For the time being I guess we could just do a bitwise logical OR of all of the input columns’ masks and set that for all of the output columns, but that doesn’t sound ideal.
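As a rough illustration of the rule discussed in these comments, the sketch below combines per-column null masks with a logical OR and applies the result to every output column. It is plain Python with `None` standing in for a null. `combined_null_mask` and `mask_outputs` are hypothetical names, and the `out_sum` values are made-up kernel results. If the masks are stored as validity bitmaps where a set bit means "present" (as in the Apache Arrow format), the equivalent operation would be a bitwise AND of the validity masks.

```python
def combined_null_mask(columns):
    """True for each row where at least one input column is null (OR of null masks)."""
    return [any(v is None for v in row) for row in zip(*columns)]


def mask_outputs(outputs, null_mask):
    """Set every output column to null on rows flagged in the combined mask."""
    return [
        [None if is_null else v for v, is_null in zip(col, null_mask)]
        for col in outputs
    ]


a = [1.0, None, 3.0, 4.0]
b = [10.0, 20.0, None, 40.0]

# Made-up kernel results; rows 1 and 2 hold placeholder values because
# one of their inputs was null.
out_sum = [11.0, 0.0, 0.0, 44.0]

mask = combined_null_mask([a, b])
print(mask)  # [False, True, True, False]

masked = mask_outputs([out_sum], mask)
print(masked)  # [[11.0, None, None, 44.0]]
```

This is conservative: an output row is nulled even if the kernel never read the column that was null on that row, which is the trade-off noted above.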