cudf: [BUG] nan_as_null parameter affects output of sort_values.
Describe the bug
nan_as_null parameter affects output of sort_values.
Steps/Code to reproduce bug
In [22]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=True)})
In [23]: print(df.sort_values(by='a'))
a
5 0.0
1 1.0
3 2.0
0
2
4
In [19]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=False)})
In [20]: print(df.sort_values(by='a'))
a
0 nan
1 1.0
2 nan
3 2.0
4 nan
5 0.0
Similar issues occur with other methods that use libcudf APIs, e.g. drop_duplicates (which uses sorting):
df = cudf.DataFrame({'a': cudf.Series([1.0, np.nan, 0, np.nan, 0, 1], nan_as_null=False)})
In [10]: print(df)
a
0 1.0
1 nan
2 0.0
3 nan
4 0.0
5 1.0
In [11]: print(df.drop_duplicates())
a
0 1.0
1 nan
2 0.0
3 nan
4 0.0
5 1.0
Expected behavior
For sorting and drop_duplicates, NaN values should be considered equal to each other (a pandas reference sketch follows below).
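For reference, a minimal pandas sketch of the expected semantics on the same data (this is an illustration of pandas behavior, not cuDF code):

import numpy as np
import pandas as pd

pdf = pd.DataFrame({'a': [np.nan, 1.0, np.nan, 2.0, np.nan, 0.0]})
print(pdf.sort_values(by='a'))   # 0.0, 1.0, 2.0 first, NaNs grouped at the end
print(pdf.drop_duplicates())     # keeps rows 0, 1, 3, 5 -- NaNs treated as equal

pdf2 = pd.DataFrame({'a': [1.0, np.nan, 0.0, np.nan, 0.0, 1.0]})
print(pdf2.drop_duplicates())    # keeps rows 0, 1, 2 -- duplicate 1.0, NaN, 0.0 dropped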
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of cuDF install: from source
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 48 (46 by maintainers)
0 and -0 compare equal. The output retains input order (indicating a stable sort).
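(A quick illustration of the 0/-0 point, using plain Python floats:)

print(0.0 == -0.0)   # True -- IEEE 754 defines -0 and 0 as comparing equal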
Note that Spark and Pandas have different semantics with respect to NaNs. In Spark, NaNs and nulls are not semantically equivalent: NaN is a value that is not a number, while null is not a value at all.
Therefore I don't see how replacing NaNs with nulls would be useful for Spark. This would lead to false positives in equality comparisons when columns containing NaNs are compared with columns containing nulls, and to incorrect bucketing of NaNs with nulls in sorting and groupby operations. I only see find_and_replace_all being useful to normalize values, e.g. transforming -0 --> 0 and -NaN --> NaN, before calling the functions for equality/sorting/grouping when normalization is desired. We currently have both find_and_replace_all and, as of 0.9, nans_to_nulls.
Is it OK to require the end user (e.g. a cuDF/Pandas/Spark user) to insert a conversion pass when they know they need it? I can imagine a data scientist getting wrong results and realizing they have NaN entries. They could run nans_to_nulls and then re-run to get correct results. But is that acceptable?
Or would cuDF Python and cuDF Spark be required to insert nans_to_nulls calls on every floating-point input column to algorithms that compare values (sort, unique, etc.)? I think this is the key question.
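To make that trade-off concrete, here is a minimal sketch of the manual conversion pass in cuDF Python, assuming the Series-level nans_to_nulls() mentioned above is what the user would call:

import numpy as np
import cudf

df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=False)})
# Manual normalization pass: convert NaN entries to nulls before any
# comparison-based operation (sort, unique, drop_duplicates, ...).
df['a'] = df['a'].nans_to_nulls()
print(df.sort_values(by='a'))   # non-null values sorted, nulls grouped together
print(df.drop_duplicates())     # duplicates now collapse under libcudf's null semantics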