cudf: [BUG] nan_as_null parameter affects output of sort_values.

Describe the bug

The nan_as_null parameter affects the output of sort_values.

Steps/Code to reproduce bug
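The snippets below assume numpy and cudf have already been imported in the IPython session:

import numpy as np
import cudf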

In [22]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=True)})

In [23]: print(df.sort_values(by='a'))
     a
5  0.0
1  1.0
3  2.0
0
2
4
In [19]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=False)})

In [20]: print(df.sort_values(by='a'))
     a
0  nan
1  1.0
2  nan
3  2.0
4  nan
5  0.0

Similar issues occur with other methods that use libcudf APIs, e.g. drop_duplicates (which relies on sorting):

df = cudf.DataFrame({'a': cudf.Series([1.0, np.nan, 0, np.nan, 0, 1], nan_as_null=False)})

In [10]: print(df)
     a
0  1.0
1  nan
2  0.0
3  nan
4  0.0
5  1.0

In [11]: print(df.drop_duplicates())
     a
0  1.0
1  nan
2  0.0
3  nan
4  0.0
5  1.0

Expected behavior

For sorting and drop_duplicates, NaN values should be considered equal to each other, so NaNs are grouped together when sorting and duplicate NaN rows are removed.
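For reference, pandas already behaves this way: sort_values groups NaNs together (at the end, with na_position='last' by default), and drop_duplicates treats NaNs as equal. A minimal pandas sketch of the expected semantics:

import numpy as np
import pandas as pd

pdf = pd.DataFrame({'a': [np.nan, 1.0, np.nan, 2.0, np.nan, 0.0]})

# NaNs are grouped at the end of the sorted output
print(pdf.sort_values(by='a'))

# NaNs compare equal for deduplication, so only one NaN row survives
print(pdf.drop_duplicates())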

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: from source

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 48 (46 by maintainers)

Most upvoted comments

@karthikeyann why do you have -0 larger than 0 in all of your examples?

0 and -0 compare equal; the output retains the input order (indicating the sort is stable).

NaN == NaN: use find_and_replace_all to replace NaNs with nulls and pass the nulls_are_equal=true flag for equality comparisons. NaN > +Inf: replace NaNs with nulls via find_and_replace_all and pass the nulls_are_smallest=false flag for sorting.
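A rough Python-level sketch of that idea, assuming the nans_to_nulls conversion exposed by cuDF (the nulls_are_equal / nulls_are_smallest flags themselves live in libcudf and are not shown here):

import numpy as np
import cudf

s = cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=False)

# Normalize NaNs to nulls so the existing null-handling flags
# (nulls_are_equal, nulls_are_smallest) apply to them uniformly.
s_null = s.nans_to_nulls()

# Former NaNs now sort together and compare equal for deduplication.
print(s_null.sort_values())
print(s_null.drop_duplicates())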

Note that Spark and Pandas have different semantics with respect to NaNs. In Spark, NaNs and nulls are not semantically equivalent: NaN is a value that is not a number, whereas null is not a value at all.

Therefore I don’t see how replacing NaNs with nulls would be useful for Spark. It would lead to false positives in equality comparisons when columns containing NaNs are compared with columns containing nulls, and to incorrect bucketing of NaNs with nulls in sorting and groupby operations. I only see find_and_replace_all being useful to normalize values, e.g. transforming -0 --> 0 and -NaN --> NaN, before calling the equality/sorting/grouping functions when normalization is desired.
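To make the value-vs-absence distinction concrete, a small cuDF Python illustration (exact output may vary by version):

import numpy as np
import cudf

# NaN is a floating-point value; null marks the absence of a value.
with_nan = cudf.Series([np.nan, 1.0], nan_as_null=False)
with_null = cudf.Series([None, 1.0], dtype='float64')

print(with_nan.null_count)   # 0 -- the NaN is kept as a value
print(with_null.null_count)  # 1

# Blanket-converting NaNs to nulls erases that distinction:
print(with_nan.nans_to_nulls().null_count)  # 1 -- now indistinguishable from with_null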

We currently have both find_and_replace_all and, as of 0.9, nans_to_nulls.

Is it OK to require the end user (e.g. cuDF/Pandas/Spark user) to insert a conversion pass when they know they need it? I can imagine a data scientist getting wrong results and realizing they have NaN entries. They could run nans_to_nulls and then re-run to get correct results. But is that acceptable?

Or would cuDF Python and cuDF Spark be required to insert nans_to_nulls calls on every floating point input column to algorithms that compare values (sort, unique, etc.)?

I think this is the key question.
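For concreteness, a hypothetical sketch of what such an automatic pass could look like in the cuDF Python layer (the helper name _normalize_nans is illustrative, not an existing API):

import cudf

def _normalize_nans(df):
    # Hypothetical pre-pass: convert NaNs to nulls in every floating-point
    # column before calling a comparison-based algorithm (sort, unique, ...).
    out = df.copy()
    for name, col in out.items():
        if col.dtype.kind == 'f':
            out[name] = col.nans_to_nulls()
    return out

# Usage sketch: normalize once, then sort or drop duplicates on the result.
# sorted_df = _normalize_nans(df).sort_values(by='a')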