cudf: [FEA] Groupby MIN/MAX with NaN values does not match what Spark expects

Describe the bug Running a groupby min aggregate on a table returns the NaN value as its raw long value instead of the literal "NaN" that it returns for the other aggregates. I haven't gotten around to writing a unit test for this, but I can do so if required.

Steps/Code to reproduce bug Create the following table

scala> spark.sql("select * from floatsAndDoubles").show
+-----+------+
|float|double|
+-----+------+
|  NaN|   NaN|
| 1.02|   NaN|
|  NaN|   4.5|
+-----+------+
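
For reproducibility, one way to build this table in spark-shell (a sketch; the temp view name floatsAndDoubles matches the queries in this issue):

scala> import spark.implicits._
import spark.implicits._

scala> Seq((Float.NaN, Double.NaN), (1.02f, Double.NaN), (Float.NaN, 4.5))
     |   .toDF("float", "double")
     |   .createOrReplaceTempView("floatsAndDoubles")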

Running a groupby aggregate(min) on the double column, grouped by the float column, produces the following table:

+----------+-----------+
| float    |min(double)|
+----------+-----------+
| 1.020000 | 179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 |
| NaN      | 4.500000  |
+----------+-----------+
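
(The fixed-point number above appears to be Double.MaxValue printed out in full, rather than a readable NaN; in the Scala REPL:)

scala> Double.MaxValue
res0: Double = 1.7976931348623157E308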

Expected behavior It should produce the following output, matching Spark:

scala> spark.sql("select float, min(double) from floatsAndDoubles group by float").show
+-----+-----------+
|float|min(double)|
+-----+-----------+
| 1.02|        NaN|
|  NaN|        4.5|
+-----+-----------+
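
This matches Spark's documented NaN semantics: NaN equals NaN for grouping purposes, and NaN sorts greater than any other numeric value (the same ordering as java.lang.Double.compare). So min(NaN, 4.5) is 4.5, while a group containing only NaNs yields NaN. A quick illustration in the Scala REPL:

scala> java.lang.Double.compare(Double.NaN, 4.5)
res0: Int = 1

Since NaN compares greater than 4.5, the min over the NaN group correctly picks 4.5.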

Additional context For comparison, here is what a groupby aggregate(sum) produces in cudf:

+------+-----+
|float | sum |
+------+-----+
| 1.02 | NaN |
| NaN  | NaN |
+------+-----+
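
That difference follows from IEEE 754: arithmetic on NaN propagates NaN, so sum returns NaN whenever a NaN is present, whereas ordered comparisons against NaN are always false, so a comparison-based min/max reduction can silently skip NaN inputs unless it checks for them explicitly. For example:

scala> Double.NaN + 4.5
res0: Double = NaN

scala> Double.NaN < 4.5
res1: Boolean = false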

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 19 (17 by maintainers)

Most upvoted comments

The example was very confusing until I realized you were describing a groupby aggregation.