cudf: [FEA] Groupby MIN/MAX with NaN values does not match what Spark expects
Describe the bug Running a min aggregate on a table returns the NaN value as a large finite number (the output below is Double.MAX_VALUE written out in full) instead of the literal NaN that the other aggregates return. I haven't gotten around to writing a unit test for this, but can do so if required.
Steps/Code to reproduce bug Create the following table (one way to build it is sketched after the output below):
scala> spark.sql(""select * from floatsAndDoubles"").show
+-----+------+
|float|double|
+-----+------+
| NaN| NaN|
| 1.02| NaN|
| NaN| 4.5|
+-----+------+
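For reproduction, here is a minimal sketch of one way to build that table. It assumes a running spark-shell (so the spark session and its implicits are in scope); the view name floatsAndDoubles matches the queries in this issue:

```scala
// Build a small DataFrame with NaN values in both columns and register it
// as the temp view queried below. Float.NaN / Double.NaN are the standard
// IEEE 754 NaN constants.
import spark.implicits._

val df = Seq(
  (Float.NaN, Double.NaN),
  (1.02f,     Double.NaN),
  (Float.NaN, 4.5)
).toDF("float", "double")

df.createOrReplaceTempView("floatsAndDoubles")
```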
Running a groupby aggregate (min on the double column, grouped by float) results in the following table:
+----------+-----------+
| float |min(double)|
+----------+-----------+
| 1.020000 | 179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 |
| NaN | 4.500000 |
+----------+-----------+
Expected behavior It should output this
scala> spark.sql(""select float, min(double) from floatsAndDoubles group by float"").show
+-----+-----------+
|float|min(double)|
+-----+-----------+
| 1.02| NaN|
| NaN| 4.5|
+-----+-----------+
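The expected result follows from Spark treating NaN as greater than any other floating-point value when ordering, so min returns NaN only for a group whose values are all NaN. A quick plain-Scala illustration of that total ordering (java.lang.Double.compare implements the same NaN-is-largest rule):

```scala
// java.lang.Double.compare puts NaN after every other value, matching
// Spark's ordering for floating-point columns.
val cmp = java.lang.Double.compare(Double.NaN, 4.5)   // positive: NaN > 4.5

// Under that ordering, min over the float=NaN group {NaN, 4.5} is 4.5,
// while min over the float=1.02 group {NaN} is NaN.
val groupMin = Seq(Double.NaN, 4.5).reduce { (a, b) =>
  if (java.lang.Double.compare(a, b) <= 0) a else b
}   // groupMin == 4.5
```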
Additional context For comparison, here is what a groupby aggregate (sum) does in cudf:
+------+-----+
|float | sum |
+------+-----+
| 1.02 | NaN |
| NaN | NaN |
+------+-----+
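The sum table above presumably comes from the analogous query (the issue does not show the exact statement, so this is a guess):

scala> spark.sql("select float, sum(double) from floatsAndDoubles group by float").show

sum is consistent with Spark here because any addition involving NaN yields NaN, so both the all-NaN float=1.02 group and the {NaN, 4.5} float=NaN group report NaN.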
About this issue
- State: open
- Created 4 years ago
- Comments: 19 (17 by maintainers)
The example was very confusing until I realized you were describing a groupby aggregation.