cudf: [BUG] Selection with Null and NA data
When filtering a dataframe which contains NaN values cuDF fails
cdf[cdf.passenger_count > 2]
The above fails if cdf contains Null values
In [1]: import cudf
!head
In [2]: !cat head-nyctaxi.csv
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
1,2017-01-09 11:13:28,2017-01-09 11:25:45,1,3.30,1,N,263,161,1,12.5,0,0.5,2,0,0.3,15.3
1,2017-01-09 11:32:27,2017-01-09 11:36:01,1,.90,1,N,186,234,1,5,0,0.5,1.45,0,0.3,7.25
1,2017-01-09 11:38:20,2017-01-09 11:42:05,1,1.10,1,N,164,161,1,5.5,0,0.5,1,0,0.3,7.3
1,2017-01-09 11:52:13,2017-01-09 11:57:36,1,1.10,1,N,236,75,1,6,0,0.5,1.7,0,0.3,8.5
2,2017-01-01 00:00:00,2017-01-01 00:00:00,1,.02,2,N,249,234,2,52,0,0.5,0,0,0.3,52.8
1,2017-01-01 00:00:02,2017-01-01 00:03:50,1,.50,1,N,48,48,2,4,0.5,0.5,0,0,0.3,5.3
2,2017-01-01 00:00:02,2017-01-01 00:39:22,4,7.75,1,N,186,36,1,22,0.5,0.5,4.66,0,0.3,27.96
1,2017-01-01 00:00:03,2017-01-01 00:06:58,1,.80,1,N,162,161,1,6,0.5,0.5,1.45,0,0.3,8.75
In [3]: cdf = cudf.read_csv('head-nyctaxi.csv')
In [4]: cdf.head().to_pandas()
Out[4]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 NaN NaT NaT -1 NaN -1 -1 -1 -1 -1 NaN NaN NaN NaN -1 NaN NaN
1 1.0 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.3 1 2313200 263 161 1 12.5 0.0 0.5 2.00 0 0.3 15.30
2 1.0 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.9 1 2313200 186 234 1 5.0 0.0 0.5 1.45 0 0.3 7.25
3 1.0 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.1 1 2313200 164 161 1 5.5 0.0 0.5 1.00 0 0.3 7.30
4 1.0 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.1 1 2313200 236 75 1 6.0 0.0 0.5 1.70 0 0.3 8.50
In [5]: cdf[cdf.passenger_count > 2]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-5-cb090dcb433a> in <module>
----> 1 cdf[cdf.passenger_count > 2]
~/GitRepos/cudf/python/cudf/dataframe/dataframe.py in __getitem__(self, arg)
241 index = self.index.take(selinds.to_gpu_array())
242 for col in self._cols:
--> 243 df[col] = Series(self._cols[col][arg], index=index)
244 df.set_index(index)
245 else:
~/GitRepos/cudf/python/cudf/dataframe/series.py in __getitem__(self, arg)
217 elif arg.dtype in [np.bool, np.bool_]:
218 selvals, selinds = columnops.column_select_by_boolmask(
--> 219 self._column, arg)
220 index = self.index.take(selinds.to_gpu_array())
221 else:
~/GitRepos/cudf/python/cudf/dataframe/columnops.py in column_select_by_boolmask(column, boolmask)
107 """
108 from .numerical import NumericalColumn
--> 109 assert column.null_count == 0 # We don't properly handle the boolmask yet
110 boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())
111 indices = cudautils.arange(len(boolmask))
AssertionError:
I briefly chatted with @kkraus14 about this and Null/NA values are special cased. Note: Pandas does not have a concept of Null values an instead uses np.nan.
@kkraus14 also suggested I look at what DBs do when filtering Nulls below is an example.
TLDR when filtering a column which contains Nulls, the null values are not included in any binary operation:
postgres=# select * from CUSTOMERS where salary > 1.0
postgres-# ;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
(2 rows)
DB Setup
conda install postgresql
initdb -D ~/tmp-postgres-data
postgres -D ~/tmp-postgres-data
createuser --superuser postgres
CREATE TABLE CUSTOMERS(
ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR (25) ,
SALARY DECIMAL (18, 2),
PRIMARY KEY (ID)
);
INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY)
VALUES
(1, 'James Dean', 32, 'Detroit', 200.0),
(2, 'James McGhan', 33, 'Los Angeles', 300.0),
(3, 'Empty Salary', 33, 'New York', NULL);
postgres=# select * from CUSTOMERS;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
3 | Empty Salary | 33 | New York |
(3 rows)
postgres=# select * from CUSTOMERS where salary > 1.0
postgres-# ;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
(2 rows)
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 16 (6 by maintainers)
It’s starting to look like “better null support” might be a theme for 0.7
On Fri, Mar 1, 2019 at 2:58 PM Keith Kraus notifications@github.com wrote:
I think that we can deviate from Pandas from time to time. Null handling is something that Pandas perhaps sees as a wart and they seem to be moving towards nullable columns in 0.24. I suggest that we defer to SQL semantics here if possible.
if cudf-py is trying to be a drop-in replacement for pandas then I would say it should do the pandas things