cudf: [BUG] Selection with Null and NA data

When filtering a dataframe which contains NaN values cuDF fails

cdf[cdf.passenger_count > 2]

The above fails if cdf contains Null values

In [1]: import cudf
!head
In [2]: !cat head-nyctaxi.csv
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount

1,2017-01-09 11:13:28,2017-01-09 11:25:45,1,3.30,1,N,263,161,1,12.5,0,0.5,2,0,0.3,15.3
1,2017-01-09 11:32:27,2017-01-09 11:36:01,1,.90,1,N,186,234,1,5,0,0.5,1.45,0,0.3,7.25
1,2017-01-09 11:38:20,2017-01-09 11:42:05,1,1.10,1,N,164,161,1,5.5,0,0.5,1,0,0.3,7.3
1,2017-01-09 11:52:13,2017-01-09 11:57:36,1,1.10,1,N,236,75,1,6,0,0.5,1.7,0,0.3,8.5
2,2017-01-01 00:00:00,2017-01-01 00:00:00,1,.02,2,N,249,234,2,52,0,0.5,0,0,0.3,52.8
1,2017-01-01 00:00:02,2017-01-01 00:03:50,1,.50,1,N,48,48,2,4,0.5,0.5,0,0,0.3,5.3
2,2017-01-01 00:00:02,2017-01-01 00:39:22,4,7.75,1,N,186,36,1,22,0.5,0.5,4.66,0,0.3,27.96
1,2017-01-01 00:00:03,2017-01-01 00:06:58,1,.80,1,N,162,161,1,6,0.5,0.5,1.45,0,0.3,8.75

In [3]: cdf = cudf.read_csv('head-nyctaxi.csv')

In [4]: cdf.head().to_pandas()
Out[4]:
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  trip_distance  RatecodeID  store_and_fwd_flag  PULocationID  DOLocationID  payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount
0       NaN                  NaT                   NaT               -1            NaN          -1                  -1            -1            -1            -1          NaN    NaN      NaN         NaN            -1                    NaN           NaN
1       1.0  2017-01-09 11:13:28   2017-01-09 11:25:45                1            3.3           1             2313200           263           161             1         12.5    0.0      0.5        2.00             0                    0.3         15.30
2       1.0  2017-01-09 11:32:27   2017-01-09 11:36:01                1            0.9           1             2313200           186           234             1          5.0    0.0      0.5        1.45             0                    0.3          7.25
3       1.0  2017-01-09 11:38:20   2017-01-09 11:42:05                1            1.1           1             2313200           164           161             1          5.5    0.0      0.5        1.00             0                    0.3          7.30
4       1.0  2017-01-09 11:52:13   2017-01-09 11:57:36                1            1.1           1             2313200           236            75             1          6.0    0.0      0.5        1.70             0                    0.3          8.50

In [5]: cdf[cdf.passenger_count > 2]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-cb090dcb433a> in <module>
----> 1 cdf[cdf.passenger_count > 2]

~/GitRepos/cudf/python/cudf/dataframe/dataframe.py in __getitem__(self, arg)
    241                 index = self.index.take(selinds.to_gpu_array())
    242                 for col in self._cols:
--> 243                     df[col] = Series(self._cols[col][arg], index=index)
    244                 df.set_index(index)
    245             else:

~/GitRepos/cudf/python/cudf/dataframe/series.py in __getitem__(self, arg)
    217             elif arg.dtype in [np.bool, np.bool_]:
    218                 selvals, selinds = columnops.column_select_by_boolmask(
--> 219                     self._column, arg)
    220                 index = self.index.take(selinds.to_gpu_array())
    221             else:

~/GitRepos/cudf/python/cudf/dataframe/columnops.py in column_select_by_boolmask(column, boolmask)
    107     """
    108     from .numerical import NumericalColumn
--> 109     assert column.null_count == 0  # We don't properly handle the boolmask yet
    110     boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())
    111     indices = cudautils.arange(len(boolmask))

AssertionError:

I briefly chatted with @kkraus14 about this and Null/NA values are special cased. Note: Pandas does not have a concept of Null values an instead uses np.nan.

@kkraus14 also suggested I look at what DBs do when filtering Nulls below is an example.
TLDR when filtering a column which contains Nulls, the null values are not included in any binary operation:

postgres=# select * from CUSTOMERS where salary > 1.0
postgres-# ;
 id |     name     | age |          address          | salary
----+--------------+-----+---------------------------+--------
  1 | James Dean   |  32 | Detroit                   | 200.00
  2 | James McGhan |  33 | Los Angeles               | 300.00
(2 rows)

DB Setup

conda install postgresql

initdb -D ~/tmp-postgres-data
postgres -D ~/tmp-postgres-data
createuser --superuser postgres


CREATE TABLE CUSTOMERS(
   ID   INT              NOT NULL,
   NAME VARCHAR (20)     NOT NULL,
   AGE  INT              NOT NULL,
   ADDRESS  CHAR (25) ,
   SALARY   DECIMAL (18, 2),
   PRIMARY KEY (ID)
);


INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY)
VALUES
 (1, 'James Dean', 32, 'Detroit', 200.0),
 (2, 'James McGhan', 33, 'Los Angeles', 300.0),
 (3, 'Empty Salary', 33, 'New York', NULL);

postgres=# select * from CUSTOMERS;
 id |     name     | age |          address          | salary
----+--------------+-----+---------------------------+--------
  1 | James Dean   |  32 | Detroit                   | 200.00
  2 | James McGhan |  33 | Los Angeles               | 300.00
  3 | Empty Salary |  33 | New York                  |
(3 rows)

postgres=# select * from CUSTOMERS where salary > 1.0
postgres-# ;
 id |     name     | age |          address          | salary
----+--------------+-----+---------------------------+--------
  1 | James Dean   |  32 | Detroit                   | 200.00
  2 | James McGhan |  33 | Los Angeles               | 300.00
(2 rows)

About this issue

Original URL
State: closed
Created 5 years ago
Comments: 16 (6 by maintainers)

Most upvoted comments

It’s starting to look like “better null support” might be a theme for 0.7

On Fri, Mar 1, 2019 at 2:58 PM Keith Kraus notifications@github.com wrote:

This is related to #902 https://github.com/rapidsai/cudf/pull/902 which will allow us to move the boolean masking functionality to the libcudf side.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rapidsai/cudf/issues/991#issuecomment-468840128, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszD9eLgAJxqMo9q6VBY2GTzpav54Vks5vSbCEgaJpZM4bDoXG .

mrocklin on Mar 1, 2019

I think that we can deviate from Pandas from time to time. Null handling is something that Pandas perhaps sees as a wart and they seem to be moving towards nullable columns in 0.24. I suggest that we defer to SQL semantics here if possible.

mrocklin on Mar 1, 2019

if cudf-py is trying to be a drop-in replacement for pandas then I would say it should do the pandas things

selections ignore None/Nulls
aggregations ignore None/ulls

In [4]: import pandas as pd 
   ...: df = pd.DataFrame({'x': [1, 2, None, 4], 'y': [1, 2, 3, 4]}) 
   ...: df[df.x > 2] 
   ...:                                                                                                                                                                                 
Out[4]: 
     x  y
3  4.0  4

In [5]: df.x.sum()                                                                                                                                                                      
Out[5]: 7.0

quasiben on Mar 1, 2019