pandas: Missing values proposal: concrete steps for 1.0

Updated with a to-do list:

Implement pd.NA scalar -> https://github.com/pandas-dev/pandas/pull/29597
Basic BooleanArray -> https://github.com/pandas-dev/pandas/pull/29555
Use pd.NA in BooleanArray -> https://github.com/pandas-dev/pandas/pull/29961
- Implement kleene-logic in logical ops on BooleanArray -> https://github.com/pandas-dev/pandas/pull/29842
- Update the behaviour of any/all reductions with skipna=False (https://github.com/pandas-dev/pandas/issues/29686) -> https://github.com/pandas-dev/pandas/pull/30062
Use BooleanArray in comparison ops of StringArray -> https://github.com/pandas-dev/pandas/pull/30231
Use pd.NA in IntegerArray -> https://github.com/pandas-dev/pandas/pull/29964
Use BooleanArray as the return value for logical ops in IntegerArray -> https://github.com/pandas-dev/pandas/pull/29964
Enable boolean indexing with BooleanArray (https://github.com/pandas-dev/pandas/issues/28778/) -> https://github.com/pandas-dev/pandas/pull/30308
Use BooleanArray as the return value for boolean .str methods. -> https://github.com/pandas-dev/pandas/pull/30239
Implement NA.__array_ufunc__ -> https://github.com/pandas-dev/pandas/pull/30245
Base class for IntegerArray & BooleanArray -> https://github.com/pandas-dev/pandas/pull/30789
Ensure everything is properly documented
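
As a reading aid for the Kleene-logic item in the list above, here is a minimal pure-Python sketch of three-valued "and"/"or". It is only an illustration of the semantics, not the pandas implementation: `None` stands in for pd.NA, and the helper names `kleene_or` and `kleene_and` are hypothetical.

```python
# Illustration only: None is a stand-in for pd.NA, and these helpers are
# hypothetical names, not part of pandas.
NA = None


def kleene_or(a, b):
    """Three-valued OR: a known True decides the result, otherwise NA propagates."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def kleene_and(a, b):
    """Three-valued AND: a known False decides the result, otherwise NA propagates."""
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True


# Print the full truth table for both operations.
for a in (True, False, NA):
    for b in (True, False, NA):
        print(f"{a!s:>5} {b!s:>5}  or={kleene_or(a, b)!s:>5}  and={kleene_and(a, b)!s:>5}")
```

The point of this logic is that a missing operand only makes the result missing when the other operand cannot decide it on its own: `True | NA` is `True` and `False & NA` is `False`, while `False | NA` and `True & NA` stay missing.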

Original issue:

Issue to discuss the implementation strategy for https://github.com/pandas-dev/pandas/issues/28095. I am opening a new issue because the other one already has a lot of discussion, and I propose to keep this one focused on the practical aspects of how to implement this (regardless of certain aspects of the NA proposal, such as a single NA vs. dtype-specific NAs; for that, I will post a summary of the discussion on #28095 tomorrow).

I would like to propose the following way forward:

In the short term (ideally for 1.0):

Implement and provide the pd.NA scalar now, and recognize it in the appropriate places as a missing value (e.g. pd.isna). This way, it can already be used in external ExtensionArrays.
Implement a BooleanArray with support for missing values and appropriate NA behaviour. To start, we can just use a numpy masked array approach (similar to the existing IntegerArray), not involving any pyarrow memory optimizations.
Start using this BooleanArray as the boolean result of comparison operations for IntegerArray/StringArray (breaking change for nullable integers)
- Other arrays will keep using the numpy bool. This means we will have two “boolean” dtypes side by side with different behaviour, and which one you get depends on the original data type (potentially confusing for users).
Start using pd.NA as the missing value indicator for Integer/String/BooleanArray (breaking change for nullable integers)
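
To make the "numpy masked array approach" in the steps above more concrete, here is a rough sketch of the idea: one numpy array holds the values and a second boolean array marks which positions are missing. The helpers `from_sequence`, `masked_equals` and `to_objects` are hypothetical names for illustration, not the pandas API, and `None` stands in for pd.NA.

```python
import numpy as np


def from_sequence(values, dtype):
    """Split a Python sequence into (data, mask); mask is True where a value is missing."""
    mask = np.array([v is None for v in values], dtype=bool)
    # Missing slots get a placeholder value that is hidden by the mask.
    data = np.array([0 if v is None else v for v in values], dtype=dtype)
    return data, mask


def masked_equals(data, mask, scalar):
    """Compare against a scalar; the result is again a (bool data, mask) pair."""
    result = data == scalar
    # Positions that were missing in the input stay missing in the result.
    return result, mask.copy()


def to_objects(data, mask):
    """Render a (data, mask) pair as Python objects, with None for missing slots."""
    return [None if m else d.item() for d, m in zip(data, mask)]


ints, ints_mask = from_sequence([1, None, 3], "int64")
eq, eq_mask = masked_equals(ints, ints_mask, 1)
print(to_objects(eq, eq_mask))  # [True, None, False]
```

In this sketch the comparison result carries its own mask, which is the property that a plain numpy bool array cannot offer and that motivates returning a nullable boolean array from comparisons.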

In the intermediate term (after 1.0):

Investigate whether it can be implemented optionally for other data types and “activated”, so that users can opt in for existing dtypes (to be further thought out).

I think the main discussion point is whether we are OK with such a breaking change for IntegerArray. I would personally do this: IntegerArray was only introduced recently, is still regarded as experimental, and is the perfect use case for those changes. But it is certainly a clear, backwards-incompatible breaking change.

cc @pandas-dev/pandas-core

About this issue

State: closed
Created 5 years ago
Comments: 33 (31 by maintainers)

Commits related to this issue

StringArray comparisions return BooleanArray xref https://github.com/pandas-dev/pandas/issues/29556 — committed to TomAugspurger/pandas by TomAugspurger 5 years ago
StringArray comparisions return BooleanArray (#30231) xref https://github.com/pandas-dev/pandas/issues/29556 — committed to pandas-dev/pandas by TomAugspurger 5 years ago
StringArray comparisions return BooleanArray (#30231) xref https://github.com/pandas-dev/pandas/issues/29556 — committed to proost/pandas by TomAugspurger 5 years ago

Most upvoted comments

It’s indeed already an optimization compared to an object-dtype array of True/False/None. But I think Jeff meant the potential optimization of using pyarrow’s boolean array storage, which can give a large additional memory saving (8x compared to the double numpy array):

In [3]: mask = np.array([True] + [False] * 1000) 

In [4]: 2 * mask.nbytes 
Out[4]: 2002

In [5]: import pyarrow

In [6]: pyarrow.array(mask, mask=mask).nbytes
Out[6]: 252

But anyway, as I argued above, let’s leave this for the future, and focus now on the API using numpy arrays under the hood (that use some more memory).
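
The numbers in the session above can be reproduced with simple arithmetic. This is a sketch under one assumption: the bit-packed layout stores one bit per value in each of two buffers (values and validity), rounded up to whole bytes, and ignores any extra padding or alignment an allocator might add. The function names are hypothetical.

```python
def two_byte_buffers(n):
    """Bytes for a numpy bool values array plus a numpy bool mask (1 byte per element each)."""
    return 2 * n


def two_bit_buffers(n):
    """Bytes for bit-packed values plus a bit-packed validity buffer (1 bit per element each)."""
    bytes_per_buffer = (n + 7) // 8  # round up to a whole number of bytes
    return 2 * bytes_per_buffer


n = 1001  # length of the mask in the session above
print(two_byte_buffers(n))  # 2002
print(two_bit_buffers(n))   # 252
print(round(two_byte_buffers(n) / two_bit_buffers(n), 2))  # 7.94
```

The ratio approaches 8 for long arrays, which matches the "8x" figure quoted above.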

jorisvandenbossche on Nov 12, 2019