pandas: DISCUSS: boolean dtype with missing value support

Part of the discussion on missing value handling in https://github.com/pandas-dev/pandas/issues/28095, detailed proposal at https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB.

If we go for a new NA value, we also need to decide the behaviour of this value in comparison operations, and consequently the behaviour of boolean values with missing data in logical and indexing operations.
So let’s use this issue for that part of the discussion.

Some aspects of this:

  • Behaviour in comparison operations: currently np.nan compares unequal (value == np.nan -> False, value > np.nan -> False), but we could also propagate missing values (value == NA -> NA, …)
  • Behaviour in logical operations: currently we always return False for | or & with missing data. But we could also use a “three-valued logic” like Julia and SQL (this gives, e.g., NA | True = True and NA & True = NA).
  • Behaviour in indexing: currently you cannot do boolean indexing with a boolean series that has missing values (which is object dtype right now). Do we want to change this? For example, interpret NA as False (i.e. don’t select it)? (TODO: should check how other languages do this)
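To make the “three-valued logic” option above concrete, here is a minimal sketch of Kleene logic in plain Python, using None as a stand-in for NA. The function names are hypothetical, purely for illustration:

```python
# Minimal sketch of Kleene ("three-valued") logic; None stands in for NA.
# The key idea: NA only propagates when the missing value could change
# the result.

def kleene_and(a, b):
    # False & anything is False, even NA & False (the result is known
    # regardless of what the missing value is); NA & True stays NA.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def kleene_or(a, b):
    # True | anything is True, even NA | True; NA | False stays NA.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

print(kleene_or(None, True))    # True
print(kleene_and(None, True))   # None
print(kleene_and(None, False))  # False
```

Note how this differs from the current behaviour described above, where any operation involving a missing value returns False.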

Julia has a nice documentation page explaining how it supports missing values; the ideas above largely match that.

Besides those behavioural API discussions, we also need to decide how to approach this technically (a boolean ExtensionArray holding a boolean numpy array plus a mask for missing values?). Shall we discuss that here as well, or keep it separate?
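The “values + mask” representation mentioned in the question above can be sketched with plain NumPy arrays (this is only an illustration of the idea, not the eventual ExtensionArray implementation):

```python
# Sketch of the "boolean numpy array + mask" idea: two parallel arrays,
# where the mask marks which entries are missing. Illustrative only.
import numpy as np

values = np.array([True, False, True])   # payload (value under an NA slot is arbitrary)
mask = np.array([False, True, False])    # True marks a missing entry

# A comparison that propagates NA only has to carry the mask along:
other = np.array([True, True, True])
result_values = values == other
result_mask = mask.copy()  # missing input -> missing output

print(result_values.tolist())  # [True, False, True]
print(result_mask.tolist())    # [False, True, False]
```

One appeal of this layout is that the actual operation stays a plain vectorized numpy operation, with missingness handled separately via the mask.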

cc @pandas-dev/pandas-core

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 23 (22 by maintainers)

Most upvoted comments

Leaving implementation complexities aside, I am not sure we would actually want such propagation of missing values. My plan was to post a summary of our chat discussion here, but I never got to it… sorry about that. But I seem to remember that propagation was actually the least favoured of the three options (our notes also don’t say much …).

I think the main take-away of the discussion was that there is no clear “best” option. Raising an error is the most “conservative”, in the sense that we will never do the wrong thing automatically, and the user always needs to specify with fillna(True/False) what they want. This is very explicit, but can also get annoying if in 95% of the cases you always want fillna(False). Skipping the NAs (i.e. doing an implicit fillna(False)) might be what most people want or expect most of the time. But it is less explicit, and when you didn’t expect it or wanted something else, it can be very confusing if it happened automatically.

For the question around the indexing behaviour with boolean values in the presence of NAs, I think there are 3 options:

  • Raise an exception
  • Don’t include in the output (interpret NA as False in the filtering operation)
  • Propagate NAs (NA in mask gives NA in output)
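The three options above can be sketched side by side in plain Python, with None standing in for NA (the helper names are hypothetical):

```python
# The three candidate behaviours for boolean masking with missing values.
# None stands in for NA; helper names are made up for illustration.

def filter_raise(values, mask):
    # Option 1: raise an exception when the mask contains NA.
    if any(m is None for m in mask):
        raise ValueError("cannot mask with an NA-containing boolean array")
    return [v for v, m in zip(values, mask) if m]

def filter_skip(values, mask):
    # Option 2: interpret NA as False, i.e. an implicit mask.fillna(False).
    return [v for v, m in zip(values, mask) if m is True]

def filter_propagate(values, mask):
    # Option 3: NA in the mask gives NA in the output (base R / pyarrow style).
    return [v if m is True else None
            for v, m in zip(values, mask) if m is not False]

vals = [1, 2, 3]
mask = [False, None, True]
print(filter_skip(vals, mask))       # [3]
print(filter_propagate(vals, mask))  # [None, 3]
```

The language survey below shows that each of these three behaviours exists in the wild.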

I looked at some other languages / libraries that deal with this.

Postgres (SQL) filters only where True (thus interprets NULL as False in the filtering operation):

CREATE TABLE test_types (
    col1    integer,
    col2    integer
);

INSERT INTO test_types VALUES (1, 1);
INSERT INTO test_types VALUES (2, NULL);
INSERT INTO test_types VALUES (3, 3);
test_db=# SELECT col1, col2 > 2 AS mask FROM test_types;
 col1 | mask 
------+------
    1 | f
    2 | 
    3 | t
(3 rows)

test_db=# SELECT * FROM test_types WHERE col2 > 2;
 col1 | col2 
------+------
    3 |    3
(1 row)

In R, it depends on the function. dplyr’s filter drops NAs: “Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.” (from https://dplyr.tidyverse.org/reference/filter.html). Example:

> df <- tibble(col1 = c(1L, 2L, 3L), col2 = c(1L, NA, 3L))
> df
# A tibble: 3 x 2
   col1  col2
  <int> <int>
1     1     1
2     2    NA
3     3     3
> df %>% mutate(mask = col2 > 2)
# A tibble: 3 x 3
   col1  col2 mask 
  <int> <int> <lgl>
1     1     1 FALSE
2     2    NA NA   
3     3     3 TRUE 
> df %>% filter(col2 > 2)
# A tibble: 1 x 2
   col1  col2
  <int> <int>
1     3     3

In base R, by contrast, NAs propagate (a missing value in the index always yields a missing value in the output; from https://adv-r.hadley.nz/subsetting.html):

> x <- c(1, 2, 3)
> mask <- c(FALSE, NA, TRUE)
> x[mask]
[1] NA  3

Julia currently raises an error (not sure if this is on purpose or just not yet implemented; EDIT: based on https://julialang.org/blog/2018/06/missing this seems to be on purpose):

julia> arr = [1 2 3]
1×3 Array{Int64,2}:
 1  2  3

julia> mask = [false missing true]
1×3 Array{Union{Missing, Bool},2}:
 false  missing  true

julia> arr[mask]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
Stacktrace:
 [1] checkindex(::Type{Bool}, ::Base.OneTo{Int64}, ::Missing) at ./abstractarray.jl:504
 [2] checkindex at ./abstractarray.jl:519 [inlined]
 [3] checkbounds at ./abstractarray.jl:434 [inlined]
 [4] checkbounds at ./abstractarray.jl:449 [inlined]
 [5] _getindex at ./multidimensional.jl:596 [inlined]
 [6] getindex(::Array{Int64,2}, ::Array{Union{Missing, Bool},2}) at ./abstractarray.jl:905
 [7] top-level scope at none:0

Apache Arrow C++ (pyarrow) currently has the same behaviour as base R (propagating):

In [2]: import pyarrow as pa 

In [4]: arr = pa.array([1, 2, 3])

In [5]: mask = pa.array([False, None, True])

In [6]: mask
Out[6]: 
<pyarrow.lib.BooleanArray object at 0x7fa0f1b9f768>
[
  false,
  null,
  true
]

In [7]: arr.filter(mask) 
Out[7]: 
<pyarrow.lib.Int64Array object at 0x7fa0f1b9fd68>
[
  null,
  3
]