pandas: ROADMAP: Consistent missing value handling with new NA scalar
I cleaned up my initial write-up on the consistent missing values proposal (https://github.com/pandas-dev/pandas/issues/27825#issuecomment-520583911) and incorporated the items brought up in the last video chat, so I think it is ready for some more detailed discussion.
The latest version of the full proposal can be found here: https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB
TL;DR:
- I propose to introduce a new scalar (singleton) `pd.NA` that can be used as the missing value indicator (when accessing a single value, not necessarily how it is stored under the hood).
- This can be used instead of `np.nan` or `pd.NaT` in new data types (e.g. nullable integers, potential string dtype).
- Long term, we can see if there is a migration possible to use this consistently for all data types.
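To make the TL;DR concrete, here is a minimal sketch of what the missing value indicator looks like on scalar access. It assumes a pandas release where `pd.NA` and the nullable `Int64` dtype are available (1.0 or later); the variable names are only illustrative and not part of the proposal.

```python
import pandas as pd

# Nullable integer array: the missing entry is stored under the hood
# via a mask, but scalar access reports it as pd.NA.
ints = pd.array([1, None, 3], dtype="Int64")
missing = ints[1]

# pd.NA is a singleton, so an identity check is enough.
is_singleton = missing is pd.NA

# pd.isna recognises the new scalar alongside np.nan and pd.NaT.
detected = pd.isna(missing)

# Comparisons propagate the missing value instead of returning False,
# which is one visible difference from np.nan.
comparison = pd.NA == 1
```

The point of the sketch is only that the scalar the user sees is decoupled from the storage: the array keeps integers plus a mask, and `pd.NA` appears when a single value is accessed.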
About this issue
- State: open
- Created 5 years ago
- Reactions: 3
- Comments: 65 (65 by maintainers)
Commits related to this issue
- MAINT: some hacks to fix pypy support — committed to numpy/numpy by ngoldbaum 7 months ago
I think we need to settle on having a single `pd.NA` or a `pd.NA[T]`, one per dtype, before we’re ready to open a PR.
I think this is the strongest argument for `pd.NA[T]`, rather than a single `pd.NA`. An aggregation or a scalar selection losing the information of whether or not an op should raise seems important.
Couple of comments:
- `Series + pandas.NA` returns the type of `Series`.
- `NA` dtype. I think `pandas.Series([pandas.NA, pandas.NA])` should fail unless `dtype` is specified.
- `NaN` and `NaT`, when present, should be copied to the mask, and then we can forget about them (from what I understand, values with `True` in the NA mask won’t ever be used).
I’d say it’s time to open a PR and continue the discussion there. This thread is getting too long to see what has been discussed or where there is agreement. If parts of this are controversial, maybe we can put some question marks in the roadmap and discuss them again once there is work and we can see in practice everything that’s being discussed.
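Two of the comments above (arithmetic with the scalar keeping the `Series` type, and `NaN` being moved into the mask) can be sketched against the nullable `Int64` dtype as it behaves in pandas 1.0 and later. This only illustrates that one dtype and makes no claim about how the roadmap item should settle the open questions; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Adding the scalar to a nullable-integer Series keeps the Series dtype;
# every element of the result is missing.
s = pd.Series([1, 2, 3], dtype="Int64")
shifted = s + pd.NA
result_dtype = str(shifted.dtype)
all_missing = bool(shifted.isna().all())

# A NaN in the input becomes a masked entry once a nullable dtype is
# requested; after that only the mask matters.
arr = pd.array([1.0, np.nan, 3.0], dtype="Int64")
mask = arr.isna()

# Reductions skip masked entries by default, so the value stored
# under the mask is never used.
total = arr.sum()
```

Here `mask` plays the role of the NA mask mentioned in the last bullet: once the `NaN` has been recorded there, the original float marker is no longer needed.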