scikit-learn: Handle missing values in OneHotEncoder
A minimum implementation might translate a NaN in input to a row of NaNs in output. I believe this would be the most consistent default behaviour with respect to other preprocessing tools, and with reasonable backwards-compatibility, but other core devs might disagree (see https://github.com/scikit-learn/scikit-learn/issues/10465#issuecomment-394439632).
NaN should also be excluded from the categories identified in fit.
A handle_missing parameter might allow NaN in input to be:
- replaced with a row of NaNs as above
- replaced with a row of zeros
- represented with a separate one-hot column
in the output.
A missing_values parameter might allow the user to configure what object is a placeholder for missingness (e.g. NaN, None, etc.).
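To make the three proposed behaviours concrete, here is a minimal sketch written with plain NumPy and pandas. The helper one_hot_with_missing is hypothetical and is not scikit-learn API; the option strings 'all-missing', 'all-zero' and 'category' are borrowed from a later comment in this thread, and treating both NaN and None as missing is an assumption made only for this illustration.

```python
import numpy as np
import pandas as pd


def one_hot_with_missing(values, handle_missing="all-zero"):
    """Hypothetical helper (not part of scikit-learn) for a single column."""
    values = pd.Series(values, dtype=object)
    missing = values.isna().to_numpy()  # True for NaN and None
    # Missing values are excluded from the learned categories.
    categories = sorted(values[~missing].unique())

    out = np.zeros((len(values), len(categories)), dtype=float)
    for j, cat in enumerate(categories):
        out[:, j] = (values == cat).to_numpy()

    if handle_missing == "all-missing":
        out[missing, :] = np.nan  # row of NaNs
    elif handle_missing == "category":
        # extra one-hot column that flags missingness
        out = np.column_stack([out, missing.astype(float)])
    elif handle_missing != "all-zero":
        raise ValueError(f"unknown handle_missing: {handle_missing!r}")
    # "all-zero": missing rows already consist of zeros
    return categories, out


column = ["a", None, "b", "a"]
for option in ("all-missing", "all-zero", "category"):
    cats, encoded = one_hot_with_missing(column, option)
    print(option, cats)
    print(encoded)
```

With 'category' the output gains one column, so downstream code sees a different width than with the other two options.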
See #10465 for background
About this issue
- State: closed
- Created 6 years ago
- Reactions: 23
- Comments: 17 (15 by maintainers)
Commits related to this issue
- Improved support for the 'OneHotEncoder' transformation type See https://github.com/scikit-learn/scikit-learn/issues/11996 — committed to jpmml/jpmml-sklearn by vruusmann 3 years ago
Perhaps:
- handle_missing='all-missing'
- handle_missing='all-zero'
- handle_missing='category'

A good idea might be to start by writing things other than the implementation:
I am also +1 for not supporting the option that would generate a row of nans, it sounds like YAGNI to me.
Let’s consider the following data case with a CSV file with 2 categorical columns, where one uses string labels and the other uses integer labels:
So by default pandas will use float64 dtype for the int-valued column so as to be able to use nan as the missing value marker.
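The dtype behaviour described here can be checked with a small sketch. The CSV content and the column names color and size_code are made up for illustration; they are not the data from the original comment.

```python
import io

import pandas as pd

# Two categorical columns: string labels and integer labels,
# each with one missing entry (an empty CSV field).
csv_text = "color,size_code\nred,1\n,2\nblue,\nred,3\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)
print(df)
```

Because NaN is a float, the integer-labelled column cannot stay an integer column once a value is missing.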
It’s actually possible to use SimpleImputer with the constant strategy on this kind of heterogeneously typed data, as it will convert it to a numpy array with object dtype:
However, putting string values in an otherwise float-valued column is weird and causes the OneHotEncoder to crash on that column:
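A rough reconstruction of the two steps just described, using a made-up two-column frame (the column names and the "missing" fill value are assumptions). The exact exception raised by OneHotEncoder on a mixed string/float column may differ between scikit-learn versions, so the sketch catches it rather than asserting a particular message.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "color": pd.Series(["red", np.nan, "blue", "red"], dtype=object),
    "size_code": [1.0, 2.0, np.nan, 3.0],
})

# Constant imputation on mixed-type data: the result is an object array.
imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_imputed = imputer.fit_transform(df)
print(X_imputed.dtype)
print(X_imputed)

# The second column now mixes floats with the string "missing".
try:
    OneHotEncoder().fit(X_imputed[:, [1]])
except (TypeError, ValueError) as exc:
    print("OneHotEncoder failed on the mixed column:", type(exc).__name__)
```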
Using the debugger to see the underlying exception reveals:
One could use the column transformer to split the string-valued categories from the number-valued categorical columns and use a suitable fill_value for constant imputing on each side.
However, from a usability standpoint it would make sense to have OneHotEncoder be able to directly do constant imputation with handle_missing="indicator".
We could also implement the zero strategy with handle_missing="zero". We need to decide about the default.
We also need to make sure that nan passed only at transform time (without being seen in this column at fit time) is accepted (with the zero encoding) so that cross-validation is possible on data with just a few missing values that might end up all in the validation split by chance.
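The column transformer workaround can be sketched with existing scikit-learn pieces. The data, the column names, and the fill values "missing" and -1 are assumptions for illustration. Setting handle_unknown="ignore" on the encoder is one way to get the zero encoding when a missing value shows up only at transform time, because the imputed placeholder is then an unseen category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# No missing values at fit time ...
train = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red"], dtype=object),
    "size_code": [1.0, 2.0, 1.0],
})
# ... but some in the validation split.
valid = pd.DataFrame({
    "color": pd.Series([np.nan, "blue"], dtype=object),
    "size_code": [2.0, np.nan],
})

string_cats = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
numeric_cats = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=-1),
    OneHotEncoder(handle_unknown="ignore"),
)
encoder = ColumnTransformer(
    [("str", string_cats, ["color"]), ("num", numeric_cats, ["size_code"])],
    sparse_threshold=0,  # always return a dense array, for readability
)

encoder.fit(train)
X_valid = encoder.transform(valid)
# Columns: color=blue, color=red, size_code=1.0, size_code=2.0
print(X_valid)
```

Each missing entry ends up as zeros within its own block of columns, which is the behaviour requested above for handle_missing="zero", at the cost of two pipelines instead of one encoder parameter.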
I’m not sure if having the row of NaNs is worth supporting. It seems to make this much trickier as well. I think the separate value makes a ton of sense, and if people don’t want that, they can use the imputer first. This should simplify the treatment in OHE.
Given my work on dabl, right now I’m more concerned with making things possible at all than making them very easy with sklearn.
What I found most annoying within this complex of things (and it’s only tangentially related but not sure which issue would be the correct one, #2888 maybe?) is that I can’t actually use the “constant” strategy on the categorical columns within a ColumnTransformer. My naive approach would be to separate continuous and categorical, fill the nans in categorical with the string “missing” and then do one-hot-encoding. However, that only works if the categorical variables were strings; otherwise SimpleImputer will raise an error. Alternatively we could use “most_frequent” in SimpleImputer and add a MissingIndicator, but that seems less intuitive to me: it would mean the most frequent feature category would be “1” and the “missing” feature would be “1” as well.
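The “most_frequent” plus indicator alternative can be illustrated on a single integer-coded column. The data is made up, and add_indicator=True on SimpleImputer is used here as a shortcut for stacking a separate MissingIndicator.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# One categorical column with integer codes; the last row is missing.
X = np.array([[1.0], [1.0], [2.0], [np.nan]])

imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
Xt = imputer.fit_transform(X)  # column 0: imputed codes, column 1: missing flag

encoded = OneHotEncoder().fit_transform(Xt[:, [0]]).toarray()
combined = np.hstack([encoded, Xt[:, [1]]])
# Columns: code=1, code=2, was_missing
print(combined)
```

In the last row both the most frequent category and the missing flag are set, which is the less intuitive encoding the comment refers to.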
Any updates on this?
I can’t fit a dataset that contains ‘nan’ values.
I’m trying to confirm I understand the task correctly. Help me run through this pseudocode and point out anywhere I might be wrong.
Add a missing_values parameter to the init method of the OneHotEncoder class. This parameter will allow users to specify what should be taken as a missing value. Available options should be either:
Add a handle_missing parameter to the init method of the OneHotEncoder class. This parameter will allow users to specify what happens to missing values as specified by the missing_values parameter. Available options should be either: