scikit-learn: Add handle_missing and handle_unknown options to OrdinalEncoder
category_encoders.ordinal.OrdinalEncoder in scikit-learn-contrib/category_encoders has 2 really useful options:
- handle_unknown: options are ‘error’, ‘return_nan’ and ‘value’; defaults to ‘value’, which imputes the category -1.
- handle_missing: options are ‘error’, ‘return_nan’ and ‘value’; defaults to ‘value’, which treats nan as a category at fit time, or encodes it as -2 at transform time if nan was not a category during fit.
These 2 options are really, really useful for handling real-world data.
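To make the ‘value’ behavior described above concrete, here is a minimal pure-Python sketch. It is not the category_encoders implementation: the class name `TinyOrdinalEncoder`, the helper `is_missing`, and the choice to number categories from 1 in order of first appearance are all assumptions made for illustration. Only the -1 (unknown) and -2 (missing, unseen at fit time) fallbacks come from the description above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class TinyOrdinalEncoder:
    """Hypothetical illustration of handle_unknown='value' / handle_missing='value'."""

    UNKNOWN_CODE = -1  # category never seen during fit
    MISSING_CODE = -2  # missing value when no missing value was seen during fit

    def fit(self, values):
        self.codes_ = {}       # category -> integer code
        self.nan_code_ = None  # code given to missing values, if seen during fit
        next_code = 1          # assumption: codes start at 1
        for value in values:
            if is_missing(value):
                # Missing values seen at fit time become a regular category.
                if self.nan_code_ is None:
                    self.nan_code_ = next_code
                    next_code += 1
            elif value not in self.codes_:
                self.codes_[value] = next_code
                next_code += 1
        return self

    def transform(self, values):
        encoded = []
        for value in values:
            if is_missing(value):
                if self.nan_code_ is not None:
                    encoded.append(self.nan_code_)
                else:
                    encoded.append(self.MISSING_CODE)
            else:
                encoded.append(self.codes_.get(value, self.UNKNOWN_CODE))
        return encoded


# No missing values at fit time: NaN falls back to -2, a new category to -1.
encoder = TinyOrdinalEncoder().fit(["a", "b", "a"])
result_without_nan = encoder.transform(["b", "c", float("nan")])

# Missing value present at fit time: it gets its own category code.
encoder_with_nan = TinyOrdinalEncoder().fit(["a", float("nan")])
result_with_nan = encoder_with_nan.transform([float("nan"), "z"])
```

In the first case the new category and the missing value stay distinguishable (-1 versus -2), which is the property the two options are meant to provide.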
Describe the workflow you want to enable
- Handle new categories at predict time in OrdinalEncoder (OneHotEncoder already has this option).
- Handle NaNs at fit and predict time in OrdinalEncoder
Describe your proposed solution
Port the logic for handle_unknown and handle_missing from category_encoders.ordinal.OrdinalEncoder.
Describe alternatives you’ve considered, if relevant
Just using scikit-learn-contrib/category_encoders instead
Additional context
Every encoder in scikit-learn-contrib/category_encoders has the options handle_unknown and handle_missing, giving users the flexibility to decide how to handle unknown or new values. This consistency in the API makes it really easy to switch between different encoders and try them out in your workflow.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 11
- Comments: 19 (19 by maintainers)
With 1.1, OrdinalEncoder now handles unknown values and missing values.
no need to IMO, we can just wait for #18968 to be solved for now.
I think I like @zachmayer’s proposal above. Not sure if the handle_missing=use_encoded_value option should be implemented as part of the same PR as the passthrough case of #18968, though. Furthermore, I find the name missing_value a bit ambiguous, as one might think it refers to the value used as the missing-value marker in the original input data (to allow the user to consider something different from None, np.nan or pd.NA as a missing value). Maybe the term encoded_missing_value would be more explicit?
@NicolasHug what about adding use_encoded_value as an option to the handle_missing parameter, and then adding missing_value as a new parameter? Then if I wanted to use -1/-2 I could set the unknown_value/missing_value parameters, but sklearn wouldn’t make any decisions on what the implicit ordering should be?
I really like the parameters of category_encoders.ordinal.OrdinalEncoder. This gives the option to error (current sklearn behavior), return nan (@NicolasHug’s suggestion) or return -2 (which can be useful as a way of imputing missing values with OrdinalEncoders).
Personally, I often encode new categories as -2 and missing values as -1. This is kinda nice, because it basically means new categories at predict time will be handled the same as missing categories at training time in tree-based models. (edit) [I had -1/-2 backwards originally lol]
I’d be interested in such features… But I’m sure there are other issues discussing them, if not open pull requests. So just make sure you’re not conflicting with other work. Thanks