scikit-learn: Add handle_missing and handle_unknown options to OrdinalEncoder

category_encoders.ordinal.OrdinalEncoder in scikit-learn-contrib/category_encoders has 2 really useful options:

  1. handle_unknown: options are ‘error’, ‘return_nan’, and ‘value’; defaults to ‘value’, which encodes unknown categories as -1.
  2. handle_missing: options are ‘error’, ‘return_nan’, and ‘value’; defaults to ‘value’, which treats nan as a category at fit time, or encodes it as -2 at transform time if nan was not a category during fit.
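
To make the semantics of these two options concrete, here is a minimal pure-Python sketch of a single-column encoder. It is not the category_encoders implementation: the class name SketchOrdinalEncoder, the helper _is_missing, and the choice to number known categories from 0 are all assumptions made for illustration. Only the option names and the -1 / -2 sentinel values come from the description above.

```python
import math

_MISSING = object()  # internal key standing in for None/NaN (hypothetical)


def _is_missing(x):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return x is None or (isinstance(x, float) and math.isnan(x))


class SketchOrdinalEncoder:
    """Illustrative single-column encoder; not the real library code."""

    def __init__(self, handle_unknown="value", handle_missing="value"):
        self.handle_unknown = handle_unknown
        self.handle_missing = handle_missing

    def fit(self, values):
        self.mapping_ = {}
        for v in values:
            if _is_missing(v):
                if self.handle_missing == "error":
                    raise ValueError("missing value found during fit")
                if self.handle_missing == "return_nan":
                    continue  # missing values never get an integer code
                v = _MISSING  # 'value': missing becomes a regular category
            if v not in self.mapping_:
                self.mapping_[v] = len(self.mapping_)  # codes start at 0 here
        return self

    def transform(self, values):
        out = []
        for v in values:
            if _is_missing(v):
                if self.handle_missing == "error":
                    raise ValueError("missing value found during transform")
                if self.handle_missing == "return_nan":
                    out.append(float("nan"))
                else:
                    # Use the fitted code if NaN was seen at fit time, else -2.
                    out.append(self.mapping_.get(_MISSING, -2))
            elif v in self.mapping_:
                out.append(self.mapping_[v])
            elif self.handle_unknown == "error":
                raise ValueError(f"unknown category: {v!r}")
            elif self.handle_unknown == "return_nan":
                out.append(float("nan"))
            else:
                out.append(-1)  # 'value': unknown categories map to -1
        return out


enc = SketchOrdinalEncoder().fit(["a", "b", "a"])
codes = enc.transform(["b", "c", None])
```

With the defaults, a category unseen at fit time ("c") and a missing value that was not present at fit time (None) each get their own sentinel instead of raising.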

These 2 options are really, really useful for handling real-world data.

Describe the workflow you want to enable

  1. Handle new categories at predict time in OrdinalEncoder (OneHotEncoder already has this option).
  2. Handle NaNs at fit and predict time in OrdinalEncoder.

Describe your proposed solution

Port the logic for handle_unknown and handle_missing from category_encoders.ordinal.OrdinalEncoder

Describe alternatives you’ve considered, if relevant

Just using scikit-learn-contrib/category_encoders instead

Additional context

Every encoder in scikit-learn-contrib/category_encoders has the option handle_unknown and handle_missing, giving users the flexibility to decide how to handle unknown or new values. This consistency in the API makes it really easy to switch between different encoders and try them out in your workflow.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 11
  • Comments: 19 (19 by maintainers)

Most upvoted comments

With 1.1, OrdinalEncoder now handles unknown values and missing values.

no need to IMO, we can just wait for #18968 to be solved for now.

I think I like @zachmayer’s proposal above. Not sure if the handle_missing=use_encoded_value option should be implemented as part of the same PR as the passthrough case of #18968, though.

Furthermore I find the name missing_value is a bit ambiguous, as one might think it refers to the value used as missing value marker in the original input data (to allow the user to consider something different from None, np.nan or pd.NA as a missing value). Maybe the term encoded_missing_value would be more explicit?

@NicolasHug what about adding use_encoded_value as an option to the handle_missing parameter, and then add missing_value as a new parameter?

And then if I wanted to use -1/-2 I could set the unknown_value / missing_value parameters, but sklearn wouldn’t make any decisions on what the implicit ordering should be?

I really like the parameters of category_encoders.ordinal.OrdinalEncoder:

handle_missing: str
options are ‘error’, ‘return_nan’, and ‘value’; defaults to ‘value’, which treats nan as a category at fit time, or encodes it as -2 at transform time if nan was not a category during fit.

This gives the option to error (the current sklearn behavior), return nan (@NicolasHug's suggestion), or return -2 (which can be useful as a way of imputing missing values with OrdinalEncoder).

Personally, I often encode new categories as -2 and missing values as -1. This is kinda nice, because it basically means new categories at predict time will be handled the same as missing categories at training time in tree-based models.

(edit) [I had -1/-2 backwards originally lol]

I’d be interested in such features… But I’m sure there are other issues discussing them, if not open pull requests. So just make sure you’re not conflicting with other work. Thanks