scikit-learn: Handle Error Policy in OrdinalEncoder

The preprocessor class OneHotEncoder allows transformation to proceed when unknown values are encountered. It would be great to introduce the same option in OrdinalEncoder. This seems simple to do, since OrdinalEncoder (like OneHotEncoder) derives from _BaseEncoder, which already implements the error-handling policy.
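A minimal pure-Python sketch (not scikit-learn code; the class and parameter names here are hypothetical) of the behavior being requested: an ordinal encoder that either raises on unknown categories or maps them to a sentinel chosen at construction time.

```python
# Toy illustration of the requested feature, not an actual scikit-learn API.
class ToyOrdinalEncoder:
    def __init__(self, handle_unknown="error", unknown_value=None):
        self.handle_unknown = handle_unknown
        self.unknown_value = unknown_value

    def fit(self, values):
        # Assign consecutive integers to the sorted unique training categories.
        self.mapping_ = {c: i for i, c in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        out = []
        for v in values:
            if v in self.mapping_:
                out.append(self.mapping_[v])
            elif self.handle_unknown == "error":
                # Current OrdinalEncoder behavior: fail on unseen categories.
                raise ValueError(f"unknown category: {v!r}")
            else:
                # Requested behavior: fall back to a user-chosen sentinel.
                out.append(self.unknown_value)
        return out

enc = ToyOrdinalEncoder(handle_unknown="use_sentinel", unknown_value=-1)
enc.fit(["cat", "dog"])
print(enc.transform(["dog", "fish"]))  # [1, -1]
```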

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 16
  • Comments: 26 (14 by maintainers)

Most upvoted comments

Personally I find this issue really annoying. At the moment we cannot use OrdinalEncoder on data with a long-tailed distribution of category frequencies in a cross-validation loop without triggering the unknown-category exception at prediction time.

Friends, could you speed up this fix? In real life there are often many differences between the train and test data. As @ogrisel stated above: “Personally I find this issue really annoying.”

Can you at least do what @daskol wrote above?
It could ignore unknown category values as OneHotEncoder does. Another possible scenario (a sentinel, say) would replace an unknown value with a default one specified in OrdinalEncoder’s constructor.

Yes, just ignore unknown categories in transform: replace an unknown value with a default that could be specified in OrdinalEncoder’s constructor, or set it to None… but please do something!

We discussed the issue with @jorisvandenbossche and I think the sanest strategy would be to have:

  • min_frequency=5 (5 is an example, the default could be 1) to set the threshold to collapse all categories that appear less than 5 times in the training set into a virtual category
  • rare_category="rare_value" as a parameter to control the name of the virtual category used to map all the rare values. This will mostly be useful for inverse_transform, and for consistency if we implement a similar category-collapsing option in other categorical variable preprocessors such as OneHotEncoder and ImpactEncoder / TargetEncoder (whatever its name).
  • handle_unknown="treat_as_rare" that would map any inference-time unknown category to the integer mapped to the virtual rare_category (even when that category is never seen in the training set).

IMO handle_unknown="treat_as_rare" is the sanest way to handle inference-time unknown values from a statistical point of view.
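A sketch of the proposed strategy in pure Python (the parameter names `min_frequency`, `rare_category`, and the "treat as rare" policy come from the comment above; this is not an actual scikit-learn API): collapse training categories seen fewer than `min_frequency` times into one virtual "rare" code, and map inference-time unknowns to that same code.

```python
from collections import Counter

def fit_with_rare(values, min_frequency=2, rare_category="rare_value"):
    # Keep only categories that appear at least min_frequency times.
    counts = Counter(values)
    frequent = sorted(c for c, n in counts.items() if n >= min_frequency)
    mapping = {c: i for i, c in enumerate(frequent)}
    rare_code = len(frequent)  # one shared code for all rare/unknown values
    return mapping, rare_code, rare_category

def transform_with_rare(values, mapping, rare_code):
    # handle_unknown="treat_as_rare": rare AND unseen both get rare_code.
    return [mapping.get(v, rare_code) for v in values]

mapping, rare_code, _ = fit_with_rare(["a", "a", "b", "b", "c"], min_frequency=2)
print(transform_with_rare(["a", "c", "zzz"], mapping, rare_code))  # [0, 2, 2]
```

Note how the infrequent training category "c" and the never-seen "zzz" become statistically indistinguishable, which is the point of the proposal.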

handle_unknown : ‘error’ or ‘ignore’, default=’error’.

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
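The OneHotEncoder semantics quoted above can be illustrated with a tiny pure-Python sketch (not the actual scikit-learn implementation): with the "ignore" policy, an unknown category encodes as an all-zeros row rather than raising.

```python
def one_hot_fit(values):
    # Learn the sorted unique categories from the training data.
    return sorted(set(values))

def one_hot_transform(values, categories):
    rows = []
    for v in values:
        row = [0] * len(categories)
        if v in categories:
            row[categories.index(v)] = 1
        # else: leave the row all zeros (the "ignore" policy)
        rows.append(row)
    return rows

cats = one_hot_fit(["red", "green"])          # -> ["green", "red"]
print(one_hot_transform(["green", "blue"], cats))
# [[1, 0], [0, 0]]  -- "blue" was never seen, so its row is all zeros
```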

It could ignore unknown category values as OneHotEncoder does. Another possible scenario (a sentinel, say) would replace an unknown value with a default one specified in OrdinalEncoder’s constructor.

No, #12264 is not my case, but it is desirable too. I would like OrdinalEncoder not to throw an exception when it meets an unexpected value, i.e. a value that categories does not contain.

Order is not important. The idea is to make it practicable, since for now it is not. The distilled case: in real data there are always new categories in the test data (categories unseen in the training data), and then OrdinalEncoder just crashes. Please add the capability to handle cases where test data and training data have different categories, i.e. the test data has categories that are not present in the train data.

also see #12153 for something related. I think min_frequency is good, but I also want max_levels or something like that. Basically we could reimplement all the different pruning options we have in CountVectorizer…

The policy of OneHotEncoder for unknown categories is to set all the columns to 0. That wouldn’t be great for OrdinalEncoder, since unknown categories and the first category (i.e. zero) would be indistinguishable.

Allowing a fallback category value seems reasonable to me.

@daskol could you please describe your particular use-case that motivates this change?

I think that this would be a really helpful change. Given the choice of a default virtual encoding for unseen values, I would rather it be the first value as opposed to the last value.

I feel that this is intuitive because you always know where the first value is without looking (e.g. if the default encoding were -1 or 0).
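A pure-Python sketch of this preference (hypothetical, not a scikit-learn API): reserve code 0 for unseen values so the fallback is always "the first value", and number the training categories from 1 upward.

```python
def fit_reserve_zero(values):
    # Training categories get codes 1..n; code 0 is reserved for unknowns.
    return {c: i + 1 for i, c in enumerate(sorted(set(values)))}

def transform_reserve_zero(values, mapping):
    return [mapping.get(v, 0) for v in values]  # unknown -> 0

mapping = fit_reserve_zero(["low", "mid", "high"])  # high:1, low:2, mid:3
print(transform_reserve_zero(["low", "unseen"], mapping))  # [2, 0]
```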

On a practical use case, I am converting categorical data to numeric data to use with the RandomForestRegressor.

Some of my categories are quite small (~5 types) while some are larger (~5000 types).

I would like to use the OneHotEncoder on my smaller types because it will make my features more interpretable when I look at their permutation importance.

I would like to use the OrdinalEncoder on my larger categories because it will make actions like permutation importance less computationally expensive and my model choice will be robust to the effect of the OrdinalEncoder’s choice of ordering the features.

However, I will no doubt encounter a number of examples in my test data for the larger categories that are not present in my training dataset, so if I use the OrdinalEncoder it will throw an error.

My other alternative is to do something hacky with Pandas or the DictVectorizer.

I totally agree @amueller. I was using it for preprocessing data before LightGBM, which requires integer data, but you can flag certain columns as categorical and so the order does NOT matter. I am confused about what “order matters” means because by its very nature, there is no correct order of the classes. I’m not sure that an “order matters” OrdinalEncoder makes sense to me, when no order is specified.

Hi! I wanted to add that it seems to me that the feature of allowing the encoder to be fit on data that contains some categories, and then applied to data that contains maybe an additional category or two, seems like a common use case for all kinds of categorical data. Even if one category isn’t very uncommon, if you’re doing a random split for training and validation it’s only a matter of time before this error comes up. I like the “min_frequency” solution for its generality, but to (naive) me, it seems too complicated. To me, it seems the default behavior should be to send all categories not present in the original fitting to a single virtual category. Or maybe a “create_virtual_category = True” option. If this is amenable, I’d be happy to take a crack at making it, I’m trying to spend more time working on open source code!

If we add a “rare” category to OrdinalEncoder it seems like it goes against the “Ordinal” part of the encoder. It assumes the rare or unknown category has the highest (or lowest) value. If we do introduce this, it would be good to document this behavior.

Ideally, if OrdinalEncoder can handle most of the logic that deals with unknown and infrequent categories, OneHotEncoder would need to do less. (I am thinking of composition, i.e. OneHotEncoder has an OrdinalEncoder.)
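The composition idea can be sketched in pure Python (function names are illustrative, not the actual scikit-learn internals): a one-hot step delegates the category bookkeeping to an ordinal encoding, then expands the integer codes into indicator columns, with unknown codes yielding all-zero rows.

```python
def ordinal_fit(values):
    # The ordinal layer owns the category -> integer mapping.
    return {c: i for i, c in enumerate(sorted(set(values)))}

def one_hot_via_ordinal(values, mapping, unknown_code=-1):
    n = len(mapping)
    rows = []
    for v in values:
        code = mapping.get(v, unknown_code)
        row = [0] * n
        if code >= 0:          # unknown codes yield all-zero rows
            row[code] = 1
        rows.append(row)
    return rows

mapping = ordinal_fit(["a", "b"])
print(one_hot_via_ordinal(["b", "zzz"], mapping))  # [[0, 1], [0, 0]]
```

With this split, the unknown/infrequent-category policy lives in one place (the ordinal mapping), and the one-hot expansion only decides how to render an out-of-range code.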