scikit-learn: Handle Error Policy in OrdinalEncoder
The preprocessor class `OneHotEncoder` allows transformation when unknown values are found. It would be great to introduce the same option in `OrdinalEncoder`. It seems simple to do, since `OrdinalEncoder` (as well as `OneHotEncoder`) is derived from `_BaseEncoder`, which actually implements the error-handling policy.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 16
- Comments: 26 (14 by maintainers)
Personally I find this issue really annoying. At the moment we cannot use `OrdinalEncoder` on data with a long-tailed distribution of categorical variable frequencies in a cross-validation loop without triggering the unknown-category exception at prediction time.

Friends, may you speed up this fix? In real life there are often many differences between training and test data. As stated above by @ogrisel: "Personally I find this issue really annoying."
Can you do at least this, as @daskol wrote above:

> I could ignore unknown category values as OneHotEncoder does. Another possible scenario (say, sentinel) could replace an unknown value with a default one, which could be specified in OrdinalEncoder's constructor.

Yes, just ignore unknown categories in transform: replace an unknown value with a default one specified in `OrdinalEncoder`'s constructor, or set it to None… but please do something!
We discussed the issue with @jorisvandenbossche and I think the sanest strategy would be to have:

- `min_frequency=5` (5 is an example; the default could be 1) to set the threshold for collapsing all categories that appear fewer than 5 times in the training set into a virtual category;
- `rare_category="rare_value"` as a parameter to control the name of the virtual category used to map all the rare values. This will mostly be useful for `inverse_transform`, and for consistency if we implement a similar category-collapsing option in other categorical variable preprocessors such as `OneHotEncoder` and `ImpactEncoder`/`TargetEncoder` (whatever its name);
- `handle_unknown="treat_as_rare"`, which would map any inference-time unknown category to the integer mapped to the virtual `rare_category` (even when it is never used in the training set).

IMO `handle_unknown="treat_as_rare"` is the sanest way to handle inference-time unknown values from a statistical point of view.

> I could ignore unknown category values as OneHotEncoder does. Another possible scenario (say, sentinel) could replace an unknown value with a default one, which could be specified in OrdinalEncoder's constructor.

No, #12264 is not my case, but it is desirable too. I would like `OrdinalEncoder` not to throw an exception when it meets an unexpected value, i.e. a value that `categories` does not contain. Order is not important; the idea is to make the encoder practicable, since for now it is not. Distilled case: in real data there are always new categories in the test data (categories unseen in the training data), and then the ordinal encoder just crashes. Please add the capability to handle cases where the test data has categories that are not present in the training data.
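To make the proposal concrete, here is a hypothetical pure-Python sketch of the proposed semantics for a single column. The names `min_frequency`, `rare_category`, and the `treat_as_rare` behavior are taken from the proposal above; this is not an existing scikit-learn API:

```python
from collections import Counter

class RareAwareOrdinalEncoder:
    """Hypothetical sketch of the proposed semantics; not the sklearn API."""

    def __init__(self, min_frequency=1, rare_category="rare_value"):
        self.min_frequency = min_frequency
        self.rare_category = rare_category

    def fit(self, column):
        counts = Counter(column)
        # Categories seen at least min_frequency times keep their own code...
        frequent = sorted(c for c, n in counts.items() if n >= self.min_frequency)
        self.mapping_ = {c: i for i, c in enumerate(frequent)}
        # ...everything else (and any value unseen at fit time) collapses
        # to one shared code.
        self.rare_code_ = len(frequent)
        return self

    def transform(self, column):
        # handle_unknown="treat_as_rare": unknown values share the rare code.
        return [self.mapping_.get(c, self.rare_code_) for c in column]

    def inverse_transform(self, codes):
        inverse = {i: c for c, i in self.mapping_.items()}
        return [inverse.get(i, self.rare_category) for i in codes]

enc = RareAwareOrdinalEncoder(min_frequency=2).fit(["a", "a", "b", "b", "c"])
print(enc.transform(["a", "c", "z"]))   # → [0, 2, 2]: "c" is rare, "z" is unknown
print(enc.inverse_transform([0, 2]))    # → ['a', 'rare_value']
```

Collapsing rare training categories and unknown inference-time categories onto the same code keeps `inverse_transform` well defined: both map back to the `rare_category` label.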
Also see #12153 for something related. I think `min_frequency` is good, but I also want `max_levels` or something like that. Basically we could reimplement all the different pruning options we have in `CountVectorizer`…
The policy of `OneHotEncoder` for unknown categories is to set all the columns to 0. That wouldn't be great for `OrdinalEncoder`, since unknown categories and the first category (i.e. zero) would be mixed up.

Allowing a fallback category value seems reasonable to me.
@daskol could you please describe your particular use-case that motivates this change?
I think that this would be a really helpful change. Given the choice of a default virtual encoding for unseen values, I would rather it be the first value as opposed to the last value.
I feel that this is intuitive because you always know where the first value is without looking (e.g. if the default encoding were -1 or 0).
On a practical use case, I am converting categorical data to numeric data to use with the RandomForestRegressor.
Some of my categories are quite small (~5 types) while some are larger (~5000 types).
I would like to use the OneHotEncoder on my smaller types because it will make my features more interpretable when I look at their permutation importance.
I would like to use the OrdinalEncoder on my larger categories because it will make actions like permutation importance less computationally expensive and my model choice will be robust to the effect of the OrdinalEncoder’s choice of ordering the features.
However, I will no doubt encounter a number of examples in my test data for the larger categories that are not present in my training dataset, so if I use the OrdinalEncoder it will throw an error.
My other alternative is to do something hacky with Pandas or the DictVectorizer.
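For reference, one such pandas workaround is to freeze the category set from the training data; `pd.Categorical` assigns the sentinel code -1 to anything outside the declared categories instead of raising an error:

```python
import pandas as pd

train = pd.Series(["a", "b", "b", "c"])
test = pd.Series(["b", "z"])  # "z" unseen during training

# Fix the category set from the training data; anything outside it
# receives the sentinel code -1 instead of raising an error.
categories = train.unique()
train_codes = pd.Categorical(train, categories=categories).codes
test_codes = pd.Categorical(test, categories=categories).codes
print(list(test_codes))  # → [1, -1]: "z" becomes -1
```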
I totally agree @amueller. I was using it for preprocessing data before LightGBM, which requires integer data, but you can flag certain columns as categorical and so the order does NOT matter. I am confused about what “order matters” means because by its very nature, there is no correct order of the classes. I’m not sure that an “order matters” OrdinalEncoder makes sense to me, when no order is specified.
Hi! I wanted to add that it seems to me that the feature of allowing the encoder to be fit on data that contains some categories, and then applied to data that contains maybe an additional category or two, seems like a common use case for all kinds of categorical data. Even if one category isn’t very uncommon, if you’re doing a random split for training and validation it’s only a matter of time before this error comes up. I like the “min_frequency” solution for its generality, but to (naive) me, it seems too complicated. To me, it seems the default behavior should be to send all categories not present in the original fitting to a single virtual category. Or maybe a “create_virtual_category = True” option. If this is amenable, I’d be happy to take a crack at making it, I’m trying to spend more time working on open source code!
If we add a "rare" category to `OrdinalEncoder`, it seems like it goes against the "ordinal" part of the encoder: it assumes the rare or unknown category has the highest (or lowest) value. If we do introduce this, it would be good to document that behavior.

Ideally, if `OrdinalEncoder` can handle most of the logic that deals with unknown and infrequent categories, `OneHotEncoder` would need to do less. (I am thinking of composition, i.e. `OneHotEncoder` has an `OrdinalEncoder`.)
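That composition idea could be sketched roughly as follows (hypothetical; this is not how scikit-learn is actually implemented). It assumes scikit-learn ≥ 0.24 for `handle_unknown="use_encoded_value"`:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([["a"], ["b"], ["c"], ["b"]])
test = np.array([["b"], ["z"]])  # "z" unseen during training

# Hypothetical composition: let OrdinalEncoder own the unknown-category
# logic (unknown -> -1), then derive one-hot columns from the codes.
oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(train)
codes = oe.transform(test).astype(int).ravel()
n = len(oe.categories_[0])
# Row i of the identity matrix is the one-hot vector for code i;
# mask out rows whose code is the -1 sentinel so unknowns become all zeros.
one_hot = np.eye(n)[codes] * (codes >= 0)[:, None]
print(one_hot)  # "b" row is [0, 1, 0]; "z" row is all zeros
```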