scikit-learn: MultiLabelBinarizer breaks when seeing unseen labels...should there be an option to handle this instead?
Description
I am not sure if it’s intended for MultiLabelBinarizer to fit and transform only seen data or not.
However, there are many times that it is not possible/not in our interest to know all of the classes that we’re fitting at training time. For convenience, I am wondering if there should be another parameter that allows us to ignore the unseen classes by just setting them to 0?
Proposed Modification
Example:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(ignore_unseen=True)
y_train = [['a'],['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)
y_test = [['a'],['b'],['d']]
mlb.transform(y_test)
Result: array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])
(The current version, 0.19.0, raises KeyError: 'd'.)
I can open a PR for this if this is a desired behavior.
Others also have similar issue: https://stackoverflow.com/questions/31503874/using-multilabelbinarizer-on-test-data-with-labels-not-in-the-training-set
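Until such an option exists, one possible workaround is to drop labels that were not seen during fit before calling transform. The following is a minimal sketch, not part of scikit-learn itself; the variable names `known` and `y_test_known` are illustrative only, and it relies solely on the fitted `classes_` attribute.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
y_test = [['a'], ['b'], ['d']]

mlb = MultiLabelBinarizer()
mlb.fit(y_train)

# Labels learned during fit; anything else is treated as unseen.
known = set(mlb.classes_)

# Strip unseen labels from every sample so transform never encounters them.
# A sample left with no labels simply becomes an all-zero row.
y_test_known = [[label for label in labels if label in known]
                for labels in y_test]

result = mlb.transform(y_test_known)
print(result)
```

Because the filtering happens before transform is called, this should behave the same regardless of how a given scikit-learn version reacts to unseen labels.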
About this issue
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 16 (6 by maintainers)
I am fully aware of that. But it is beside the point. If you want to use the MultiLabelBinarizer in e.g. a library, you're still in the position that you have to either:
It seems inconsistent to not have an ignore option in MultiLabelBinarizer when OneHotEncoder has the handle_unknown option. Conditionally muting warnings is bad practice.
Seems like it would just be a matter of adding the option and changing this line: https://github.com/scikit-learn/scikit-learn/blob/fd237278e895b42abe8d8d09105cbb82dc2cbba7/sklearn/preprocessing/_label.py#L993 to
if unknown and handle_unknown=='error':
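To make the suggestion concrete, here is a rough standalone sketch of how such a guard could work. It is not scikit-learn's code: the function `binarize`, its `handle_unknown` parameter, and the accepted values `'error'`, `'warn'` and `'ignore'` are all hypothetical, chosen only to mirror the condition proposed above.

```python
import warnings

import numpy as np


def binarize(y, classes, handle_unknown="warn"):
    """Hypothetical multilabel binarizer with an unknown-label policy."""
    if handle_unknown not in ("error", "warn", "ignore"):
        raise ValueError("handle_unknown must be 'error', 'warn' or 'ignore'")

    class_to_index = {label: i for i, label in enumerate(classes)}
    out = np.zeros((len(y), len(classes)), dtype=int)
    unknown = set()

    for row, labels in enumerate(y):
        for label in set(labels):
            index = class_to_index.get(label)
            if index is None:
                # Remember the unseen label; its column simply stays 0.
                unknown.add(label)
            else:
                out[row, index] = 1

    # The guard from the suggestion: only fail when explicitly asked to.
    if unknown and handle_unknown == "error":
        raise ValueError("unknown class(es): {}".format(sorted(unknown, key=str)))
    if unknown and handle_unknown == "warn":
        warnings.warn("unknown class(es) {} will be ignored".format(
            sorted(unknown, key=str)))
    return out


y_test = [['a'], ['b'], ['d']]
ignored = binarize(y_test, ['a', 'b', 'c'], handle_unknown="ignore")
print(ignored)
```

With `handle_unknown="ignore"` the unseen label `'d'` is dropped silently, so callers would not need to mute warnings themselves.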