scikit-learn: MultiLabelBinarizer breaks when seeing unseen labels...should there be an option to handle this instead?

Description

I am not sure if it’s intended for MultiLabelBinarizer to fit and transform only seen data or not.

However, it is often not possible, or not in our interest, to know all of the classes at training time. For convenience, I am wondering whether there should be another parameter that lets us ignore unseen classes by just setting them to 0.

Proposed Modification

Example:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(ignore_unseen=True)

y_train = [['a'],['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)

y_test = [['a'],['b'],['d']]
mlb.transform(y_test)

Result: array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])

(The current version, 0.19.0, raises KeyError: 'd' instead.)
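Until such an option exists, one workaround is to drop the unseen labels before calling transform. This is only a minimal sketch using the same y_train and y_test as above; the names known and y_test_known are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
y_test = [['a'], ['b'], ['d']]

mlb = MultiLabelBinarizer()
mlb.fit(y_train)

# Keep only labels the binarizer saw during fit; a sample whose labels
# are all unseen becomes an empty list, i.e. an all-zero row.
known = set(mlb.classes_)
y_test_known = [[label for label in labels if label in known]
                for labels in y_test]

result = mlb.transform(y_test_known)
print(result.tolist())  # [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
```

This gives the result proposed above, at the cost of an extra pass over the test labels.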

I can open a PR for this if this is desired behavior.

Others have run into a similar issue: https://stackoverflow.com/questions/31503874/using-multilabelbinarizer-on-test-data-with-labels-not-in-the-training-set

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 16 (6 by maintainers)

Most upvoted comments

I am fully aware of that, but it is beside the point. If you want to use the MultiLabelBinarizer in, e.g., a library, you're still in the position that you have to either:

  • Filter the data before throwing it in the Binarizer just to prevent those warnings (which wastes compute and adds code complexity)
  • Introduce unexpected side effects (which is always bad), where the side effect is either:
    • A needless warning, prompting users to think something is wrong when nothing worth their attention is happening.
    • Muting warnings, which the library's user might not be aware of. Note that even though this warning makes no sense for the MultiLabelBinarizer (when classes are passed in the constructor), it might make sense in other scenarios, so you do not want to mute it.

It seems inconsistent to not have an ignore option in MultiLabelBinarizer when OneHotEncoder has the handle_unknown option. Conditionally muting warnings is bad practice.
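For illustration, the scoped muting that library code ends up writing looks roughly like the sketch below. It assumes a scikit-learn release in which transform ignores unknown labels and emits a UserWarning (in 0.19.0 the same call raises KeyError instead). The "unknown class" message prefix used in the filter is an assumption about the warning text, not something stated in this thread.

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer

# Classes are fixed up front, so unseen labels at transform time are expected.
mlb = MultiLabelBinarizer(classes=['a', 'b', 'c'])
mlb.fit([['a'], ['a', 'b'], ['a', 'b', 'c']])

y_test = [['a'], ['b'], ['d']]

# Silence only this one warning, and only inside this block, so that
# unrelated warnings elsewhere in the program are left untouched.
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="unknown class",  # assumed prefix of the warning message
        category=UserWarning,
    )
    result = mlb.transform(y_test)
```

Even scoped like this, the caller has to know the warning text, which is part of why a constructor option seems cleaner.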

Seems like it would just be a matter of adding the option and changing this line: https://github.com/scikit-learn/scikit-learn/blob/fd237278e895b42abe8d8d09105cbb82dc2cbba7/sklearn/preprocessing/_label.py#L993 to if unknown and handle_unknown=='error':
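As a rough picture of the behavior such an option would give, here is a user-side sketch. The helper transform_labels and its handle_unknown argument are hypothetical and not part of scikit-learn; the helper only mirrors the condition quoted above from outside the library, and it assumes y is a list (it is iterated twice).

```python
from sklearn.preprocessing import MultiLabelBinarizer


def transform_labels(mlb, y, handle_unknown="error"):
    """Hypothetical helper: binarize y, either rejecting or dropping unseen labels."""
    if handle_unknown not in ("error", "ignore"):
        raise ValueError("handle_unknown must be 'error' or 'ignore'")

    known = set(mlb.classes_)
    unknown = {label for labels in y for label in labels if label not in known}

    # Same shape as the condition suggested for the library itself.
    if unknown and handle_unknown == "error":
        raise ValueError(f"unknown class(es): {sorted(unknown, key=str)}")

    # With 'ignore', unseen labels are dropped before binarizing.
    filtered = [[label for label in labels if label in known] for labels in y]
    return mlb.transform(filtered)


mlb = MultiLabelBinarizer().fit([['a'], ['a', 'b'], ['a', 'b', 'c']])
ignored = transform_labels(mlb, [['a'], ['b'], ['d']], handle_unknown="ignore")
```

With handle_unknown="error" the same call raises ValueError, so the strict behavior stays the default and ignoring is opt-in.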