scikit-learn: Error in using multi-label classification in partial_fit() in OvR

Description

Calling partial_fit() on a OneVsRestClassifier() with multi-label targets raises errors. With fit(), the same data raises no errors and everything works.

Steps/Code to Reproduce

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

categories = ['a','b','c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b', 'c'],['a', 'b']] 

mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)
clf = OneVsRestClassifier(MultinomialNB(alpha=0.01))

X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)

# Case1
clf.partial_fit(X_train, Y_train, categories)
# Case2
clf.partial_fit(X_train, Y_train, mlb.transform(Y))

Description of code

  • Case1: using classes=categories without transforming them, i.e. partial_fit(X_train, Y_train, classes=categories)

      ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
    
  • Case2: using classes=mlb.transform(categories), i.e. the classes transformed by the same MultiLabelBinarizer, as in partial_fit(X_train, Y_train, classes=mlb.transform(categories))

       ValueError: The object was not fitted with multilabel input.
    

Expected Results

No error is thrown, just as when using fit().

Actual Results

  • Case1

Traceback (most recent call last):
  File "/path_to_module/Check.py", line 18, in <module>
    clf.partial_fit(X_train, Y_train, categories)
  File "/library/python2.7/dist-packages/sklearn/multiclass.py", line 260, in partial_fit
    if np.setdiff1d(y, self.classes_):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

  • Case2

Traceback (most recent call last):
  File "/path_to_module/Check.py", line 18, in <module>
    clf.partial_fit(X_train, Y_train, mlb.transform(Y))
  File "/library/python2.7/dist-packages/sklearn/multiclass.py", line 265, in partial_fit
    Y = self.label_binarizer_.transform(y)
  File "/library/python2.7/dist-packages/sklearn/preprocessing/label.py", line 329, in transform
    raise ValueError("The object was not fitted with multilabel"
ValueError: The object was not fitted with multilabel input.
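
The Case1 failure can be reproduced with NumPy alone. The sketch below is only an illustration, not the scikit-learn code: classes_stub is a hypothetical integer stand-in for the class names, chosen so that none of the 0/1 entries of the indicator matrix appear among the classes, which is presumably what happens in Case1. np.setdiff1d then returns more than one element, and using that array as an if condition raises the ValueError shown in the traceback.

```python
import numpy as np

# Indicator matrix shaped like Y_train in the report.
y = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 1, 0]])

# Hypothetical stand-in for ['a', 'b', 'c']: values that never occur in y.
classes_stub = np.array([7, 8, 9])

# Values present in y but absent from the classes (y is flattened first).
unseen = np.setdiff1d(y, classes_stub)

# Using a multi-element array as a condition is ambiguous and raises.
try:
    if unseen:
        pass
    raised = False
except ValueError:
    raised = True

# An explicit size check asks "are there unseen labels?" unambiguously.
has_unseen = unseen.size > 0

print(unseen, raised, has_unseen)
```

An explicit size check avoids the ambiguity, but it would not by itself make Case1 work, since the indicator values and the class names still live in different label spaces.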

Versions

Linux-3.16.0-77-generic-x86_64-with-Ubuntu-14.04-trusty
('Python', '2.7.6 (default, Oct 26 2016, 20:30:19) \n[GCC 4.8.4]')
('NumPy', '1.12.0')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.1')

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 6
  • Comments: 18 (12 by maintainers)

Most upvoted comments

Currently trying to incrementally train a multi-label dataset that’s too big to fit in memory. Would be awesome if this started working! 😃

Hello, is anyone following this issue? I think the problem is on line 259 of multiclass.py:

self.label_binarizer_.fit(self.classes_)

self.classes_ is the array of all possible classes passed in through the OneVsRestClassifier’s partial_fit call. Feeding it directly to the label binarizer makes the binarizer treat the task as a multi-class problem.

I think a solution is to detect when y is in indicator format, and thus a multilabel task, and in that case to create a temporary vector of all ones with length len(self.classes_) and feed that to the label binarizer. The binarizer will then correctly recognize the input as multilabel.

My fix is available on my repository: https://github.com/albertauyeung/scikit-learn/tree/fix-ovr-multilabel-partial-fit

Thanks.
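
The diagnosis and the proposed fix in the comment above can be checked in isolation. The following is a rough sketch under assumptions, not the code from the linked branch: it reads the "temporary vector" as a single all-ones row with one column per class, and the names lb_flat, lb_ones, and ones_row are made up for illustration. It uses LabelBinarizer and type_of_target directly, outside of OneVsRestClassifier.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils.multiclass import type_of_target

classes = np.array(['a', 'b', 'c'])
Y_indicator = np.array([[1, 1, 0],
                        [0, 1, 1],
                        [1, 1, 0]])

# A flat list of class names looks like a multiclass target.
flat_type = type_of_target(classes)

# Fitted on the flat class list, the binarizer rejects indicator input.
lb_flat = LabelBinarizer().fit(classes)
try:
    lb_flat.transform(Y_indicator)
    rejected = False
except ValueError:
    rejected = True

# One all-ones row with a column per class is detected as multilabel.
ones_row = np.ones((1, len(classes)), dtype=int)
lb_ones = LabelBinarizer().fit(ones_row)
passthrough = lb_ones.transform(Y_indicator)

print(flat_type, rejected, lb_ones.y_type_)
```

Note that a binarizer fitted this way records column indices as its classes_ rather than the original class names, so the names would have to be tracked separately.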

Hello,

I added self.y_type_ = 'multilabel-indicator' before line 329 of label.py and it worked for me 😃. Not sure whether this affects other computations, though.

Thanks!

@dokato Yes, I agree with you that Case1 is just a misuse of partial_fit(). But I tried various combinations, such as classes=Y_train and a transformed Y_train (as in Case2). Each of them still detects the target as multi-class and throws the same error as in Case2.

@jnothman fit() automatically infers the type of target as 'multilabel' from Y_train. For the first call to partial_fit(), however, we need to pass all classes, and type_of_target is inferred from those, which is wrongly detected as multiclass. I could not find a way to supply classes to partial_fit() so that the target is detected as multilabel. This is the issue. There may be some misunderstanding of the API on my side. Is there any safe way to pass this into partial_fit()?
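
One way to sidestep the question, offered here as an assumption and not as the intended API usage, is to skip OneVsRestClassifier and keep one binary MultinomialNB per label column, calling partial_fit on each with classes=[0, 1]. In the sketch below the estimators list is a hypothetical helper, and alternate_sign=False replaces non_negative=True on the assumption of a scikit-learn release newer than the 0.18.1 in the report, where non_negative is no longer available.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'], ['b', 'c'], ['a', 'b']]

mlb = MultiLabelBinarizer(classes=categories)
# alternate_sign=False keeps features non-negative, as MultinomialNB expects.
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
                               alternate_sign=False)

X_train = vectorizer.transform(X)   # HashingVectorizer is stateless
Y_train = mlb.fit_transform(Y)      # dense 0/1 indicator matrix

# Hypothetical helper: one independent binary classifier per label.
estimators = [MultinomialNB(alpha=0.01) for _ in categories]

# Each batch updates every per-label model; the binary classes are always 0 and 1.
for j, est in enumerate(estimators):
    est.partial_fit(X_train, Y_train[:, j], classes=[0, 1])

# Stack the per-label predictions back into an indicator matrix.
pred = np.column_stack([est.predict(X_train) for est in estimators])
print(pred.shape)
```

For data that does not fit in memory, the partial_fit loop would be repeated once per batch, and mlb.inverse_transform(pred) maps predictions back to label names.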