scikit-learn: Error in using multi-label classification in partial_fit() in OvR
- StackOverflow Question: https://stackoverflow.com/questions/42280439/multi-label-out-of-core-learning-for-text-data-valueerror-on-partial-fit
Description
When calling partial_fit() on a OneVsRestClassifier() with multilabel targets, a ValueError is raised. When using fit() on the same data, no errors are raised and everything works.
Steps/Code to Reproduce
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
import numpy as np
categories = ['a','b','c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b', 'c'],['a', 'b']]
mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)
clf = OneVsRestClassifier(MultinomialNB(alpha=0.01))
X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)
- Case1
clf.partial_fit(X_train, Y_train, categories)
- Case2
clf.partial_fit(X_train, Y_train, mlb.transform(Y))
Description of code
- Case1: Using classes=categories without transforming
partial_fit(X_train, Y_train, classes=categories)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
- Case2: Using classes=mlb.transform(categories), i.e. after transforming with the same MultiLabelBinarizer
partial_fit(X_train, Y_train, classes=mlb.transform(categories))
ValueError: The object was not fitted with multilabel input.
Expected Results
No error is thrown as when using fit().
Actual Results
- Case1
Traceback (most recent call last):
  File "/path_to_module/Check.py", line 18, in <module>
    clf.partial_fit(X_train, Y_train, categories)
  File "/library/python2.7/dist-packages/sklearn/multiclass.py", line 260, in partial_fit
    if np.setdiff1d(y, self.classes_):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
- Case2
Traceback (most recent call last):
  File "/path_to_module/Check.py", line 18, in <module>
    clf.partial_fit(X_train, Y_train, mlb.transform(Y))
  File "/library/python2.7/dist-packages/sklearn/multiclass.py", line 265, in partial_fit
    Y = self.label_binarizer_.transform(y)
  File "/library/python2.7/dist-packages/sklearn/preprocessing/label.py", line 329, in transform
    raise ValueError("The object was not fitted with multilabel"
ValueError: The object was not fitted with multilabel input.
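The Case1 message can be reproduced outside scikit-learn. The sketch below uses toy label arrays (not the exact values from the report) to show that np.setdiff1d() returns an array, and that putting an array with more than one element directly in an `if` raises the same ValueError. The helper name has_unseen_labels is hypothetical and only illustrates an explicit size test; it is not the check scikit-learn uses.

```python
import numpy as np


def has_unseen_labels(y, classes):
    """Return True if y contains any value that is missing from classes."""
    unseen = np.setdiff1d(y, classes)  # sorted unique values of y not in classes
    return unseen.size > 0             # explicit size test instead of `if unseen:`


# Toy data (assumption): two of the labels in y are not declared in classes.
y = np.array(["a", "b", "d", "e"])
classes = np.array(["a", "b", "c"])

unseen = np.setdiff1d(y, classes)

try:
    if unseen:  # same pattern as the failing line in the Case1 traceback
        pass
    ambiguous = False
except ValueError:
    # "The truth value of an array with more than one element is ambiguous."
    ambiguous = True
```

With a size test, the check gives a plain boolean regardless of how many values are missing.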
Observation
- In Case1, the error occurs because https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/multiclass.py#L260 uses the array returned by np.setdiff1d() as an if condition, which expects a single boolean value; an array with more than one element has no unambiguous truth value, hence the error. But I couldn't find what to do about it, or how to pass Y or classes into it.
- In Case2, the error occurs because partial_fit() calls the check_partial_fit_first_call() function, which sets clf.classes_ in a different way, using unique_labels(), as seen here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/multiclass.py#L308. These unique labels are passed to clf.label_binarizer_ in this line: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/multiclass.py#L258, which leads it to assume the target type is 'multiclass' whereas the actual target type is multilabel, hence this error. The fit() method handles classes_ in a different way (it doesn't use unique_labels()), and hence everything works correctly.
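The Case2 mechanism can be seen in isolation with type_of_target() and LabelBinarizer. This is a minimal sketch, assuming the internal binarizer is fitted on the flat class list as described above; it does not go through OneVsRestClassifier itself, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils.multiclass import type_of_target

classes = ["a", "b", "c"]
# Indicator matrix matching Y = [['a', 'b'], ['b', 'c'], ['a', 'b']] from the report.
Y_indicator = np.array([[1, 1, 0],
                        [0, 1, 1],
                        [1, 1, 0]])

kind_of_classes = type_of_target(classes)      # flat list of three labels
kind_of_targets = type_of_target(Y_indicator)  # 2-D 0/1 matrix

# Fitting on the flat class list makes the binarizer remember a multiclass target.
binarizer = LabelBinarizer().fit(classes)

try:
    binarizer.transform(Y_indicator)
    raised = False
except ValueError:
    # "The object was not fitted with multilabel input."
    raised = True
```

The two target types differ ('multiclass' for the class list, 'multilabel-indicator' for the matrix), which is why the later transform() call rejects the indicator matrix.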
Versions
Linux-3.16.0-77-generic-x86_64-with-Ubuntu-14.04-trusty
('Python', '2.7.6 (default, Oct 26 2016, 20:30:19) \n[GCC 4.8.4]')
('NumPy', '1.12.0')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.1')
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 6
- Comments: 18 (12 by maintainers)
Currently trying to incrementally train a multi-label dataset that’s too big to fit in memory. Would be awesome if this started working! 😃
Hello, is there anyone following this issue? I think the problem is in line 259 of multiclass.py: self.classes_ is the array of all possible classes passed in through the OneVsRestClassifier's partial_fit call. Directly feeding this to the label binarizer will make the binarizer think that it is a multi-class problem.
I think a solution is, when we detect that y is in indicator format and is thus a multilabel task, to create a temporary vector of length len(self.classes_) with all ones and feed it to the label binarizer. In that case it will correctly recognize that the task is multilabel.
My fix is available on my repository: https://github.com/albertauyeung/scikit-learn/tree/fix-ovr-multilabel-partial-fit
Thanks.
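The idea in the comment above can be tried on LabelBinarizer directly. This is only a sketch of the idea, not the code from the linked branch: it assumes the "temporary vector" is a single indicator row of shape (1, n_classes), and it checks nothing about OneVsRestClassifier.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

classes = ["a", "b", "c"]
Y_indicator = np.array([[1, 1, 0],
                        [0, 1, 1],
                        [1, 1, 0]])

# Assumption: one all-ones row with one column per class stands in for
# "a temporary vector of length len(self.classes_) with all ones".
placeholder = np.ones((1, len(classes)), dtype=int)

binarizer = LabelBinarizer().fit(placeholder)
fitted_type = binarizer.y_type_  # target type remembered by the binarizer

# The binarizer now accepts indicator matrices with the same number of columns.
Y_out = binarizer.transform(Y_indicator)
```

One consequence of this approach is that binarizer.classes_ holds column indices rather than the original label names, so the mapping from columns to names has to be kept separately.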
Hello,
I added self.y_type_ = 'multilabel-indicator' before line 329 of label.py and it worked for me 😃. I am not sure whether this affects other computations, though.
Thanks!
@dokato Yes, I agree with you that Case1 is just a misuse of partial_fit(). But I tried various combinations such as classes=Y_train, a transformed Y_train (as in Case2), etc. The target is still detected as multi-class, and the same error as in Case2 is thrown.
@jnothman fit() automatically infers the type of target as 'multilabel' from Y_train. But for the first call to partial_fit(), we need to pass in all classes, and type_of_target is inferred from those, so it is wrongly detected as multiclass. I could not find a way to supply classes to the partial_fit() method such that the target is detected as multilabel. This is the issue. There may be some misunderstanding of the API on my side. Is there any safe way to pass this into partial_fit()?
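Until partial_fit() accepts multilabel targets, one possible workaround is to skip OneVsRestClassifier and keep one binary model per label by hand. The sketch below is a swapped-in technique, not the fix discussed in this thread: the function names partial_fit_batch and predict_labels are hypothetical, and alternate_sign=False is assumed as the replacement for non_negative=True on scikit-learn releases newer than the 0.18.1 used in the report.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

categories = ["a", "b", "c"]
mlb = MultiLabelBinarizer(classes=categories)
# alternate_sign=False keeps hashed features non-negative for MultinomialNB.
vectorizer = HashingVectorizer(decode_error="ignore", n_features=2 ** 18,
                               alternate_sign=False)

# One independent binary classifier per label (manual one-vs-rest).
models = [MultinomialNB(alpha=0.01) for _ in categories]


def partial_fit_batch(texts, labels):
    """Update every per-label model with one mini-batch."""
    X = vectorizer.transform(texts)
    Y = mlb.fit_transform(labels)  # indicator matrix, one column per category
    for j, model in enumerate(models):
        # Each column is a plain binary target, so classes is always [0, 1].
        model.partial_fit(X, Y[:, j], classes=[0, 1])


def predict_labels(texts):
    """Return an indicator matrix of predicted labels."""
    X = vectorizer.transform(texts)
    return np.column_stack([model.predict(X) for model in models])


X_text = ["This is a test", "This is another attempt", "And this is a test too!"]
Y_labels = [["a", "b"], ["b", "c"], ["a", "b"]]

partial_fit_batch(X_text, Y_labels)  # first mini-batch
partial_fit_batch(X_text, Y_labels)  # later batches reuse the same call
predictions = predict_labels(X_text)
```

The predicted indicator matrix can be turned back into label sets with mlb.inverse_transform(predictions).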