scikit-learn: ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue]

To reproduce:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)   # error when calling fit
# clf = OneVsRestClassifier(estimator=SVC(), n_jobs=1)  # no error if n_jobs=1

pipeLine = Pipeline([('tfidf', TfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)

predictValue = pipeLine.predict(evalx)

print(classification_report(evaly, predictValue))

Output:

ValueError: UPDATEIFCOPY base is read-only

Linux-4.2.0-19-generic-x86_64-with-Ubuntu-15.10-wily
Python: 2.7.10 (default, Oct 14 2015, 16:09:02) [GCC 5.2.1 20151010]
NumPy: 1.10.4
SciPy: 0.17.0
Scikit-Learn: 0.18.dev0

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

A more proper fix would be to modify sklearn.svm.base.BaseLibSVM._sparse_fit so that it doesn’t modify X in place.
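A minimal sketch of that idea, assuming a CSR input (the helper name is hypothetical, not the actual patch):

import scipy.sparse as sp

def _fit_on_sorted_copy(X):
    # Hypothetical helper: if the indices are unsorted, sort them on a
    # private copy instead of mutating the caller's (possibly read-only,
    # memmapped) matrix in place.
    if sp.issparse(X) and not X.has_sorted_indices:
        X = X.copy()
        X.sort_indices()
    return X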

Looking at this problem more closely, I found a workaround in case it is useful:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)


class MyTfidfVectorizer(TfidfVectorizer):
    def fit_transform(self, X, y):
        result = super(MyTfidfVectorizer, self).fit_transform(X, y)
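        # Sorting the CSR indices here means downstream code (the SVM
        # fit running in read-only memmapped workers) never has to sort
        # them in place.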
        result.sort_indices()
        return result

pipeLine = Pipeline([('tfidf', MyTfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)

predictValue = pipeLine.predict(evalx)

print(classification_report(evaly, predictValue))

This is based on the fact that the in-place modification only happens when the output of the TfidfVectorizer doesn’t have its indices sorted.
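You can check the flag directly on the vectorizer output; a quick sketch (whether it is set may vary across versions):

from sklearn.feature_extraction.text import TfidfVectorizer

X = TfidfVectorizer().fit_transform(["spam ham", "ham eggs spam"])
print(X.has_sorted_indices)  # if falsy, the in-place sort path is hit
X.sort_indices()             # sorts in place and sets the flag
print(X.has_sorted_indices)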

Thanks @csvankhede! Yes, it’s a scipy bug as mentioned above. Adding a workaround for all code that uses sparse_array.astype in scikit-learn would probably be difficult.

Just for the record, would

x_train_multilabel.sort_indices()
clf.fit(x_train_multilabel, y_train)

also work in your case?

Another related, maybe more central, issue is https://github.com/scikit-learn/scikit-learn/issues/5481. IIRC estimators should avoid in-place modification of X, e.g. X -= X.mean(). cc @arthurmensch.
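For illustration (a minimal sketch, not scikit-learn code): with a read-only buffer the copy-based form works where the in-place one raises.

import numpy as np

X = np.arange(6, dtype=float)
X.flags.writeable = False  # simulate a read-only memmap
# X -= X.mean()            # would raise: assignment destination is read-only
X = X - X.mean()           # allocates a new array, so it always works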

More explanation: when the input data is big enough, joblib uses memmapping by default. This allows the input data to be shared across workers instead of each worker holding its own copy. See this for more details. The memmap is opened in read-only mode because of possible data corruption if different workers write into the same data.
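A small sketch of that behaviour (the exact threshold and output are assumptions that depend on the joblib version and backend):

import numpy as np
from joblib import Parallel, delayed

X = np.random.rand(2000, 1000)  # ~16 MB, above joblib's default max_nbytes

def describe(arr):
    # inside a worker, a big array arrives as a read-only memmap
    return type(arr).__name__, arr.flags.writeable

print(Parallel(n_jobs=2)(delayed(describe)(X) for _ in range(2)))
# e.g. [('memmap', False), ('memmap', False)]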

If you have access to the joblib.Parallel object, you can use max_nbytes=None to disable memmapping. Whether it will be faster than just doing n_jobs=1 depends on your particular use case, I reckon. From a few cases I looked at, it looks like you generally don’t have access to the underlying joblib.Parallel object when you create your estimator, so that doesn’t really help in most cases.
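If you do control the Parallel call yourself, disabling memmapping looks roughly like this (a sketch, not something most estimators expose):

import numpy as np
from joblib import Parallel, delayed

X = np.random.rand(2000, 1000)
# max_nbytes=None disables the memmapping threshold, so every worker
# receives an ordinary writable copy of X (at the cost of extra memory).
out = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(np.mean)(X) for _ in range(4))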

I got a similar issue.

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LogisticRegression(penalty='l2'), n_jobs=-1)
clf.fit(x_train_multilabel, y_train)
pred_y = clf.predict(x_test_multilabel)

When I changed the code as below, it worked:

clf.fit(x_train_multilabel.copy(), y_train)

Thanks for the lucid explanation, Loic!