scikit-learn: ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue]

To reproduce:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)   # error when calling fit
# clf = OneVsRestClassifier(estimator=SVC(), n_jobs=1)  # no error if n_jobs=1

pipeLine = Pipeline([('tfidf', TfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)

predictValue = pipeLine.predict(evalx)

print(classification_report(evaly, predictValue))

Output:

ValueError: UPDATEIFCOPY base is read-only

Linux-4.2.0-19-generic-x86_64-with-Ubuntu-15.10-wily
Python: 2.7.10 (default, Oct 14 2015, 16:09:02) [GCC 5.2.1 20151010]
NumPy: 1.10.4
SciPy: 0.17.0
Scikit-Learn: 0.18.dev0

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

A more proper fix would be to modify sklearn.svm.base.BaseLibSVM._sparse_fit so that it doesn’t modify X in place.
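A minimal sketch of that idea, assuming a CSR input (the helper name is hypothetical, not the actual patch):

import scipy.sparse as sp

def _fit_on_sorted_copy(X):
    # Hypothetical helper: if the indices are unsorted, sort them on a
    # private copy instead of mutating the caller's (possibly read-only,
    # memmapped) matrix in place.
    if sp.issparse(X) and not X.has_sorted_indices:
        X = X.copy()
        X.sort_indices()
    return X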

Looking at this problem more closely, I found a workaround in case it is useful:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)


class MyTfidfVectorizer(TfidfVectorizer):
    def fit_transform(self, X, y):
        result = super(MyTfidfVectorizer, self).fit_transform(X, y)
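        # Sorting the CSR indices here means downstream code (the SVM
        # fit running in read-only memmapped workers) never has to sort
        # them in place.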
        result.sort_indices()
        return result

pipeLine = Pipeline([('tfidf', MyTfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)

predictValue = pipeLine.predict(evalx)

print(classification_report(evaly, predictValue))

This is based on the fact that the in-place modification only happens when the output of the TfidfVectorizer doesn’t have its indices sorted.
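You can check the flag directly on the vectorizer output; a quick sketch (whether it is set may vary across versions):

from sklearn.feature_extraction.text import TfidfVectorizer

X = TfidfVectorizer().fit_transform(["spam ham", "ham eggs spam"])
print(X.has_sorted_indices)  # if falsy, the in-place sort path is hit
X.sort_indices()             # sorts in place and sets the flag
print(X.has_sorted_indices)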

Thanks @csvankhede! Yes, it’s a scipy bug as mentioned above. Adding a workaround for all code that uses sparse_array.astype in scikit-learn would probably be difficult.

Just for the record, would

x_train_multilabel.sort_indices()
clf.fit(x_train_multilabel, y_train)

also work in your case?

Another related, maybe more central, issue is https://github.com/scikit-learn/scikit-learn/issues/5481. IIRC estimators should avoid in-place modification of X, e.g. X -= X.mean(). cc @arthurmensch.
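For illustration (a minimal sketch, not scikit-learn code): with a read-only buffer the copy-based form works where the in-place one raises.

import numpy as np

X = np.arange(6, dtype=float)
X.flags.writeable = False  # simulate a read-only memmap
# X -= X.mean()            # would raise: assignment destination is read-only
X = X - X.mean()           # allocates a new array, so it always works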

More explanation: when the input data is big enough, joblib uses memmapping by default. This allows the input data to be shared across workers instead of each worker holding its own copy. See this for more details. The memmap is opened in read-only mode because of possible data corruption if different workers write into the same data.
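A small sketch of that behaviour (the exact threshold and output are assumptions that depend on the joblib version and backend):

import numpy as np
from joblib import Parallel, delayed

X = np.random.rand(2000, 1000)  # ~16 MB, above joblib's default max_nbytes

def describe(arr):
    # inside a worker, a big array arrives as a read-only memmap
    return type(arr).__name__, arr.flags.writeable

print(Parallel(n_jobs=2)(delayed(describe)(X) for _ in range(2)))
# e.g. [('memmap', False), ('memmap', False)]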

If you have access to the joblib.Parallel object, you can use max_nbytes=None to disable memmapping. Whether it will be faster than just doing n_jobs=1 depends on your particular use case, I reckon. From a few cases I looked at, it looks like you generally don’t have access to the underlying joblib.Parallel object when you create your estimator, so that doesn’t really help in most cases.
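If you do control the Parallel call yourself, disabling memmapping looks roughly like this (a sketch, not something most estimators expose):

import numpy as np
from joblib import Parallel, delayed

X = np.random.rand(2000, 1000)
# max_nbytes=None disables the memmapping threshold, so every worker
# receives an ordinary writable copy of X (at the cost of extra memory).
out = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(np.mean)(X) for _ in range(4))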

I got a similar issue.

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LogisticRegression(penalty='l2'), n_jobs=-1)
clf.fit(x_train_multilabel, y_train)
pred_y = clf.predict(x_test_multilabel)

When I changed the code as below, it worked:

clf.fit(x_train_multilabel.copy(), y_train)

Thanks for the lucid explanation, Loic!