scikit-learn: ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue]
To reproduce:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import fetch_20newsgroups
data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')
clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)  # Error when calling fit
# clf = OneVsRestClassifier(estimator=SVC(), n_jobs=1)  # NO error if n_jobs=1
pipeLine = Pipeline([('tfidf', TfidfVectorizer(min_df=10)),
                     ('clf', clf)])
trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)
predictValue = pipeLine.predict(evalx)
print classification_report(evaly, predictValue)
Output:
ValueError: UPDATEIFCOPY base is read-only
Linux-4.2.0-19-generic-x86_64-with-Ubuntu-15.10-wily
Python 2.7.10 (default, Oct 14 2015, 16:09:02) [GCC 5.2.1 20151010]
NumPy 1.10.4
SciPy 0.17.0
Scikit-Learn 0.18.dev0
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 19 (12 by maintainers)
Commits related to this issue
- BUG: Found workaround for first bug at https://github.com/scikit-learn/scikit-learn/issues/6614. Now trying to find cause of difference in shape of train and test features. — committed to thornhale/fake_news by thornhale 7 years ago
- Set number of jobs for SMOTE and ADASYN to 1 Because of ValueError: WRITEBACKIFCOPY base is read-only Bug: https://github.com/scikit-learn/scikit-learn/issues/6614 Solution: none. — committed to StefanoFrazzetto/CrimeDetector by StefanoFrazzetto 5 years ago
- Avoid ValueError in parallel computing of large arrays This PR introduces the optional *max_nbytes* parameter on *OneVsRestClassifier*, *OneVsOneClassifier* and *OutputCodeClassifier* multiclass learni... — committed to Ircama/scikit-learn by Ircama 5 years ago
- Allowing optional list of Parallel keyworded parameters Changing *OneVsRestClassifier*, *OneVsOneClassifier* and *OutputCodeClassifier* multiclass learning algorithms within multiclass.py, by replacing... — committed to Ircama/scikit-learn by Ircama 5 years ago
A more proper fix would be to modify sklearn.svm.base.BaseLibSVM._sparse_fit so that it doesn't modify X in place.

Looking at this problem more closely, I found a workaround in case it is useful:
This is based on the fact that the in-place modification only happens when the output of the TfidfVectorizer doesn’t have its indices sorted.
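The thread does not preserve the original workaround snippet, but a minimal sketch of the idea (the `SortIndices` transformer name and placement are mine, not from the issue) would pre-sort the CSR indices before the matrix reaches the parallel workers, so no in-place sort is needed later on a read-only memmap:

```python
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin

class SortIndices(BaseEstimator, TransformerMixin):
    """Hypothetical pipeline step: ensure a CSR matrix has sorted
    indices before joblib memmaps it read-only for the workers."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not X.has_sorted_indices:
            X = X.copy()        # don't mutate the caller's matrix
            X.sort_indices()    # in-place sort, but on our own copy
        return X
```

Inserted between the 'tfidf' and 'clf' steps of the pipeline above, this should make the later in-place index sort inside the SVC fit unnecessary.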
Thanks @csvankhede! Yes, it's a scipy bug, as mentioned above. Adding a workaround for all code that uses sparse_array.astype in scikit-learn would probably be difficult.

Just for the record, would … also work in your case?
Another related, maybe more central, issue is https://github.com/scikit-learn/scikit-learn/issues/5481. IIRC estimators should avoid in-place modification of X, e.g. X -= X.mean(). cc @arthurmensch.

More explanation: when the input data is big enough, joblib uses memmapping by default. This allows the workers to share the input data instead of each worker holding its own copy. See this for more details. The memmap is opened in read-only mode because of possible data corruption if different workers write into the same data.
If you have access to the joblib.Parallel object, you can pass max_nbytes=None to disable memmapping. Whether that will be faster than just setting n_jobs=1 depends on your particular use case, I reckon. From the few cases I looked at, it looks like you generally don't have access to the underlying joblib.Parallel object when you create your estimator, so that doesn't really help in most cases.

I got a similar issue.
clf = OneVsRestClassifier(LogisticRegression(penalty='l2'), n_jobs=-1)
clf.fit(x_train_multilabel, y_train)
pred_y = clf.predict(x_test_multilabel)
Here, when I changed the fit call as below, it worked:
clf.fit(x_train_multilabel.copy(), y_train)
Thanks for the lucid explanation, Loic!