scikit-learn: Possible bug when combining SVC + class_weight='balanced' + LeaveOneOut
This piece of code yields perfect classification accuracy for random data:
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC
scores = cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08),
                         np.random.rand(79, 100),
                         y=np.hstack((np.ones(20), np.zeros(59))),
                         cv=LeaveOneOut())
print(scores)
The problem disappears when using class_weight=None or a different CV strategy.
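For reference, a minimal sketch of that comparison (same random data as above; StratifiedKFold is used here only as one example of a different CV strategy, so that part is an assumption, not the original code):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut, StratifiedKFold
from sklearn.svm import SVC

X = np.random.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

for cw in ('balanced', None):
    for cv in (LeaveOneOut(), StratifiedKFold(n_splits=5)):
        scores = cross_val_score(
            SVC(kernel='linear', class_weight=cw, C=1e-08), X, y, cv=cv)
        # Per the report above, only the balanced + LeaveOneOut combination
        # yields the spuriously perfect scores on this random data.
        print(cw, type(cv).__name__, scores.mean())
```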
Is it a bug or am I missing something?
Tested with version 0.19.1 of scikit-learn on Ubuntu Linux.
About this issue
- State: open
- Created 7 years ago
- Comments: 23 (13 by maintainers)
I’m not aware of class weighting procedures other than ‘balanced’, which of course does not mean they don’t exist. In my opinion, the way this is handled with the ‘balanced’ option in sklearn is exemplary, precisely because the weights are computed on the training data only.
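To make the "computed on the training data only" point concrete, here is a small illustrative sketch (not from the issue) of the formula n_samples / (n_classes * per-class counts) that class_weight='balanced' implements, evaluated on the full data from the example above and on one LeaveOneOut training fold:

```python
import numpy as np

def balanced_weights(y):
    # Per-class weights as computed by class_weight='balanced':
    # n_samples / (n_classes * per-class counts).
    counts = np.bincount(y.astype(int))
    return len(y) / (len(counts) * counts)

y = np.hstack((np.ones(20), np.zeros(59)))
print(balanced_weights(y))        # weights [class 0, class 1] on the full data

# One LeaveOneOut training fold where a class-1 sample was left out:
# class 1 is now slightly under-represented in the fold, so it gets a
# slightly larger weight than it would on the full data.
y_train = np.hstack((np.ones(19), np.zeros(59)))
print(balanced_weights(y_train))
```

That per-fold shift in the weights is the interaction with LeaveOneOut being discussed in this issue.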
I would say that I came across the problem relatively organically. To elaborate, I was using GridSearchCV on the C parameter of SVC, and after setting class_weight='balanced' I suddenly got amazing accuracies on a real-world data set (i.e., not artificial/random data). I then realized that GridSearchCV was selecting very low values of C, i.e. extremely strong regularization, which at first was even weirder.

Based on this experience I'm inclined to recommend inclusion of your patch, because I'm sure many people will not investigate further when accuracies are good and 'publishable'. The effect of changing class weights on the order of 1e-8 should be negligible in almost all cases, and if not, it's likely because of this very issue. I see the trade-off with exact backwards compatibility, though.
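A rough sketch of the kind of setup described in this comment; the random data stands in for the real-world data set, and the particular C grid and the use of LeaveOneOut inside GridSearchCV are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

X = np.random.rand(79, 100)   # stand-in for the real-world data set
y = np.hstack((np.ones(20), np.zeros(59)))

search = GridSearchCV(
    SVC(kernel='linear', class_weight='balanced'),
    param_grid={'C': np.logspace(-8, 2, 11)},
    cv=LeaveOneOut(),
)
search.fit(X, y)
# With the behaviour reported in this issue, the search tends to favour the
# tiny C values, since they produce (spuriously) perfect leave-one-out scores
# when class_weight='balanced'.
print(search.best_params_, search.best_score_)
```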