scikit-learn: Possible bug when combining SVC + class_weight='balanced' + LeaveOneOut

This piece of code yields perfect classification accuracy for random data:

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

# Tiny C, balanced class weights, leave-one-out CV on 79 random
# samples with 100 features and imbalanced labels (20 vs. 59)
scores = cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08), 
                         np.random.rand(79, 100), 
                         y=np.hstack((np.ones(20), np.zeros(59))), 
                         cv=LeaveOneOut())
print(scores)  # all 1.0 -- perfect accuracy on pure noise

The problem disappears when using class_weight=None or a different cross-validation scheme.
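For reference, a minimal sketch of both control conditions (StratifiedKFold stands in here as one example of "a different CV"; exact scores will vary from run to run, but neither should look perfect):

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut, StratifiedKFold
from sklearn.svm import SVC

X = np.random.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

# Without class weighting, LOO falls back to roughly the majority-class baseline
print(cross_val_score(SVC(kernel='linear', C=1e-08), X, y, cv=LeaveOneOut()).mean())

# With a different CV scheme, the suspiciously perfect scores also vanish
print(cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08),
                      X, y, cv=StratifiedKFold(n_splits=5)).mean())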

Is it a bug or am I missing something?

Tested with version 0.19.1 of scikit-learn on Ubuntu Linux.

About this issue

  • State: open
  • Created 7 years ago
  • Comments: 23 (13 by maintainers)

Most upvoted comments

I’m not aware of class weighting procedures other than ‘balanced’, though of course that doesn’t mean they don’t exist. In my opinion, sklearn’s handling of the ‘balanced’ option is exemplary, precisely because the weights are computed on the training data only.
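For concreteness, the ‘balanced’ heuristic weights each class inversely proportionally to its frequency in the training data, i.e. n_samples / (n_classes * np.bincount(y)), which can be checked with sklearn’s own helper on the labels from the original snippet:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.hstack((np.ones(20), np.zeros(59)))

# 'balanced' weight per class: n_samples / (n_classes * count(class))
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
print(weights)  # [79/(2*59), 79/(2*20)] = [0.669..., 1.975]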

I would say that I came across the problem relatively organically. To elaborate: I was using GridSearchCV on the C parameter of SVC, and after setting class_weight='balanced' I suddenly got amazing accuracies on a real-world data set (i.e., not artificial/random data). I then realized that GridSearchCV was selecting very low values of C, i.e. regularization so strong that the fit barely depends on the data at all, which at first seemed even weirder.
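A rough reconstruction of that scenario, using the random data from the original report rather than the real-world set (the grid values are illustrative, not the original ones):

import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

X = np.random.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

# Hypothetical grid: with LOO + 'balanced', the tiniest C wins with a "perfect" score
search = GridSearchCV(SVC(kernel='linear', class_weight='balanced'),
                      param_grid={'C': np.logspace(-8, 2, 11)},
                      cv=LeaveOneOut())
search.fit(X, y)
print(search.best_params_, search.best_score_)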

Based on this experience I’m inclined to recommend inclusion of your patch, because I’m sure many people will not investigate further when accuracies are good and ‘publishable’. The effect of changing class weights in the order of 1e-8 should be negligible in almost all cases, and if not, it’s likely because of this very issue. I see the trade-off with exact backwards compatibility though.