scikit-learn: Unavoidable "y_true and y_pred contain different number of classes" error inside a CV loop

Description

During cross-validation on a multi-class problem, a fold's test split can contain classes that do not appear in that fold's training split (and vice versa), so the number of columns returned by predict_proba can disagree with the number of classes log_loss infers from y_true.

Steps/Code to Reproduce

import numpy as np
from sklearn.metrics import make_scorer, log_loss
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

rs = np.random.RandomState(1389057)

y = [
    'cow',
    'hedgehog',
    'fox',
    'fox',
    'hedgehog',
    'fox',
    'hedgehog',
    'cow',
    'cow',
    'fox'
]

x = rs.normal([0, 0], [1, 1], size=(len(y), 2))

model = BernoulliNB()

cv = StratifiedKFold(4, shuffle=True, random_state=rs)

param_dist = {
    'alpha': np.logspace(-1, 0, 20)  # 20 values log-spaced between 0.1 and 1
}

search = RandomizedSearchCV(model, param_dist, n_iter=5,
                            scoring=make_scorer(log_loss, needs_proba=True,
                                                greater_is_better=False),
                            cv=cv)

search.fit(x, y)

Expected Results

Either:

  1. Predicted classes from predict_proba are aligned with classes in the full training data, not just the in-fold subset.
  2. Classes not in the training data are ignored in the test data.

Actual Results

Predicted probabilities from predict_proba are aligned with the classes seen in the in-fold training subset only, while log_loss infers its class list from the test fold’s y_true; whenever the two sets differ in size, the “different number of classes” error is raised.

I understand that this is normatively “correct” behavior, but it makes it hard/impossible to use in cross-validation with the existing APIs.
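The mismatch can be reproduced in isolation; this is a hypothetical fold (toy probabilities, not from the repro above) where the fitted model knows all three classes but the test split happens to miss one:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical fold: the model was fitted on all three classes
# (columns: cow, fox, hedgehog), but this test split has no 'hedgehog'.
y_test = ['cow', 'fox', 'cow']
y_probs = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.7, 0.1],
                    [0.6, 0.2, 0.2]])

try:
    # log_loss infers only two classes from y_test, but y_probs has
    # three columns, so the counts disagree and a ValueError is raised.
    log_loss(y_test, y_probs)
except ValueError as exc:
    print(exc)
```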

From my perspective, the best solution would be to have RandomizedSearchCV pass a labels=self.classes_ argument to its scorer. I’m not sure how well that generalizes.
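In the meantime, the labels keyword of log_loss already covers the case where a test fold is merely missing a class: supplying the full class list (here via np.unique on the full y, which assumes the whole target vector is available up front) lets a narrower y_true score cleanly. A sketch, with hypothetical fold data:

```python
import numpy as np
from sklearn.metrics import log_loss

y = ['cow', 'hedgehog', 'fox', 'fox', 'hedgehog',
     'fox', 'hedgehog', 'cow', 'cow', 'fox']
all_classes = np.unique(y)  # array(['cow', 'fox', 'hedgehog'])

# Hypothetical test fold with no 'hedgehog'; probability columns
# follow the sorted class order above.
y_test = ['fox', 'cow', 'fox']
y_probs = np.array([[0.1, 0.7, 0.2],
                    [0.6, 0.3, 0.1],
                    [0.2, 0.6, 0.2]])

# With labels= supplied, the two-class y_test no longer trips the check.
loss = log_loss(y_test, y_probs, labels=all_classes)
```

The same keyword can be forwarded through make_scorer, since extra keyword arguments to make_scorer are passed on to the metric.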

Versions

Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-redhat-7.3-Maipo
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.1

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 1
  • Comments: 15 (6 by maintainers)

Most upvoted comments

What about scoring=make_scorer(log_loss, needs_proba=True, labels=y)?

Use the model.classes_ attribute:

y_probs = model.predict_proba(X_test)
log_loss(y_test, y_probs, labels=model.classes_)

The warning I’m getting is:

sklearn/model_selection/_split.py:605: Warning: The least populated class in y has only 3 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4

which makes sense. You cannot have a stratified CV if the number of folds is greater than the number of instances from a given class: that would mean one or more of the folds would have no instance of that class.

Setting n_splits=3 removes the warning.
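For the repro data above (cow: 3, fox: 4, hedgehog: 3), three splits are the most that stratification can support; a quick sanity check that every fold then sees every class on both sides of the split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array(['cow', 'hedgehog', 'fox', 'fox', 'hedgehog',
              'fox', 'hedgehog', 'cow', 'cow', 'fox'])
X = np.zeros((len(y), 2))  # features don't matter for splitting

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # With n_splits no larger than the smallest class count,
    # stratification keeps each class in both sides of every split.
    assert set(y[train_idx]) == {'cow', 'fox', 'hedgehog'}
    assert set(y[test_idx]) == {'cow', 'fox', 'hedgehog'}
```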