scikit-learn: metrics.log_loss fails when any classes are missing in y_true

When calling log_loss with a label array (i.e. not an indicator matrix) for y_true, it uses a LabelBinarizer to construct the indicator matrix.

If not every class represented in y_pred is present in y_true, the resulting indicator matrix has the wrong shape, and log_loss raises ValueError: y_true and y_pred have different number of classes

Perhaps I don’t understand the intended use of this mode, but it seems like it would be more robust to infer the indicator matrix from the shape of y_pred. As it stands, things appear to work until a batch of samples happens to not include every class, which makes this brittle.
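A minimal reproduction of the reported behavior (the class labels and probabilities here are made up for illustration; class 2 never appears in y_true, while y_pred has three columns):

```python
from sklearn.metrics import log_loss

# Three predicted classes, but y_true only ever contains classes 0 and 1.
y_true = [0, 0, 1]
y_pred = [[0.7, 0.2, 0.1],
          [0.6, 0.3, 0.1],
          [0.2, 0.7, 0.1]]

try:
    log_loss(y_true, y_pred)
except ValueError as exc:
    # The LabelBinarizer built from y_true produces a 2-column indicator
    # matrix, which does not match the 3-column y_pred.
    print("ValueError:", exc)
```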

About this issue

  • State: closed
  • Created 10 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

Has this been solved? I'm also running into this problem.

I agree, this seems brittle. One problem is that it is not clear which class is missing: if your classes are ["elephant", "tiger", "rhino"] but your y_true only contains ["elephant", "tiger"], inference from y_true alone cannot work.

The solution we use in other places is a classes parameter for the loss where you can provide all known classes.
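In scikit-learn this ended up as the labels parameter of log_loss (the 0.18.1 error message quoted below refers to it); a sketch with made-up data:

```python
from sklearn.metrics import log_loss

y_true = [0, 0, 1]                  # class 2 is missing from y_true
y_pred = [[0.7, 0.2, 0.1],
          [0.6, 0.3, 0.1],
          [0.2, 0.7, 0.1]]

# Passing the full set of known classes resolves the shape mismatch.
loss = log_loss(y_true, y_pred, labels=[0, 1, 2])
print(loss)
```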

No, the logs are calculated on the predicted probabilities, the bug here was when a certain class was missing in the true labels.

I am using 0.18.1 and get an error:

ValueError: y_true contains only one label (2). Please provide the true labels explicitly through the labels argument.

#Visualize Log Loss when True value = 1
#y-axis is log loss, x-axis is probabilty that label = 1
#As you can see Log Loss increases rapidly as we approach 0
#But increases slowly as our predicted probability gets closer to 1
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import log_loss

x = [i * .0001 for i in range(1, 10000)]
# labels must be given explicitly because y_true contains only one class;
# the second column is the predicted probability that the label is 1.
y = [log_loss(y_true=[1], y_pred=[[1 - p, p]], labels=[0, 1]) for p in x]

plt.plot(x, y)
plt.axis([-.05, 1.1, -.8, 10])
plt.title("Log Loss when true label = 1")
plt.xlabel("predicted probability")
plt.ylabel("log loss")

plt.show()

The reason for the error has to do with the math. If there are no members of a certain class present, then its probability is zero. A perfect example would be a situation where both sets are all zero. Log loss depends on the logarithm of the probability, and log(0) is undefined, so a probability of zero leads to an error.

One way to deal with this problem is a try/except wrapper:

def log_loss1(a, b):
    try:
        k = log_loss(a, b)
    except ValueError:
        k = np.nan
    return k