scikit-learn: metrics.log_loss fails when any classes are missing in y_true

When calling log_loss with a label array (i.e. not an indicator matrix) for y_true, it uses a LabelBinarizer to construct the indicator matrix.

If not every class represented in y_pred is present in y_true, the resulting indicator matrix has the wrong shape, and log_loss raises ValueError: y_true and y_pred have different number of classes

Perhaps I don’t understand the intended use of this mode, but it seems like it would be more robust to infer the indicator matrix from the shape of y_pred. As it stands, things appear to work until a batch of samples happens to not include every class, which makes this brittle.
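A minimal reproduction of the reported behavior (the class labels and probabilities here are made up for illustration; class 2 never appears in y_true, while y_pred has three columns):

```python
from sklearn.metrics import log_loss

# Three predicted classes, but y_true only ever contains classes 0 and 1.
y_true = [0, 0, 1]
y_pred = [[0.7, 0.2, 0.1],
          [0.6, 0.3, 0.1],
          [0.2, 0.7, 0.1]]

try:
    log_loss(y_true, y_pred)
except ValueError as exc:
    # The LabelBinarizer built from y_true produces a 2-column indicator
    # matrix, which does not match the 3-column y_pred.
    print("ValueError:", exc)
```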

About this issue

  • State: closed
  • Created 10 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

Has this been solved? I'm also running into this problem.

I agree, this seems brittle. One problem is that it is not clear which class is missing: if your classes are ["elephant", "tiger", "rhino"] but your y_true only contains ["elephant", "tiger"], inference from y_true alone cannot work.

The solution we use in other places is a classes parameter for the loss where you can provide all known classes.
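In scikit-learn this ended up as the labels parameter of log_loss (the 0.18.1 error message quoted below refers to it); a sketch with made-up data:

```python
from sklearn.metrics import log_loss

y_true = [0, 0, 1]                  # class 2 is missing from y_true
y_pred = [[0.7, 0.2, 0.1],
          [0.6, 0.3, 0.1],
          [0.2, 0.7, 0.1]]

# Passing the full set of known classes resolves the shape mismatch.
loss = log_loss(y_true, y_pred, labels=[0, 1, 2])
print(loss)
```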

No, the logs are calculated on the predicted probabilities, the bug here was when a certain class was missing in the true labels.

I am using 0.18.1 and get an error:

ValueError: y_true contains only one label (2). Please provide the true labels explicitly through the labels argument.

#Visualize Log Loss when True value = 1
#y-axis is log loss, x-axis is probabilty that label = 1
#As you can see Log Loss increases rapidly as we approach 0
#But increases slowly as our predicted probability gets closer to 1
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import log_loss

x = [i * .0001 for i in range(1, 10000)]
# labels must be given explicitly because y_true contains only one class;
# the second column is the predicted probability that the label is 1.
y = [log_loss(y_true=[1], y_pred=[[1 - p, p]], labels=[0, 1]) for p in x]

plt.plot(x, y)
plt.axis([-.05, 1.1, -.8, 10])
plt.title("Log Loss when true label = 1")
plt.xlabel("predicted probability")
plt.ylabel("log loss")

plt.show()

The reason for the error has to do with the math. If there are no members of a certain class present, then its probability is zero. A perfect example would be a situation where both sets are all zero. Log loss depends on the logarithm of the probability, and log(0) is undefined, so a probability of zero leads to an error.

One way to deal with this problem is a try/except wrapper:

def log_loss1(a, b):
    try:
        k = log_loss(a, b)
    except ValueError:
        k = np.nan
    return k