scikit-learn: metrics.log_loss fails when any classes are missing in y_true
When calling log_loss with a label array (i.e. not an indicator matrix) for y_true, it uses a LabelBinarizer to construct the indicator matrix.
If not all classes in y_pred are present in y_true, this has the wrong shape, and it raises
ValueError: y_true and y_pred have different number of classes
Perhaps I don’t understand the intended use of this mode, but it seems like it would work better to infer the indicator matrix from the shape of y_pred. As it stands, this is brittle: things appear to work until a batch of samples happens to not include every class.
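A minimal reproduction of the failure (the `sklearn` identifiers are real; the animal data is made up for illustration): `y_pred` has one column per known class, but only two of the three classes occur in `y_true`, so the internal `LabelBinarizer` builds a two-column indicator matrix and the shapes disagree:

```python
import numpy as np
from sklearn.metrics import log_loss

# Three known classes, but "rhino" never occurs in this batch of y_true.
y_true = ["tiger", "elephant", "tiger"]
y_pred = np.array([[0.1, 0.8, 0.1],   # one column per known class
                   [0.7, 0.2, 0.1],
                   [0.2, 0.6, 0.2]])

try:
    log_loss(y_true, y_pred)
    failed = False
except ValueError as e:
    # The indicator matrix inferred from y_true has 2 columns, y_pred has 3.
    failed = True
    print(e)
```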
About this issue
- Original URL
- State: closed
- Created 10 years ago
- Comments: 16 (6 by maintainers)
Has this been solved? I am hitting the same problem.
I agree, this seems brittle. A problem is that it is not clear which class is missing. If your classes are `["elephant", "tiger", "rhino"]` but your `y_true` only contains `["elephant", "tiger"]`, it cannot work. The solution we use in other places is a `classes` parameter for the loss, where you can provide all known classes.

No, the logarithms are calculated on the predicted probabilities; the bug here was that a certain class was missing in the true labels.
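In releases since 0.18, that `classes`-style parameter exists as the `labels` argument of `log_loss`. A sketch with the same hypothetical animal data (columns of `y_pred` are assumed to follow the sorted class order: elephant, rhino, tiger):

```python
import numpy as np
from sklearn.metrics import log_loss

# "rhino" is a known class but absent from this batch of y_true.
y_true = ["tiger", "elephant", "tiger"]
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.6, 0.2]])

# Listing every known class removes the ambiguity about which one is missing.
loss = log_loss(y_true, y_pred, labels=["elephant", "rhino", "tiger"])
print(loss)
```

This keeps the loss stable across batches, since the indicator matrix no longer depends on which classes a particular batch happens to contain.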
I am using 0.18.1 and got an error,
The reason for the error has to do with the math. If there are no members of a certain class present, then its probability is zero. A perfect example would be a situation where both sets are all zero. Log loss depends upon the logarithm of the probability, and if the probability is zero, the logarithm fails, because log(0) is undefined.
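To illustrate the numerical point: the per-sample loss −log(p) grows without bound as the predicted probability of the true class approaches zero, and at exactly zero NumPy returns infinity (with a divide-by-zero warning) rather than raising:

```python
import numpy as np

# -log(p) blows up as p -> 0.
for p in [0.5, 0.01, 1e-15]:
    print(p, -np.log(p))

# At exactly p = 0, NumPy yields inf rather than an error,
# so the overall loss becomes undefined/infinite.
with np.errstate(divide="ignore"):
    print(-np.log(0.0))
```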
One way to deal with this problem is a try/except wrapper:

```python
import numpy as np
from sklearn.metrics import log_loss

def log_loss1(a, b):
    # Return NaN instead of raising when log_loss fails
    # (e.g. when y_true and y_pred disagree on the number of classes).
    try:
        k = log_loss(a, b)
    except ValueError:
        k = np.nan
    return k
```