scikit-learn: multiclass jaccard_similarity_score should not be equal to accuracy_score
The documentation for sklearn.metrics.jaccard_similarity_score
currently (version 0.17.1) states that:
In binary and multiclass classification, this function is equivalent to the accuracy_score. It differs in the multilabel classification problem.
However, I do not think this is the right behaviour for multiclass problems. As far as I can tell, the more common usage of the Jaccard index for multiclass problems within the machine learning community is the mean of the Jaccard indices calculated for each class individually: first calculate the Jaccard index for class 0, class 1 and class 2, and then average them. This is what is very commonly done in the image segmentation community, where it is referred to as the “mean Intersection over Union” score (see e.g. [1]). As far as I can tell by skimming it, this is also what the original publication of the Jaccard index did in multiclass scenarios [2]. Note that this is NOT the same as the accuracy_score. Consider this example:
```python
y_true = [0, 1, 2]
y_pred = [0, 0, 0]
```
The accuracy is clearly 1/3, and this is also what jaccard_similarity_score in sklearn currently returns. The class-specific Jaccard scores would be:

J0 = 1/3, J1 = 0/1, J2 = 0/1

Thus, IMO, the Jaccard score should be (J0 + J1 + J2) / 3 = 1/9 in this case.
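The computation above can be sketched in plain Python (`per_class_jaccard` is a hypothetical helper written for illustration, not part of scikit-learn):

```python
def per_class_jaccard(y_true, y_pred, classes):
    """Per-class Jaccard index: |true_c & pred_c| / |true_c | pred_c|."""
    scores = []
    for c in classes:
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        # Convention: a class absent from both y_true and y_pred scores 1.0
        scores.append(inter / union if union else 1.0)
    return scores

y_true = [0, 1, 2]
y_pred = [0, 0, 0]
scores = per_class_jaccard(y_true, y_pred, classes=[0, 1, 2])
mean_jaccard = sum(scores) / len(scores)
print(scores, mean_jaccard)  # [0.333..., 0.0, 0.0] and 1/9 ≈ 0.111
```

Running this reproduces the numbers above: J0 = 1/3 (one true positive for class 0 against a union of three samples), J1 = J2 = 0, giving a mean of 1/9 rather than the accuracy of 1/3.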
[1] e.g. Long et al, “The Pascal Visual Object Classes Challenge – a Retrospective”, https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf , but see any other paper on Semantic Segmentation
[2] Jaccard, “The Distribution of the Flora in the Alpine Zone”, http://onlinelibrary.wiley.com/doi/10.1111/j.1469-8137.1912.tb05611.x/abstract (Note that I have only skimmed the paper, but it seems to me that the author always reports the average of the “coefficient of community” calculated over pairs whenever comparing more than just 2 groups)
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 2
- Comments: 18 (10 by maintainers)
Commits related to this issue
- multiclass jaccard similarity not equal to accurary_score Fixes #7332 — committed to gxyd/scikit-learn by gxyd 7 years ago
- Fix _classification.py documentation See this issue: https://github.com/scikit-learn/scikit-learn/issues/7332 — committed to hafnerfe/scikit-learn by hafnerfe 3 years ago
@TSchattschneider I feel you. I remember how frustrated I was; I had the same problem. This bug should be flagged in the documentation until it is fixed.
I agree that this seems to be strange even for the binary case. I would have thought Jaccard is an alternative to precision or recall or F1 (= Dice coefficient) in evaluating performance, in the binary case, on a single positive class, i.e. “true positives / (true positives + false positives + false negatives)”. In particular, the binary implementation in our case does not seem to equate to the multilabel implementation run over a single class.
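The binary definition described above, "true positives / (true positives + false positives + false negatives)", can be sketched as follows (`binary_jaccard` is a hypothetical helper for illustration):

```python
def binary_jaccard(y_true, y_pred):
    # Jaccard over the positive class: TP / (TP + FP + FN)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp + fn)

y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1]
j = binary_jaccard(y_true, y_pred)  # TP=2, FP=1, FN=1 -> 2/4 = 0.5
# The Dice coefficient (= F1) is a monotone transform of Jaccard:
f1 = 2 * j / (1 + j)                # 2/3
```

This makes the relationship to F1 concrete: Jaccard and Dice rank classifiers identically on a single positive class, since F1 = 2J / (1 + J).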
Regarding @untom’s initial contention that the multiclass implementation is incorrect, I agree that the multiclass implementation is useless. I don’t think that the macro averaging he suggests is the only way to go about it: as with P/R/F, micro-averaging excluding a majority negative class is still meaningful, and a weighted macro-average may also be feasible.
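The three averaging strategies mentioned above can be sketched like this (an illustrative helper, not scikit-learn code; for reference, jaccard_score in scikit-learn 0.21+ exposes similar choices via its average parameter):

```python
def jaccard_averages(y_true, y_pred, classes):
    """Macro, support-weighted and micro averaging of per-class Jaccard."""
    per_class, supports = [], []
    tp_total = union_total = 0
    for c in classes:
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        per_class.append(inter / union if union else 1.0)
        supports.append(sum(1 for t in y_true if t == c))
        tp_total += inter
        union_total += union
    macro = sum(per_class) / len(per_class)
    weighted = sum(s * j for s, j in zip(supports, per_class)) / sum(supports)
    micro = tp_total / union_total  # pools TP, FP and FN over all classes
    return macro, weighted, micro

# On the example from the issue, the choice of averaging matters:
macro, weighted, micro = jaccard_averages([0, 1, 2], [0, 0, 0], [0, 1, 2])
print(macro, weighted, micro)  # 1/9, 1/9 (uniform supports), 1/5
```

Micro-averaging here pools the per-class intersections and unions before dividing, which is why it gives 1/5 rather than the macro-averaged 1/9.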
So yes, multiple strange things in our jaccard implementation IMO, and at a glance I don’t see how the reference given in #1795 tells us about the multiclass case.
Labelling this a bug.