scikit-learn: metrics.confusion_matrix far too slow for Boolean cases
Description
When using metrics.confusion_matrix with np.bool_ inputs (i.e. only True/False values), it is far faster to avoid the list comprehensions the current code uses. NumPy's sum and Boolean logic functions handle this case very efficiently and scale far better. The code in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/_classification.py uses list comprehensions and never checks for np.bool_ inputs, where all of that excessive work could be avoided. This is a very common and reasonable use case in practice. I assume default normalization and no sample weights below, though those can also be handled efficiently.
Steps/Code to Reproduce
import numpy as np
from sklearn.metrics import confusion_matrix

N = 4096
p = 0.5
a = np.random.choice(a=[False, True], size=N, p=[p, 1 - p])
b = np.random.choice(a=[False, True], size=N, p=[p, 1 - p])

for i in range(1024):
    confusion_matrix(a, b)
Expected Results
Fast execution time, e.g. by substituting confusion_matrix with this conf_mat (not as efficient as possible, but easier to read and still much faster than the current library code, even with 4 sums, 4 logical ANDs and 4 logical NOTs):
def conf_mat(x, y):
    return np.array([[np.sum(~x & ~y), np.sum(~x & y)],   # true negatives, false positives
                     [np.sum(x & ~y), np.sum(x & y)]])    # false negatives, true positives
Or even faster, with 1 logical AND and 3 sums:
def conf_mat(x, y):
    truepos, totalpos, totaltrue = np.sum(x & y), np.sum(y), np.sum(x)
    totalfalse = len(x) - totaltrue
    falsepos = totalpos - truepos
    return np.array([[totalfalse - falsepos, falsepos],   # true negatives, false positives
                     [totaltrue - truepos, truepos]])     # false negatives, true positives
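For reference, a quick sanity check (the names conf_mat_4sum and conf_mat_3sum are mine, used here only to hold both variants side by side) that the two versions above agree with each other and that the four cells sum to the sample count:

```python
import numpy as np

def conf_mat_4sum(x, y):
    # 4 sums, 4 ANDs, 4 NOTs: one masked sum per cell
    return np.array([[np.sum(~x & ~y), np.sum(~x & y)],   # TN, FP
                     [np.sum(x & ~y), np.sum(x & y)]])    # FN, TP

def conf_mat_3sum(x, y):
    # 1 AND, 3 sums: derive the remaining cells from totals
    truepos, totalpos, totaltrue = np.sum(x & y), np.sum(y), np.sum(x)
    totalfalse = len(x) - totaltrue
    falsepos = totalpos - truepos
    return np.array([[totalfalse - falsepos, falsepos],   # TN, FP
                     [totaltrue - truepos, truepos]])     # FN, TP

rng = np.random.default_rng(0)
x = rng.random(4096) < 0.5
y = rng.random(4096) < 0.5

m4, m3 = conf_mat_4sum(x, y), conf_mat_3sum(x, y)
assert np.array_equal(m4, m3)
assert m4.sum() == len(x)
```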
Actual Results
Slow execution time, around 60 times slower than the efficient code. The np.bool_ case could be detected and the efficient code applied; otherwise, at serious scale, the current code is too slow to be practically usable.
Versions
All - including 0.21
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 19 (18 by maintainers)
Also consider, if x and y are Boolean:
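One option along these lines, sketched here as my own assumption about what was meant rather than the commenter's exact code: encode each (x, y) pair as an integer 0–3 and tally all four cells with a single np.bincount call.

```python
import numpy as np

def conf_mat_bincount(x, y):
    # encode (x, y) as 2*x + y in {0, 1, 2, 3}: 0=TN, 1=FP, 2=FN, 3=TP;
    # minlength=4 guards against cells that never occur
    return np.bincount(2 * x + y, minlength=4).reshape(2, 2)

# small worked example: one pair per cell
x = np.array([True, True, False, False])
y = np.array([True, False, True, False])
m = conf_mat_bincount(x, y)  # [[TN, FP], [FN, TP]]
```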
I am evaluating 3D ML-based segmentation predictions and was looking for a fast confusion matrix implementation. My observation was that sklearn.metrics.confusion_matrix is quite slow; in fact it is slower than loading the data and doing inference with a UNet.
I did a comparison of different ways to compute the confusion matrix:
The results are:
The timing for the numba implementation can be optimized further (by half) if num_classes is known, using dispatch via generated_jit to skip computing the max of a and b.
Yes, a cython solution may also be fast, but this is really not the bottleneck in most people's machine learning pipelines.
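The comparison itself is not reproduced above, but one numpy-only contender that commonly appears in such comparisons (this is my sketch, not the commenter's code) generalizes the Boolean trick to an arbitrary known num_classes via a flattened-index bincount:

```python
import numpy as np

def conf_mat_multiclass(a, b, num_classes):
    # flatten each (true, pred) pair into one index in [0, num_classes**2)
    # and count all cells in a single pass over the data
    idx = a.astype(np.intp) * num_classes + b.astype(np.intp)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

# small worked example with 3 classes
a = np.array([0, 1, 2, 2])  # true labels
b = np.array([0, 2, 2, 2])  # predicted labels
m = conf_mat_multiclass(a, b, num_classes=3)
```

Passing num_classes explicitly is exactly what the generated_jit remark is about: it avoids scanning a and b for their maximum on every call.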
Update: the 1-logical-AND-with-3-sums version has been edited into the post above, as it is even more efficient; the execution time difference is now also known exactly.
Output:
Removing 3 logical ANDs, 4 logical negations and a sum operation hardly makes a difference in numpy.
2718/46 ≈ 59 times faster…
As for doing large batches:
I see:
So apparently it's faster in small batches than large batches, but still many times slower than native numpy vectorized operations, as list comprehensions have no way to match such speed.
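To illustrate the kind of gap being described (numbers are machine-dependent, so this is only a sketch with no claimed speedup factor), a direct timeit comparison of a Python-level pairwise loop, standing in for per-element work, against the vectorized Boolean version:

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
x = rng.random(4096) < 0.5
y = rng.random(4096) < 0.5

def conf_mat_loop(x, y):
    # Python-level counting: one dict-free scalar update per element
    m = np.zeros((2, 2), dtype=np.intp)
    for xi, yi in zip(x, y):
        m[int(xi), int(yi)] += 1
    return m

def conf_mat_vec(x, y):
    # the 1-AND, 3-sums version from the post above
    truepos, totalpos, totaltrue = np.sum(x & y), np.sum(y), np.sum(x)
    totalfalse = len(x) - totaltrue
    falsepos = totalpos - truepos
    return np.array([[totalfalse - falsepos, falsepos],
                     [totaltrue - truepos, truepos]])

assert np.array_equal(conf_mat_loop(x, y), conf_mat_vec(x, y))
print("loop:", timeit(lambda: conf_mat_loop(x, y), number=10))
print("vec: ", timeit(lambda: conf_mat_vec(x, y), number=10))
```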