scikit-learn: Bug: GMM ``score()`` returns an array, not a value.
The GMM.score() function returns an array, rather than a single value. This is inconsistent with the rest of scikit-learn: for example, both sklearn.base.ClassifierMixin and sklearn.base.RegressorMixin implement a score() function which returns a single number, as do KMeans, KernelDensity, PCA, GaussianHMM, and others.
Currently, GMM.score() returns an array of the individual scores for each sample: this should probably be called GMM.score_samples(), and GMM.score() should return sum(GMM.score_samples()).
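To make the proposed contract concrete, here is a minimal sketch using only the standard library. ToyMixture1D is a hypothetical class with fixed parameters, not part of scikit-learn; it only illustrates score_samples() returning one log-likelihood per sample and score() returning their sum.

```python
import math


class ToyMixture1D:
    """Hypothetical 1-D Gaussian mixture with fixed parameters (illustration only)."""

    def __init__(self, weights, means, stds):
        self.weights = weights
        self.means = means
        self.stds = stds

    def _log_density(self, x):
        # log p(x) = log sum_k w_k * N(x | mu_k, sigma_k),
        # evaluated with the log-sum-exp trick for numerical stability.
        terms = [
            math.log(w) - math.log(s * math.sqrt(2 * math.pi)) - 0.5 * ((x - m) / s) ** 2
            for w, m, s in zip(self.weights, self.means, self.stds)
        ]
        top = max(terms)
        return top + math.log(sum(math.exp(t - top) for t in terms))

    def score_samples(self, X):
        # One log-likelihood per sample.
        return [self._log_density(x) for x in X]

    def score(self, X):
        # A single number: the total log-likelihood of X.
        return sum(self.score_samples(X))


model = ToyMixture1D(weights=[0.5, 0.5], means=[-1.0, 1.0], stds=[1.0, 1.0])
X = [-1.2, 0.0, 0.8, 1.5]
per_sample = model.score_samples(X)
total = model.score(X)
print(len(per_sample), total)
```

With this split, anything that expects a scalar from score() (such as a cross-validation loop) can use the model directly, while per-sample values remain available through score_samples().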
Note that in the last release, we renamed GMM.eval() to GMM.score_samples(). I believe this was a mistake: the score_samples label has a very general meaning (e.g. it is used within KernelDensity), while GMM.eval() returns a tuple containing the per-cluster likelihoods, which makes sense only with GMM.
If this change were made so that GMM.score() returned a single number, then the following recipe would work to optimize a GMM model (as it does for, e.g. KDE). As it is, this recipe fails for GMM:
import numpy as np
from sklearn.mixture import GMM
from sklearn.datasets import make_blobs
from sklearn.grid_search import GridSearchCV
X, y = make_blobs(100, 2, centers=3)
# use grid search cross-validation to optimize the gmm model
params = {'n_components': range(1, 5)}
grid = GridSearchCV(GMM(), params)
grid.fit(X)
print grid.best_estimator_.n_components
The result:
ValueError: scoring must return a number, got <type 'numpy.ndarray'> instead.
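For readers on a recent scikit-learn, the same recipe can be sketched with GaussianMixture and sklearn.model_selection, which replaced GMM and sklearn.grid_search in later releases. This assumes a current install; the random_state values are added here only for reproducibility. Note that GaussianMixture.score() returns the mean per-sample log-likelihood rather than the sum suggested above.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

X, y = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

# GaussianMixture.score returns one float (mean per-sample log-likelihood),
# so GridSearchCV can use it as the default scorer when no y is given.
params = {"n_components": [1, 2, 3, 4]}
grid = GridSearchCV(GaussianMixture(random_state=0), params)
grid.fit(X)
print(grid.best_estimator_.n_components)
```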
About this issue
- State: closed
- Created 11 years ago
- Reactions: 1
- Comments: 19 (19 by maintainers)
Commits related to this issue
- Fix #2473. Add ```DensityMixin```. Change API of GMM, ```score_samples```, ```score``` — committed to xuewei4d/scikit-learn by xuewei4d 9 years ago
And what should we do with VBGMM and DPGMM? What is the difference between score_samples and predict_proba? predict_proba returns per-component likelihoods, right?

You are saying that score_samples should provide one score per sample, right? I agree. It should be the likelihood of each point under the model. And we need a method that gives cluster responsibilities. I thought that was predict_proba. Is it not?

Are there any other changes to the API that we need?

That’s what I’m thinking. In most cases I believe the sample score will essentially be the log-likelihood under the model. Do you know of any examples where this assumption might not hold?

So score_samples would be a log-space version of density?

Yes.

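The distinction discussed here (a per-sample log-density versus per-component responsibilities) can be checked on the current GaussianMixture estimator. This is a sketch assuming a recent scikit-learn, not the GMM class the thread is about; the data and random_state are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
gm = GaussianMixture(n_components=3, random_state=0).fit(X)

log_density = gm.score_samples(X)        # shape (100,): log p(x) per sample
responsibilities = gm.predict_proba(X)   # shape (100, 3): p(component | x)

print(log_density.shape, responsibilities.shape)
# Each row of responsibilities is a probability distribution over components.
print(np.allclose(responsibilities.sum(axis=1), 1.0))
# score() reduces the per-sample values to a single number (their mean).
print(np.isclose(gm.score(X), log_density.mean()))
```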
The score method should probably return a single number, as it does across the rest of the package. This change would allow GMM to be used automatically with cross-validation. Currently, score returns a score per sample, which is inconsistent with the rest of the package. For example, this works with KernelDensity, but the equivalent does not currently work with GMM: it leads to an error because gmm.score returns an array.

True. I think you’re right that predict_proba returns this already, and this is consistent with the way supervised classifiers return this information.
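
As an illustration of the KernelDensity case mentioned in the thread, here is a sketch of a bandwidth search with cross-validation. It is not the commenter's original snippet; the bandwidth grid, data, and random_state are assumptions, and the import path assumes a recent scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

# KernelDensity.score returns a single float (the total log-likelihood of X),
# which is what GridSearchCV needs from an unsupervised estimator.
params = {"bandwidth": np.logspace(-1, 1, 10)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(X)
print(grid.best_params_["bandwidth"])
```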