scikit-learn: Bug: GMM ``score()`` returns an array, not a value.

The GMM.score() function returns an array rather than a single value. This is inconsistent with the rest of scikit-learn: for example, both sklearn.base.ClassifierMixin and sklearn.base.RegressorMixin implement a score() method that returns a single number, as do KMeans, KernelDensity, PCA, GaussianHMM, and others.

Currently, GMM.score() returns an array of the individual scores for each sample: this should probably be called GMM.score_samples(), and GMM.score() should return sum(GMM.score_samples()).
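The proposed split can be sketched in plain NumPy (a standalone illustration with made-up parameters, not scikit-learn's implementation): score_samples() yields one log-likelihood per sample, and score() reduces that array to a single number.

```python
import numpy as np
from scipy.special import logsumexp

# Toy 1-D mixture with known parameters (hypothetical standalone sketch,
# not scikit-learn's implementation).
weights = np.array([0.6, 0.4])
means = np.array([0.0, 3.0])
stds = np.array([1.0, 0.5])

rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(3.0, 0.5, 40)])

# Per-component Gaussian log densities, shape (n_samples, n_components).
log_dens = (-0.5 * ((X[:, None] - means) / stds) ** 2
            - np.log(stds * np.sqrt(2 * np.pi)))

# What score_samples() should return: one log-likelihood per sample.
score_samples = logsumexp(np.log(weights) + log_dens, axis=1)

# What score() should return: a single number summarizing the fit.
score = score_samples.sum()

print(score_samples.shape)  # (100,)
```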

Note that in the last release, we renamed GMM.eval() to GMM.score_samples(). I believe this was a mistake: the score_samples name has a very general meaning (e.g. it is used within KernelDensity), while GMM.eval() returns a tuple containing the per-cluster likelihoods, which makes sense only for GMM.

If this change were made so that GMM.score() returned a single number, then the following recipe would work to optimize a GMM model (as it does for, e.g., KernelDensity). As it is, this recipe fails for GMM:

import numpy as np
from sklearn.mixture import GMM
from sklearn.datasets import make_blobs
from sklearn.grid_search import GridSearchCV

X, y = make_blobs(100, 2, centers=3)

# use grid search cross-validation to optimize the gmm model
params = {'n_components': range(1, 5)}
grid = GridSearchCV(GMM(), params)
grid.fit(X)

print(grid.best_estimator_.n_components)

The result:

ValueError: scoring must return a number, got <type 'numpy.ndarray'> instead.

About this issue

  • State: closed
  • Created 11 years ago
  • Reactions: 1
  • Comments: 19 (19 by maintainers)

Most upvoted comments

And what should we do with VBGMM and DPGMM? What is the difference between score_samples and predict_proba? predict_proba returns per-component likelihoods, right?

You are saying that score_samples should provide one score per sample, right? I agree. It should be the likelihood of each point under the model. And we need a method that gives cluster responsibilities. I thought that was predict_proba. Is it not?
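The distinction being discussed can be sketched in plain NumPy (a hypothetical standalone example, not sklearn code): score_samples gives one log-likelihood per sample, while the per-component responsibilities are the posterior probabilities over components, which sum to 1 for each sample.

```python
import numpy as np
from scipy.special import logsumexp

# Two unit-variance Gaussian components with equal weights
# (hypothetical standalone sketch, not sklearn code).
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
X = np.array([-2.0, 0.0, 2.0])

# Log of the joint density p(x, component), shape (n_samples, n_components).
log_joint = np.log(weights) + (-0.5 * (X[:, None] - means) ** 2
                               - 0.5 * np.log(2 * np.pi))

# score_samples: one log-likelihood log p(x) per sample.
score_samples = logsumexp(log_joint, axis=1)

# Responsibilities (the role of predict_proba): posterior over components.
responsibilities = np.exp(log_joint - score_samples[:, None])

print(responsibilities.sum(axis=1))  # each row sums to 1
```

At the midpoint x = 0 the responsibilities are an even 0.5/0.5 split, as expected for symmetric components.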

Are there any other changes to the API that we need?

so score_samples would be a log-space version of density?

That’s what I’m thinking. In most cases I believe the sample score will essentially be the log-likelihood under the model. Do you know of any examples where this assumption might not hold?

so score_samples would be a log-space version of density?

You are saying that score_samples should provide one score per sample, right? I agree. It should be the likelihood of each point under the model.

Yes.

Are there any other changes to the API that we need?

The score method should probably return a single number, as it does across the rest of the package. This change would allow GMM to be used automatically with cross-validation. Currently, score returns a score per sample, which is inconsistent with the rest of the package.

For example, this works with KernelDensity:

import numpy as np
X = np.random.randn(100, 2)

from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KernelDensity
grid = GridSearchCV(KernelDensity(), {'bandwidth': [0.05, 0.1, 0.2]})
grid.fit(X)

But the equivalent does not currently work with GMM: it leads to an error because gmm.score returns an array:

from sklearn.mixture import GMM
grid = GridSearchCV(GMM(), {'n_components': [1, 2, 3]})
grid.fit(X) # <-- fails
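For reference, a sketch of the same recipe against the modern API (sklearn.model_selection.GridSearchCV with sklearn.mixture.GaussianMixture, which replaced GMM), assuming a current scikit-learn install where score() returns a single number:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

X = np.random.RandomState(0).randn(100, 2)

# GaussianMixture.score() returns a single number (mean log-likelihood
# per sample), so grid-search cross-validation works out of the box.
grid = GridSearchCV(GaussianMixture(), {'n_components': [1, 2, 3]})
grid.fit(X)
print(grid.best_estimator_.n_components)
```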

And we need a method that gives cluster responsibilities.

True. I think you’re right that predict_proba returns this already, and this is consistent with the way supervised classifiers return this information.
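A short sketch of that behaviour using the modern GaussianMixture API (an assumption about today's scikit-learn, not the GMM class discussed in this issue): predict_proba returns one row of component responsibilities per sample, each summing to 1, just as supervised classifiers return per-class probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).randn(200, 2)
gm = GaussianMixture(n_components=3, random_state=0).fit(X)

# One row per sample, one column per component; rows sum to 1.
resp = gm.predict_proba(X)
print(resp.shape)  # (200, 3)
```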