scikit-learn: Cross-validation returning multiple scores
`Scorer` objects currently provide an interface that returns a scalar score given an estimator and test data. This is necessary for `*SearchCV` to calculate a mean score across folds and to determine the best score among parameter settings.
This severely limits the diagnostic information available from cross-validation or parameter search, which one can see by comparing against the catalogue of metrics that includes: precision and recall alongside F-score; scores for each of multiple classes as well as an aggregate; and error distributions (e.g. a PR curve or confusion matrix). @solomonm (#1837) and I (on the ML, and in an implementation within #1768) have independently sought to have precision and recall returned from cross-validation routines when F1 is used as the cross-validation objective; @eickenberg, at https://github.com/scikit-learn/scikit-learn/pull/1381#commitcomment-2607318, raised a concern regarding arrays of scores corresponding to multiple targets.
I thought it deserved an Issue of its own to solidify the argument and its solution.
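For concreteness, this is the kind of per-fold diagnostic information a scalar scorer discards, computed here by hand (a minimal sketch; the module paths are those of current scikit-learn, not necessarily the version this issue was written against):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression()

# Re-implementing the CV loop by hand, just to keep precision and recall
# alongside the F1 score that cross_val_score would have reported alone.
for train, test in KFold(n_splits=5).split(X, y):
    clf.fit(X[train], y[train])
    y_pred = clf.predict(X[test])
    p, r, f, _ = precision_recall_fscore_support(
        y[test], y_pred, average="binary")
    print("precision=%.3f  recall=%.3f  f1=%.3f" % (p, r, f))
```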
Some design options:
1. Allow multiple scorers to be provided to `cross_val_score` or `*SearchCV` (henceforth `CVEvaluator`), with one specified as the objective. But since a `Scorer` generally calls `estimator.{predict,decision_function,predict_proba}`, each scorer would repeat this work.
2. Separate the objective and non-objective metrics as parameters to `CVEvaluator`: the `scoring` parameter remains as it is, and a `diagnostics` parameter provides a callable with similar (the same?) arguments as a `Scorer`, but returning a dict. The prediction work is repeated, but not necessarily as many times as there are metrics. This diagnostics callable is more flexible and could perhaps be passed the training data as well as the test data.
3. Continue to use the `scoring` parameter, but allow the `Scorer` to return a dict with a special key for the objective score. This would need to be handled by the caller. For backwards compatibility, no existing scorers would change their behaviour of returning a float. This ensures no repeated prediction work. (A sketch follows below.)
4. Add an additional method to the `Scorer` interface that generates a set of named outputs (as with `calc_names` proposed in #1837), again with a special key for the objective score. This allows users to continue using `scoring='f1'` but get back precision and recall for free.
Note that 3. and 4. potentially allow for any set of metrics to be composed into a scorer without redundant prediction work (and 1. allows composition with highly redundant prediction work).
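To make option 3 concrete, here is a minimal sketch of what a dict-returning scorer could look like. The callable signature follows the existing scorer protocol, but the dict return and the `"objective"` key are hypothetical, and today's `cross_val_score` / `*SearchCV` would not accept this without the changes discussed here:

```python
from sklearn.metrics import precision_recall_fscore_support

def f1_with_diagnostics(estimator, X, y):
    """Hypothetical dict-returning scorer (design option 3).

    The 'objective' entry would be used for model selection and for
    averaging across folds; the remaining entries are diagnostics
    computed from the same predictions, so no work is repeated.
    """
    y_pred = estimator.predict(X)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y, y_pred, average="binary")
    return {"objective": f1, "precision": precision, "recall": recall}
```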
Comments, critiques and suggestions are very welcome.
About this issue
- State: closed
- Created 11 years ago
- Reactions: 10
- Comments: 33 (30 by maintainers)
Did this go anywhere? It would be really nice to pass a list of metrics to `cross_val_score` and get back a list of scores in the same order, or a dict with metric names as keys.
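For reference, the interface being asked for here is roughly what multi-metric support eventually looked like: pass several scorer names in one call and get back a dict keyed by metric. A hedged sketch using the `cross_validate` helper added in later scikit-learn releases (result keys may differ across versions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=0)

# One call, several metrics: the result is a dict with one array of
# per-fold scores per metric (plus fit/score times).
results = cross_validate(LogisticRegression(), X, y, cv=5,
                         scoring=["precision", "recall", "f1"])
print(results["test_precision"])
print(results["test_recall"])
print(results["test_f1"])
```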
@raghavrv this is still open, right? Is there a PR?
I think we’re close, and there’s a fair chance you’ll see this in 0.19. But we have no desire to rush into a design that then needs to be redesigned.
I’m interested in whether the current proposal (#7388), allowing multiple values for `scoring`, is better than a generic callback to extract diagnostic info from each fit, or whether we need both…
On 3 Mar 2017 1:02 am, “RokoMijic” notifications@github.com wrote:
Any progress on this front? I am busy putting an explanation in a docstring for some code, telling the reader why I am re-implementing cross-validation rather than using scikit-learn.
Are we going to send this issue to elementary school? It’s going to be 4 years old soon! 😉 Anyway, just to add: scikit-learn is awesome and I’m really grateful for the hard work that people put into it!