scikit-learn: Support for regression by classification

My team and I are working on an application of regression by classification, a technique described in this article.

In a nutshell, regression by classification means approaching a regression problem with multi-class classification algorithms. The key step in this technique is to discretize, or bin, the (continuous) target prior to classification. The article mentions 3 different approaches to target discretization, all of which are supported by sklearn’s KBinsDiscretizer (see the sketch after this list):

  1. Equally probable intervals (the quantile strategy of KBinsDiscretizer)
  2. Equal width intervals (the uniform strategy of KBinsDiscretizer)
  3. K-means clustering (the kmeans strategy of KBinsDiscretizer)
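
As a quick illustration (our own minimal sketch, not code from the article), here is how the three strategies map onto KBinsDiscretizer; with a skewed target, quantile produces roughly equal bin counts while uniform does not:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
y = rng.exponential(size=(200, 1))  # skewed continuous target, as a 2-D column

for strategy in ('quantile', 'uniform', 'kmeans'):
    disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy=strategy)
    y_binned = disc.fit_transform(y).ravel()  # bin codes 0..4 (as floats)
    print(strategy, np.bincount(y_binned.astype(int)))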

In regression by classification, the choice of the number of classes, i.e. the n_bins parameter, is critical. One straightforward way to tune this parameter, and to choose the binning strategy, is cross-validation (see the grid-search sketch further below). But because transformations on y (see #4143) are currently forbidden in scikit-learn, this is not “natively” supported.

We found a way around this by creating our own meta-estimator, as suggested by @jnothman elsewhere. But one problem remained: how can we tell scikit-learn to compute evaluation metrics on BINNED targets, and not the original CONTINUOUS targets?
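
For context, here is a minimal sketch of what such a meta-estimator might look like (BinnedRegressor and its parameter names are illustrative, not our actual code); the get_transformed_targets hook it exposes is the subject of the hack below:

import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.preprocessing import KBinsDiscretizer

class BinnedRegressor(BaseEstimator):
    """Bin the continuous target, then fit a classifier on the bin codes."""

    def __init__(self, classifier=None, n_bins=5, strategy='quantile'):
        self.classifier = classifier
        self.n_bins = n_bins
        self.strategy = strategy

    def fit(self, X, y):
        self.discretizer_ = KBinsDiscretizer(
            n_bins=self.n_bins, encode='ordinal', strategy=self.strategy)
        y_binned = self.discretizer_.fit_transform(
            np.asarray(y).reshape(-1, 1)).ravel()
        self.classifier_ = clone(self.classifier).fit(X, y_binned)
        return self

    def predict(self, X):
        # Predictions are bin codes, i.e. class labels
        return self.classifier_.predict(X)

    def get_transformed_targets(self, X, y):
        # Hook picked up by the scorer hack shown below
        return self.discretizer_.transform(
            np.asarray(y).reshape(-1, 1)).ravel()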

We achieved this by hacking the _PredictScorer class on our scikit-learn fork. The hack looks for a special custom method called get_transformed_targets on our home-brewed meta-estimator; if this method is present, the score is computed using the transformed (binned) targets. Here is the hack:

class _PredictScorer(_BaseScorer):
    def _score(self, method_caller, estimator, X, y_true, sample_weight=None):
        """[... docstring ...]
        """
        # Hack: if the estimator exposes get_transformed_targets, score
        # against the binned targets instead of the original continuous ones
        if hasattr(estimator, 'get_transformed_targets'):
            y_true = estimator.get_transformed_targets(X, y_true)

        y_pred = method_caller(estimator, "predict", X)
        if sample_weight is not None:
            return self._sign * self._score_func(y_true, y_pred,
                                                 sample_weight=sample_weight,
                                                 **self._kwargs)
        else:
            return self._sign * self._score_func(y_true, y_pred,
                                                 **self._kwargs)
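
With the hacked scorer in place, tuning n_bins and the binning strategy by cross-validation might look like this (a sketch assuming our fork, so that 'accuracy' is computed against the binned targets via the hook above; on stock scikit-learn this would fail, since accuracy_score rejects a continuous y_true):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    BinnedRegressor(classifier=RandomForestClassifier(random_state=0)),
    param_grid={'n_bins': [3, 5, 10],
                'strategy': ['quantile', 'uniform', 'kmeans']},
    scoring='accuracy',  # evaluated on binned y_true thanks to the hack
    cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)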

Another problem we encountered is applying the KBinsDiscretizer class to targets: like all scikit-learn transformers, it operates on 2-D X, not on y. We plan to handle this with a custom meta-transformer.
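
One existing building block may help here: scikit-learn’s TransformedTargetRegressor already applies a transformer to y (reshaping a 1-D target internally) and maps predictions back through inverse_transform, which for KBinsDiscretizer means back to bin centers. A sketch we have not validated for this use case; note it keeps a regressor in the loop rather than a classifier, so it is not full regression by classification:

from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import KBinsDiscretizer

# Fits the inner estimator on bin codes; predict() returns bin centers
# via KBinsDiscretizer.inverse_transform
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(random_state=0),
    transformer=KBinsDiscretizer(n_bins=5, encode='ordinal',
                                 strategy='quantile'),
    check_inverse=False)  # binning is lossy, so skip the round-trip check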

It would be nice if regression by classification were supported by scikit-learn out of the box. Perhaps the resampling options coming soon will make this possible, but that will have to be tested.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 25 (20 by maintainers)

Most upvoted comments

We try to encourage good practice, particularly around evaluation. Evaluating in the classification space does not tell you about how well you solved the regression problem. I think the API can make it possible, but should not make it too easy or the default.