scikit-learn: Logistic regression prior adjustment during prediction when class_weight="balanced"

I quickly glanced through the code and did not find any intercept adjustment at test (prediction) time when the estimator is initialized as LogisticRegression(class_weight="balanced"), and the docs don’t suggest this either. Unless I am missing something, there does not appear to be any readjustment of the class priors after an “artificially” balanced prior is used to train the model. If it is not implemented, might this be something worth adding?
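To make this concrete, here is a minimal sketch on synthetic data (illustrative only): without any prior readjustment, the balanced model’s mean predicted probability is pulled far above the true positive rate.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Imbalanced toy data: roughly 5% positives.
    X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                               random_state=0)

    plain = LogisticRegression().fit(X, y)
    balanced = LogisticRegression(class_weight="balanced").fit(X, y)

    print(y.mean())                                # the true prior, ~0.05
    print(plain.predict_proba(X)[:, 1].mean())     # tracks ~0.05
    print(balanced.predict_proba(X)[:, 1].mean())  # inflated far above it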

References:

  1. https://stats.stackexchange.com/questions/117592/logistic-regression-prior-correction-at-test-time
  2. https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression
  3. https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 19 (15 by maintainers)

Most upvoted comments

It’s a calibration problem. See how CalibratedClassifierCV works.
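For reference, a minimal sketch of that approach (the toy data and the sigmoid method are illustrative choices):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                               random_state=0)

    # Refit the balanced model inside a cross-validated calibration loop;
    # method can be "sigmoid" (Platt scaling) or "isotonic".
    clf = CalibratedClassifierCV(
        LogisticRegression(class_weight="balanced"), method="sigmoid", cv=5
    ).fit(X, y)

    print(clf.predict_proba(X)[:, 1].mean())  # pulled back near the true rate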

here I would just subclass and address this quite specific use-case.

Basically it’s just a matter of adjusting the intercept / bias term to optimize a specific scorer.

For me, the right solution is to implement a wrapper object that shifts the threshold on the decision function.
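A minimal sketch of such a wrapper for the binary case (the class name and the F1 criterion are illustrative, not an existing scikit-learn API):

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin, clone
    from sklearn.metrics import f1_score

    class ThresholdedClassifier(BaseEstimator, ClassifierMixin):
        """Hypothetical wrapper: fit the inner estimator, then choose the
        decision_function threshold that maximizes a scorer (F1 here)."""

        def __init__(self, estimator):
            self.estimator = estimator

        def fit(self, X, y):
            self.estimator_ = clone(self.estimator).fit(X, y)
            scores = self.estimator_.decision_function(X)
            # Scan candidate thresholds; in practice this should be done on
            # held-out data to avoid overfitting the threshold.
            candidates = np.percentile(scores, np.linspace(1, 99, 99))
            f1s = [f1_score(y, (scores >= t).astype(int)) for t in candidates]
            self.threshold_ = candidates[int(np.argmax(f1s))]
            return self

        def predict(self, X):
            scores = self.estimator_.decision_function(X)
            return (scores >= self.threshold_).astype(int)

    # Usage, e.g.:
    #   wrapped = ThresholdedClassifier(LogisticRegression()).fit(X, y)
    #   y_pred = wrapped.predict(X_test)

(Recent scikit-learn releases ship TunedThresholdClassifierCV in sklearn.model_selection, which implements this idea with proper cross-validation.)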

Tuning the threshold won’t help with the fact that the returned probability is not well calibrated (see e.g. https://github.com/scikit-learn/scikit-learn/pull/5250#issue-44725088 example).

here I would just subclass and address this quite specific use-case.

Is it that specific though? One of the advantages of logistic regression is that the probabilities are well calibrated (and we say so repeatedly in the docs). However, that won’t be the case with class_weight="balanced", which is also quite a frequent option given the prevalence of imbalanced data in practice. https://github.com/scikit-learn/scikit-learn/pull/5250 gives a solution for the binary case, but it would need to be extended to the multi-class case.

Overall I feel this would be both useful and not that trivial to implement in a subclass (particularly for the multi-class case), enough to warrant addition in scikit-learn via an optional parameter. In practice I would likely rather run CalibratedClassifierCV than try to re-implement it myself when needed, but if we get the same result without CV, all the better. Even if it only addresses this issue for LogisticRegression, +1 for addition on my part.
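If it helps, the prior-correction idea (reference 1 above) seems to extend to the multi-class/softmax case by shifting each class intercept by the log of its true class proportion. A minimal sketch under that assumption, not an existing scikit-learn feature:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Imbalanced 3-class toy data.
    X, y = make_classification(n_samples=10_000, n_classes=3, n_informative=6,
                               weights=[0.80, 0.15, 0.05], random_state=0)

    clf = LogisticRegression(class_weight="balanced").fit(X, y)

    # The softmax is invariant to a constant shift of all logits, so adding
    # log(p_k) to each class intercept re-imposes the true class priors.
    priors = np.bincount(y) / len(y)
    clf.intercept_ += np.log(priors)

    print(priors)                             # the true class proportions
    print(clf.predict_proba(X).mean(axis=0))  # close to them again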

Trying to summarise where we are:

@shahlaebrahimi yes, we are talking about a cost-sensitive problem. As said above, I think there are two ways of using sample_weight / class_weight:

  1. predictive costs: to reflect a specific cost we have (when predicting). E.g. we may want to be more accurate on some specific inputs or on a specific class.
  2. modelling costs: to train our model better, e.g. to circumvent unbalanced classes in training.

Now that I’m reading the comments again, I’m more inclined to think that what we want is 2., since we can act on 1. by playing with the threshold once the model is fitted.

Thus I think that what we want to do is something like:

LogisticRegression().fit(X, y) -> (w, b)
LogisticRegression(class_weight="balanced").fit(X, y) -> (w', b')
LogisticRegression(class_weight="balanced", predict_balanced=False).fit(X, y) -> (w', b)

with b = b' + log(p1 / p0), where p1 and p0 are the true class proportions; see @ashimb9’s reference 1: https://stats.stackexchange.com/questions/117592/logistic-regression-prior-correction-at-test-time
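A sketch of that correction applied manually after fitting (note predict_balanced does not exist; the in-place intercept shift below stands in for it):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                               random_state=0)

    clf = LogisticRegression(class_weight="balanced").fit(X, y)  # learns (w', b')

    # Undo the artificially balanced prior: b = b' + log(p1 / p0),
    # where p1 and p0 are the true class proportions.
    p1 = y.mean()
    clf.intercept_ += np.log(p1 / (1 - p1))

    print(clf.predict_proba(X)[:, 1].mean())  # back near the true rate, ~0.05

The coefficients stay at w', matching the (w', b) pair above; the shift is exact under the prior-correction assumptions discussed in reference 1, and only approximate otherwise.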