scikit-learn: Logistic regression prior adjustment during prediction when class_weight="balanced"
I quickly glanced through the code and did not find any intercept adjustment at test (prediction) time when the estimator is initialized as `LogisticRegression(class_weight="balanced")`, and the docs do not suggest one either. Unless I am missing something, there does not appear to be any readjustment of the class priors after an "artificially" balanced prior is used to train the model. If this is not implemented, might it be worth adding?
References:
- https://stats.stackexchange.com/questions/117592/logistic-regression-prior-correction-at-test-time
- https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression
- https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
It's a calibration problem; see how `CalibratedClassifierCV` works. Here I would just subclass and address this quite specific use case. Basically, it's just a matter of adjusting the intercept / bias term to optimize a specific scorer.
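For context, a minimal sketch of that route (synthetic data; the `make_classification` parameters are illustrative, not from this issue):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (~10% positives).
X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)

# Wrap the class-weighted model and recalibrate its probabilities with
# cross-validated Platt scaling ("isotonic" is the non-parametric option).
calibrated = CalibratedClassifierCV(
    LogisticRegression(class_weight="balanced"), method="sigmoid", cv=5
)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)
```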
Tuning the threshold won't help with the fact that the returned probabilities are not well calibrated (see e.g. the example in https://github.com/scikit-learn/scikit-learn/pull/5250#issue-44725088).
Is it that specific, though? One of the advantages of logistic regression is that its probabilities are well calibrated (and we say so repeatedly in the docs). However, that won't be the case with `class_weight="balanced"`, which is also quite a frequent option given the prevalence of imbalanced data in practice. https://github.com/scikit-learn/scikit-learn/pull/5250 gives a solution for the binary case, but it would need to be extended to the multi-class case.

Overall, I feel this would be both useful and non-trivial enough to implement (particularly for the multi-class case) in a subclass to warrant adding it to scikit-learn via an optional parameter. In practice I would likely rather run `CalibratedClassifierCV` than try to re-implement this myself when needed, but if we can get the same result without cross-validation, all the better, even if it only addresses the issue for `LogisticRegression`. +1 for the addition on my part.
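A quick way to see the miscalibration on synthetic data (a sketch; the data setup is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~5% positives.
X, y = make_classification(n_samples=20_000, weights=[0.95], random_state=0)

for cw in (None, "balanced"):
    clf = LogisticRegression(class_weight=cw).fit(X, y)
    mean_p = clf.predict_proba(X)[:, 1].mean()
    print(f"class_weight={cw}: mean P(y=1)={mean_p:.3f}, base rate={y.mean():.3f}")

# With class_weight="balanced", the mean predicted probability is pulled
# toward 0.5 and no longer tracks the true positive rate.
```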
Trying to summarise where we are:
@shahlaebrahimi yes, we are talking about a cost-sensitive problem. As said above, I think there are two ways of using `sample_weight` / `class_weight`. Now that I'm reading the comments again, I'm more inclined to think that what we want is 2., since we can act on 1. by playing with the threshold once the model is fitted, as sketched below.
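For reference, acting on 1. amounts to moving the decision threshold after fitting (a sketch; `predict_with_threshold` is a hypothetical helper, not scikit-learn API):

```python
import numpy as np

def predict_with_threshold(clf, X, t=0.5):
    # Classify as positive when P(y=1 | x) >= t instead of the default 0.5;
    # lowering t favours recall on the minority class, raising it precision.
    return (clf.predict_proba(X)[:, 1] >= t).astype(int)
```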
Thus I think that what we want to do is something like correcting the fitted intercept at prediction time, with `b = b' + log(class proportions)`; see the first of @ashimb9's references: https://stats.stackexchange.com/questions/117592/logistic-regression-prior-correction-at-test-time
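A minimal sketch of that correction for the binary case, assuming the model was fitted with `class_weight="balanced"` (so the effective training prior is 50/50 and the correction reduces to adding the log-odds of the true class proportions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (~5% positives); illustrative setup.
X, y = make_classification(n_samples=10_000, weights=[0.95], random_state=0)

clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Prior correction: the balanced weights fit the model as if the classes
# were 50/50, so shift the intercept back by the log-odds of the true
# class proportions, i.e. b = b' + log(n1 / n0).
n0, n1 = np.bincount(y)
clf.intercept_ = clf.intercept_ + np.log(n1 / n0)

# predict_proba now reflects the original prior rather than the 50/50 one.
print(clf.predict_proba(X[:5])[:, 1])
```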