statsmodels: Ridge regression in GLM gives incorrect results
Describe the bug
Something unexpected occurs when I perform a logistic regression. I believe it has something to do with the intercept in the model. It only occurs when I use `fit_regularized()` and, in particular, when `L1_wt` is zero. In this case, the hidden function `_fit_ridge()` is used.
Note: Possibly related to https://github.com/statsmodels/statsmodels/issues/3624.
Code Sample, a copy-pastable example if possible
```python
import sys

import numpy
import scipy.special
import statsmodels.api
import matplotlib.pyplot

print("sys version: " + sys.version)
print("statsmodels version: " + statsmodels.__version__)
print()

numpy.random.seed(123)
n = 50
interceptData = numpy.repeat(1, n)
timeData = numpy.arange(0, n)
sinData = numpy.sin(2 * numpy.pi * timeData / n)
errorData = numpy.random.normal(0, 1, n)
yData = scipy.special.expit(2.3 - 0.1 * timeData + 2.0 * sinData + 0.6 * errorData)
print("yData min: " + str(numpy.min(yData)))
print("yData mean: " + str(numpy.mean(yData)))
print("yData max: " + str(numpy.max(yData)))

for timeAdd in [0, 999]:
    for interceptAdd in [0, 999]:
        for L1_wtValue in [0, 1e-9]:
            print()
            print("timeAdd: " + str(timeAdd) + " interceptAdd: " + str(interceptAdd) + " L1_wtValue: " + str(L1_wtValue))
            interceptDataNew = interceptData + interceptAdd
            timeDataNew = timeData + timeAdd
            exogData = numpy.transpose([interceptDataNew, timeDataNew, sinData])
            model = statsmodels.api.GLM(endog=yData, exog=exogData, family=statsmodels.api.families.Binomial())
            fit = model.fit_regularized(alpha=0, L1_wt=L1_wtValue)
            print("parameters: " + str(fit.params))
            print("fit min: " + str(numpy.min(fit.fittedvalues)))
            print("fit mean: " + str(numpy.mean(fit.fittedvalues)))
            print("fit max: " + str(numpy.max(fit.fittedvalues)))
            print("MSE: " + str(numpy.mean(numpy.power(fit.fittedvalues - yData, 2))))
            matplotlib.pyplot.scatter(timeData, yData, color="red")
            matplotlib.pyplot.scatter(timeData, fit.fittedvalues, color="green")
            matplotlib.pyplot.show()
```
Expected Output
The models should all produce identical output, with a low MSE.
Output of `import statsmodels.api; statsmodels.__version__`

```
sys version: 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
statsmodels version: 0.10.2
```
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (12 by maintainers)
My guess is that this is a similar scaling problem to the one we have with Poisson because of the `exp` function. In Poisson, large exog values make the gradient and Hessian very large because of the `exp`, and can result in over- and underflow problems. At one point bfgs was changed in scipy in a way that made it less robust to large gradients, and many of our Poisson unit test cases started to fail with bfgs.

I don't like standardizing data when the parameters should have an interpretation, but often some rescaling can be meaningful and makes it numerically much easier. Example: in Poisson we had a test case with age as an explanatory variable, where age in the sample was around 65 to 90. Our Poisson couldn't handle this without changing the scale, e.g. subtracting 65 or 75 and/or dividing by 10 (i.e. decades).
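As an illustration of that kind of rescaling (a sketch with made-up ages, not the actual statsmodels test case):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical ages in the troublesome 65-90 range mentioned above.
age = rng.uniform(65, 90, size=200)

# Rescale as suggested: subtract 75 and divide by 10 (decades).
age_decades = (age - 75) / 10

# exp() of a linear predictor in raw ages is astronomically large,
# while the rescaled version keeps the exp() arguments small.
print(np.exp(age).max())          # huge
print(np.exp(age_decades).max())  # modest
```

The parameter on `age_decades` still has a clean interpretation: the effect per decade of age, centered at 75.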
It's still an open question whether we should add internal rescaling, #1131. We traditionally don't automatically change the data that the user provides. An alternative would be a `fit_transformed` method, but that wouldn't work with `fit_regularized`.

Another point is how good the starting values for the scipy optimizers are. Often the problem is much more severe when we are not yet close to the optimum. GLM uses a few steps of IRLS as starting values before switching to scipy optimizers. In the discrete count models, I spent a lot of time trying to find good starting values, so we can tolerate at least some bad scaling or bad conditioning before our default optimization fails.
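The IRLS warm-start idea can be sketched in plain NumPy (a toy Newton/IRLS loop for logistic regression, not statsmodels' implementation; all names here are illustrative):

```python
import numpy as np

def irls_logit(X, y, n_steps=3, beta0=None):
    """A few IRLS (Newton) steps for a logistic regression.

    Toy sketch of using cheap IRLS iterations to produce starting
    values before handing off to a general-purpose optimizer.
    """
    beta = np.zeros(X.shape[1]) if beta0 is None else beta0.copy()
    for _ in range(n_steps):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = mu * (1.0 - mu)                    # IRLS weights
        grad = X.T @ (y - mu)                  # score
        H = X.T @ (X * W[:, None])             # Fisher information
        beta = beta + np.linalg.solve(H, grad) # Newton/IRLS update
    return beta

rng = np.random.RandomState(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
p = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 1])))
y = (rng.uniform(size=200) < p).astype(float)

start = irls_logit(X, y, n_steps=3)  # cheap, decent starting values
print(start)
```

Even a handful of such steps usually lands close enough to the optimum that a gradient-based optimizer is far less sensitive to scaling.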
I have not seen any issues when the intercept is 1. When the intercept is 1000 there was an issue that seems to be resolved by #6438.
In general, with penalized regression (ridge and/or Lasso) people center and standardize the covariates. So the scaling problem tends not to arise in practice. Maybe that’s why we didn’t notice it before.
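Following that convention, a user-side workaround is to center and scale the non-constant columns before calling `fit_regularized`. A minimal sketch (the helper name is hypothetical, not a statsmodels API):

```python
import numpy as np

def standardize_exog(exog, const_col=0):
    """Center and scale every column except the constant column.

    Returns the standardized design matrix plus the column means and
    standard deviations needed to map coefficients back to the
    original scale afterwards.
    """
    exog = np.asarray(exog, dtype=float)
    means = exog.mean(axis=0)
    stds = exog.std(axis=0)
    Z = exog.copy()
    for j in range(exog.shape[1]):
        if j == const_col:
            continue  # leave the intercept column untouched
        Z[:, j] = (exog[:, j] - means[j]) / stds[j]
    return Z, means, stds

# Example: a constant, a trend with a large offset, and a sine term,
# mimicking the exog in the bug report above.
n = 50
t = np.arange(n) + 999.0
X = np.column_stack([np.ones(n), t, np.sin(2 * np.pi * np.arange(n) / n)])
Z, means, stds = standardize_exog(X)
print(Z[:, 1].mean(), Z[:, 1].std())  # ~0 and ~1
```

With columns on comparable scales, the large-offset pathology described in the report should not arise in the first place.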
Output of above code:
Score is the derivative of the objective function. Dividing by nobs can be good if the sample size is very large, since it reduces the magnitude. OTOH, it can be bad in smaller models since it makes the function look flatter. The theory rests on the fact that score(x)/nobs is zero exactly when score(x) is zero, so the optimum is unchanged. Numerical reality is harder.
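The trade-off can be seen with a toy quadratic objective (an illustrative sketch, not statsmodels code): dividing the score by nobs leaves the root unchanged but shrinks the gradient magnitudes the optimizer sees.

```python
import numpy as np

nobs = 10_000
theta_hat = 2.0

def score(theta):
    # Score of a toy objective (nobs / 2) * (theta - theta_hat)**2.
    return nobs * (theta - theta_hat)

def score_scaled(theta):
    # Same score divided by the number of observations.
    return score(theta) / nobs

# Both scores vanish at exactly the same point...
print(score(theta_hat), score_scaled(theta_hat))  # 0.0 0.0
# ...but away from it their magnitudes differ by a factor of nobs,
# which is what makes the scaled objective look "flatter".
print(score(2.1), score_scaled(2.1))              # ~1000 vs ~0.1
```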