statsmodels: Ridge regression in GLM gives incorrect results
Describe the bug
Something unexpected occurs when I perform a logistic regression. I believe it has something to do with the intercept in the model. It only occurs when I use `fit_regularized()` and, in particular, when `L1_wt` is zero. In this case, the hidden function `_fit_ridge()` is used.
Note: Possibly related to https://github.com/statsmodels/statsmodels/issues/3624.
Code Sample, a copy-pastable example if possible
```python
import sys

import numpy
import scipy.special
import statsmodels.api
import matplotlib.pyplot

print("sys version: " + sys.version)
print("statsmodels version: " + statsmodels.__version__)
print()

numpy.random.seed(123)
n = 50
interceptData = numpy.repeat(1, n)
timeData = numpy.arange(0, n)
sinData = numpy.sin(2 * numpy.pi * timeData / n)
errorData = numpy.random.normal(0, 1, n)
yData = scipy.special.expit(2.3 - 0.1 * timeData + 2.0 * sinData + 0.6 * errorData)
print("yData min: " + str(numpy.min(yData)))
print("yData mean: " + str(numpy.mean(yData)))
print("yData max: " + str(numpy.max(yData)))

for timeAdd in [0, 999]:
    for interceptAdd in [0, 999]:
        for L1_wtValue in [0, 1e-9]:
            print()
            print("timeAdd: " + str(timeAdd) + " interceptAdd: " + str(interceptAdd) + " L1_wtValue: " + str(L1_wtValue))
            interceptDataNew = interceptData + interceptAdd
            timeDataNew = timeData + timeAdd
            exogData = numpy.transpose([interceptDataNew, timeDataNew, sinData])
            model = statsmodels.api.GLM(endog=yData, exog=exogData, family=statsmodels.api.families.Binomial())
            fit = model.fit_regularized(alpha=0, L1_wt=L1_wtValue)
            print("parameters: " + str(fit.params))
            print("fit min: " + str(numpy.min(fit.fittedvalues)))
            print("fit mean: " + str(numpy.mean(fit.fittedvalues)))
            print("fit max: " + str(numpy.max(fit.fittedvalues)))
            print("MSE: " + str(numpy.mean(numpy.power(fit.fittedvalues - yData, 2))))
            matplotlib.pyplot.scatter(timeData, yData, color="red")
            matplotlib.pyplot.scatter(timeData, fit.fittedvalues, color="green")
            matplotlib.pyplot.show()
```
Expected Output
The models should all produce identical output, with a low MSE.
Output of `import statsmodels.api; statsmodels.__version__`

```
sys version: 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
statsmodels version: 0.10.2
```
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (12 by maintainers)
My guess is that this is a similar scaling problem to the one we have with Poisson because of the `exp` function. In Poisson, large exog values make the gradient and Hessian very large because of the `exp`, and can result in over- and underflow problems. At one point bfgs was changed in scipy in a way that made it less robust to large gradients, and many of our Poisson unit test cases started to fail with bfgs.

I don't like standardizing data when the parameters should have an interpretation, but often some rescaling can be meaningful and makes it numerically much easier. Example: in Poisson we had a test case with age as an explanatory variable, where age in the sample was around 65 to 90. Our Poisson couldn't handle this without changing the scale, e.g. subtracting 65 or 75 and/or dividing by 10 (i.e. decades).
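As an illustration of that kind of rescaling (a sketch with made-up ages, not the actual statsmodels test case):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical ages in the troublesome 65-90 range mentioned above.
age = rng.uniform(65, 90, size=200)

# Rescale as suggested: subtract 75 and divide by 10 (decades).
age_decades = (age - 75) / 10

# exp() of a linear predictor in raw ages is astronomically large,
# while the rescaled version keeps the exp() arguments small.
print(np.exp(age).max())          # huge
print(np.exp(age_decades).max())  # modest
```

The parameter on `age_decades` still has a clean interpretation: the effect per decade of age, centered at 75.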
It's still an open question whether we should add internal rescaling, #1131. We traditionally don't automatically change the data that the user provides. An alternative would be a `fit_transformed` method, but that wouldn't work with `fit_regularized`.

Another point is how good the starting values for the scipy optimizers are. Often the problem is much more severe when we are not yet close to the optimum. GLM uses a few steps of IRLS as starting values before switching to scipy optimizers. In the discrete count models, I spent a lot of time trying to find good starting values, so we can tolerate at least some bad scaling or bad conditioning before our default optimization fails.
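The IRLS warm-start idea can be sketched in plain NumPy (a toy Newton/IRLS loop for logistic regression, not statsmodels' implementation; all names here are illustrative):

```python
import numpy as np

def irls_logit(X, y, n_steps=3, beta0=None):
    """A few IRLS (Newton) steps for a logistic regression.

    Toy sketch of using cheap IRLS iterations to produce starting
    values before handing off to a general-purpose optimizer.
    """
    beta = np.zeros(X.shape[1]) if beta0 is None else beta0.copy()
    for _ in range(n_steps):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = mu * (1.0 - mu)                    # IRLS weights
        grad = X.T @ (y - mu)                  # score
        H = X.T @ (X * W[:, None])             # Fisher information
        beta = beta + np.linalg.solve(H, grad) # Newton/IRLS update
    return beta

rng = np.random.RandomState(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
p = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 1])))
y = (rng.uniform(size=200) < p).astype(float)

start = irls_logit(X, y, n_steps=3)  # cheap, decent starting values
print(start)
```

Even a handful of such steps usually lands close enough to the optimum that a gradient-based optimizer is far less sensitive to scaling.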
I have not seen any issues when the intercept is 1. When the intercept is 1000 there was an issue that seems to be resolved by #6438.
In general, with penalized regression (ridge and/or Lasso) people center and standardize the covariates. So the scaling problem tends not to arise in practice. Maybe that’s why we didn’t notice it before.
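Following that convention, a user-side workaround is to center and scale the non-constant columns before calling `fit_regularized`. A minimal sketch (the helper name is hypothetical, not a statsmodels API):

```python
import numpy as np

def standardize_exog(exog, const_col=0):
    """Center and scale every column except the constant column.

    Returns the standardized design matrix plus the column means and
    standard deviations needed to map coefficients back to the
    original scale afterwards.
    """
    exog = np.asarray(exog, dtype=float)
    means = exog.mean(axis=0)
    stds = exog.std(axis=0)
    Z = exog.copy()
    for j in range(exog.shape[1]):
        if j == const_col:
            continue  # leave the intercept column untouched
        Z[:, j] = (exog[:, j] - means[j]) / stds[j]
    return Z, means, stds

# Example: a constant, a trend with a large offset, and a sine term,
# mimicking the exog in the bug report above.
n = 50
t = np.arange(n) + 999.0
X = np.column_stack([np.ones(n), t, np.sin(2 * np.pi * np.arange(n) / n)])
Z, means, stds = standardize_exog(X)
print(Z[:, 1].mean(), Z[:, 1].std())  # ~0 and ~1
```

With columns on comparable scales, the large-offset pathology described in the report should not arise in the first place.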
Output of above code:
Score is the derivative of the objective function. Dividing by nobs can be good if the sample size is very large, since it reduces the magnitude. OTOH, it can be bad in smaller models since it makes the function look flatter. The theory rests on the fact that score(x)/nobs is zero exactly when score(x) is zero, so the optimum is unchanged. Numerical reality is harder.
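The trade-off can be seen with a toy quadratic objective (an illustrative sketch, not statsmodels code): dividing the score by nobs leaves the root unchanged but shrinks the gradient magnitudes the optimizer sees.

```python
import numpy as np

nobs = 10_000
theta_hat = 2.0

def score(theta):
    # Score of a toy objective (nobs / 2) * (theta - theta_hat)**2.
    return nobs * (theta - theta_hat)

def score_scaled(theta):
    # Same score divided by the number of observations.
    return score(theta) / nobs

# Both scores vanish at exactly the same point...
print(score(theta_hat), score_scaled(theta_hat))  # 0.0 0.0
# ...but away from it their magnitudes differ by a factor of nobs,
# which is what makes the scaled objective look "flatter".
print(score(2.1), score_scaled(2.1))              # ~1000 vs ~0.1
```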