statsmodels: Linear mixed effects model causes console to die when data are too large
Whenever I run a linear mixed effects model on a large dataset, the fit appears to complete. However, when I query the result, I find that the call has caused the console I am using to lose all of its variables and imports. Note that my dataset has approximately 286k rows: big, but still reasonable. Any idea what's happening?
In [1]: import statsmodels.api as sm
In [2]: import statsmodels.formula.api as smf
In [3]: import pandas as pd
In [4]: data = pd.read_pickle("data.pkl")
This works fine:
In [5]: mod = smf.ols(formula='DV ~ IV', data=data).fit()
In [6]: mod.summary()
Out[6]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                     DV   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     1020.
Date:                Tue, 26 May 2020   Prob (F-statistic):          3.05e-223
Time:                        17:12:43   Log-Likelihood:             2.5317e+05
No. Observations:              218417   AIC:                        -5.063e+05
Df Residuals:                  218415   BIC:                        -5.063e+05
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.6118      0.000   1386.513      0.000       0.611       0.613
IV[T.leve_1]    -0.0152      0.000    -31.933      0.000      -0.016      -0.014
==============================================================================
Omnibus:                    39541.771   Durbin-Watson:                   1.866
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           217846.285
Skew:                          -0.765   Prob(JB):                         0.00
Kurtosis:                       7.647   Cond. No.                         5.26
==============================================================================
Now I try this:
In [8]: md = smf.mixedlm("DV ~ IV", data=data, groups=data["grouping_variable_1"], re_formula="~grouping_variable_2").fit()
In [9]: md.summary()
Traceback (most recent call last):
  File "<ipython-input-1-c7c36053ef65>", line 1, in <module>
    md.summary()
NameError: name 'md' is not defined
See #6766 for performance improvements related to this issue. Using the new code, we can fit a crossed variance components model to the Reddit data linked above. It is around 100x slower than R now, but that is a big improvement over what we had before, and the results agree, as expected. The relevant code is below. Further performance gains would probably require a sparse Cholesky solver.
Here is the Python code:
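(The original snippet is not preserved in this extract. What follows is a minimal sketch of a crossed variance components fit with smf.mixedlm, reusing the column names from the question above; the one_group constant and the vc dictionary keys are illustrative assumptions, not the code actually posted.)

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_pickle("data.pkl")

# statsmodels expresses crossed random effects as variance components
# inside a single group that spans the entire dataset.
data["one_group"] = 1  # hypothetical helper column

# One variance component per crossed factor; "0 +" drops the intercept
# so that every level of the factor gets its own random effect.
vc = {
    "g1": "0 + C(grouping_variable_1)",
    "g2": "0 + C(grouping_variable_2)",
}

model = smf.mixedlm("DV ~ IV", data=data, groups="one_group", vc_formula=vc)
result = model.fit()
print(result.summary())

With this formulation the random effects design is large but sparse, which is why a sparse Cholesky solver is the natural next step for performance.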
Here is the corresponding R code: