statsmodels: Linear mixed effects model causes console to die when data are too large

Whenever I run a linear mixed effect model on a large dataset, the model seems to compute. However, when I query it, I find instead that it’s caused the console I am using to lose all its variables and imports. Note that my dataset has approximately 286k lines––big, but still reasonable. Any idea what’s happening?

In [1]: import statsmodels.api as sm
In [2]: import statsmodels.formula.api as smf
In [3]: import pandas as pd
In [4]: data = pd.read_pickle("data.pkl")

This works fine:

In [5]: mod = smf.ols(formula='DV ~ IV', data=data).fit()
In [6]: mod.summary()

Out [7]: 

                            OLS Regression Results                            
======================================================================
Dep. Variable:                DV   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                          Least Squares   F-statistic:                     1020.
Date:                Tue, 26 May 2020   Prob (F-statistic):          3.05e-223
Time:                        17:12:43   Log-Likelihood:             2.5317e+05
No. Observations:              218417   AIC:                        -5.063e+05
Df Residuals:                  218415   BIC:                        -5.063e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
======================================================================
                         coef    std err          t            P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept       0.6118     0.000   1386.513     0.000       0.611       0.613
IV[T.leve_1]    -0.0152  0.000    -31.933      0.000      -0.016      -0.014
======================================================================
Omnibus:                    39541.771   Durbin-Watson:                   1.866
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           217846.285
Skew:                          -0.765   Prob(JB):                         0.00
Kurtosis:                       7.647   Cond. No.                         5.26
======================================================================

Now I try this:

 In [8]: md = smf.mixedlm("DV ~ IV", data = data,  groups=data["grouping_variable_1"], re_formula = "~grouping_variable_2").fit()

In [9]: md.summary()

Traceback (most recent call last):

  File "<ipython-input-1-c7c36053ef65>", line 1, in <module>
    md.summary()

NameError: name 'md' is not defined

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Comments: 19 (10 by maintainers)

Most upvoted comments

See #6766 for performance improvements related to this PR. Using the new code, we can fit a crossed variance components model to the Reddit data linked above. It is around 100x slower than R now, but this is actually a big improvement compared to what we had before. The results agree, as expected. Relevant code is below. To further improve performance would probably require the use of a sparse Cholesky solver.

Here is the Python code:

import pandas as pd
import statsmodels.api as sm
import numpy as np

df = pd.read_excel("reddit_masked_data.xlsx")
df = df.dropna()

oo = np.ones(df.shape[0])
vcf = {"redditor": "0+C(redditor)", "subreddit": "0+C(subreddit)"}
model = sm.MixedLM.from_formula("valence ~ condition", vc_formula=vcf, groups=oo, use_sparse=True, data=df)
result = model.fit()

# This should match the standard deviations of the random effects
# reported by R.  Also compare our scale parameter to the residual
# variance in R.
print(np.sqrt(result.vcomp))

Here is the corresponding R code:

library(readxl)
library(lme4)                                                                                                                                                                                                                                                                   df = read_excel("reddit_masked_data.xlsx")
df = df[complete.cases(df),]

m = lmer("valence ~ condition + (1 | redditor) + (1 | subreddit)", data=df)