statsmodels: Logit much slower than sklearn LogisticRegression
Describe the bug
Performance bug: statsmodels Logit regression is 10-100x slower than scikit-learn LogisticRegression.
I benchmarked both using the L-BFGS solver, with the same number of iterations and, as far as I can tell, the same other settings.
The speed difference seems to grow with larger data sets, and it grows enormously when using fit_regularized(). On a dataset of 10k samples x 1k features, fit_regularized() with an l1 penalty can take more than a day, whereas scikit-learn with an l2 penalty takes a couple of seconds.
On my production data, statsmodels is so slow as to be essentially unusable.
I’m also wondering whether the two versions of logit somehow work in completely different ways: is this purely a performance bug, or is the performance difference due to a significant difference in how the optimization problem is posed (and consequently accompanied by different results)?
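One difference in defaults that is independent of the solver: sklearn's LogisticRegression fits an intercept unless fit_intercept=False is passed, while sm.Logit uses the design matrix as given and adds no constant. The NumPy-only sketch below illustrates how much that alone can change the fitted model. It uses a small hand-written Newton solver as a stand-in for either library, and the helper names (fit_logit_newton, loglike) and the simulated data are assumptions made for illustration, not code from statsmodels or scikit-learn.

```python
import numpy as np


def fit_logit_newton(X, y, n_iter=25):
    """Unpenalized logistic regression via Newton's method (illustrative only)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (y - p)                       # gradient of the log-likelihood
        info = (X * (p * (1.0 - p))[:, None]).T @ X  # negative Hessian
        beta = beta + np.linalg.solve(info, score)
    return beta


def loglike(beta, X, y):
    """Log-likelihood of the logit model, written in a numerically stable form."""
    z = X @ beta
    return float(np.sum(y * z - np.logaddexp(0.0, z)))


# Simulated data with a non-zero true intercept (hypothetical values).
rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((n, 3))
z_true = 1.5 + X @ np.array([1.0, -0.5, 0.25])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-z_true))).astype(float)

# Fit without a constant column (how sm.Logit treats a raw X).
beta_no_const = fit_logit_newton(X, y)

# Fit with an explicit column of ones (comparable to fit_intercept=True).
X_const = np.column_stack([np.ones(n), X])
beta_const = fit_logit_newton(X_const, y)

ll_no_const = loglike(beta_no_const, X, y)
ll_const = loglike(beta_const, X_const, y)
print("log-likelihood without constant:", ll_no_const)
print("log-likelihood with constant:   ", ll_const)
```

If the two libraries are to be compared on results as well as speed, the design matrices would need to match in this respect, for example by adding a constant column on the statsmodels side or disabling the intercept on the sklearn side.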
Code Sample, a copy-pastable example if possible
import statsmodels.api as sm
import sklearn.datasets
import sklearn.linear_model # sklearn 0.22.1
X_train, y_train = sklearn.datasets.make_classification(n_samples=30000, n_features=2048, n_informative=2048, n_redundant=0, n_repeated=0)
%time model = sm.Logit(y_train, X_train).fit(method='lbfgs', pgtol=0.0001, maxiter=10, disp=True, qc_verbose=True)
# CPU times: user 37.5 s, sys: 1.4 s, total: 38.8 s Wall time: 9.98 s
%time model = sklearn.linear_model.LogisticRegression(max_iter=10, penalty='none', verbose=1).fit(X_train, y_train)
# CPU times: user 1.22 s, sys: 7.95 ms, total: 1.23 s Wall time: 339 ms
Both stop at max_iter in this example, so the result is not affected by the convergence criteria. If max_iter is increased, scikit-learn converges in 35 iterations and 3.6 CPU seconds, while statsmodels converges in 17 iterations and 38.8 CPU seconds.
Environment: AWS g4dn instance, AWS Deep Learning AMI Version 29.0
Expected Output
I would expect the run time to be similar when using the same solver and same data.
Output of import statsmodels.api as sm; sm.show_versions()
INSTALLED VERSIONS
Python: 3.6.10.final.0
OS: Linux 5.3.0-1017-aws #18~18.04.1-Ubuntu SMP Wed Apr 8 15:12:16 UTC 2020 x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
statsmodels
Installed: 0.11.0 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/statsmodels)
Required Dependencies
cython: 0.29.15 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/Cython)
numpy: 1.18.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numpy)
scipy: 1.4.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/scipy)
pandas: 1.0.3 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandas)
dateutil: 2.8.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/dateutil)
patsy: 0.5.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/patsy)
Optional Dependencies
matplotlib: 3.1.3 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/matplotlib)
backend: module://ipykernel.pylab.backend_inline
cvxopt: 1.2.0 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/cvxopt)
joblib: 0.14.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/joblib)
Developer Tools
IPython: 7.13.0 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/IPython)
jinja2: 2.11.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/jinja2)
sphinx: 2.4.4 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sphinx)
pygments: 2.6.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pygments)
pytest: 5.4.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pytest)
virtualenv: Not installed
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (14 by maintainers)
I checked with the profiler. In the last case (nobs=20000, k=2000), most of the time for setting up the model is spent in np.linalg.svd.
eg.
I don’t have a very recent sklearn.
It looks like statsmodels is spending a lot of time setting up the model.
The fit itself is a bit slower per iteration in statsmodels, but it converges faster.
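The split between setup cost and fitting cost described above can be examined in isolation. The sketch below is a rough illustration under stated assumptions, not a profile of statsmodels: it assumes the SVD seen in the profiler comes from an SVD-based rank computation on the design matrix, and it uses plain gradient steps as a stand-in for L-BFGS iterations. The sizes are much smaller than those in this thread so that it runs quickly, and the helper neg_loglike_grad is a hypothetical name.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
nobs, k = 2000, 400  # deliberately smaller than nobs=20000, k=2000
X = rng.standard_normal((nobs, k))
beta_true = rng.standard_normal(k) / np.sqrt(k)
y = (rng.random(nobs) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)


def neg_loglike_grad(beta, X, y):
    """Gradient of the mean negative log-likelihood of a logit model."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y) / len(y)


# Setup-style cost: an SVD-based rank computation on the design matrix.
t0 = time.perf_counter()
rank = np.linalg.matrix_rank(X)
t_rank = time.perf_counter() - t0

# Fit-style cost: ten gradient evaluations, one per simulated iteration.
beta = np.zeros(k)
t0 = time.perf_counter()
for _ in range(10):
    grad = neg_loglike_grad(beta, X, y)
    beta -= 0.1 * grad  # plain gradient step, standing in for L-BFGS
t_grad = time.perf_counter() - t0

print(f"rank={rank}")
print(f"rank computation:  {t_rank:.3f} s")
print(f"10 gradient steps: {t_grad:.3f} s")
```

Each gradient evaluation costs on the order of nobs * k operations, while a full SVD costs on the order of nobs * k^2, so with only a few iterations a one-time SVD can plausibly dominate the total run time for wide design matrices.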
reversing and adding constant to statsmodels X