statsmodels: Logit much slower than sklearn LogisticRegression

Describe the bug

Performance bug: statsmodels Logit regression is 10-100x slower than scikit-learn LogisticRegression.

I benchmarked both using the L-BFGS solver, with the same number of iterations and, as far as I can tell, the same other settings.

The speed difference seems to increase with larger data sets, and increases by a huge amount when using fit_regularized(). On a dataset of 10k samples x 1k features, fit_regularized() with an l1 penalty can take more than a day, whereas scikit-learn with an l2 penalty takes a couple of seconds.
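For reference, the regularized comparison looks roughly like the sketch below (the alpha/C values and sizes are illustrative, not my exact production settings):

import sklearn.datasets
import sklearn.linear_model
import statsmodels.api as sm

# Illustrative size; the "more than a day" case was roughly 10k samples x 1k features.
X, y = sklearn.datasets.make_classification(n_samples=10000, n_features=1000,
                                            n_informative=1000, n_redundant=0, n_repeated=0)

# statsmodels: l1-penalized logit (alpha chosen arbitrarily here)
sm_res = sm.Logit(y, X).fit_regularized(method='l1', alpha=1.0, disp=True)

# scikit-learn: l2-penalized logistic regression (its default penalty)
sk_res = sklearn.linear_model.LogisticRegression(penalty='l2', C=1.0, solver='lbfgs',
                                                 max_iter=100).fit(X, y)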

On my production data, statsmodels is so slow as to be essentially unusable.

I’m also wondering whether the two versions of logit work in completely different ways: is this purely a performance bug, or is the performance difference due to a significant difference in how the optimization problem is posed (and consequently different results)?
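To make that question concrete, my understanding (from the docs, not from reading the source) is that both solve the same unpenalized problem, and only the penalty term differs in the regularized variants:

\min_\beta \; -\sum_{i=1}^{n} \Big[ y_i x_i^\top \beta - \log\big(1 + e^{x_i^\top \beta}\big) \Big] + P(\beta)

with P(\beta) = 0 for statsmodels Logit.fit and scikit-learn penalty='none', P(\beta) = \tfrac{1}{2C}\lVert\beta\rVert_2^2 for scikit-learn's default l2 penalty, and P(\beta) = \alpha \lVert\beta\rVert_1 for statsmodels fit_regularized(method='l1') (up to scaling conventions, e.g. whether the log-likelihood is divided by n).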

Code Sample, a copy-pastable example if possible

import statsmodels.api as sm
import sklearn.datasets
import sklearn.linear_model # sklearn 0.22.1
X_train, y_train = sklearn.datasets.make_classification(n_samples=30000, n_features=2048, n_informative=2048, n_redundant=0, n_repeated=0)
%time model = sm.Logit(y_train, X_train).fit(method='lbfgs', pgtol=0.0001, maxiter=10, disp=True, qc_verbose=True)
# CPU times: user 37.5 s, sys: 1.4 s, total: 38.8 s Wall time: 9.98 s
%time model = sklearn.linear_model.LogisticRegression(max_iter=10, penalty='none', verbose=1).fit(X_train, y_train)
# CPU times: user 1.22 s, sys: 7.95 ms, total: 1.23 s Wall time: 339 ms

Both stop at max_iter in this example, so the result is not affected by the convergence criteria. If max_iter is increased, scikit-learn converges in 35 iterations and 3.6 CPU seconds, while statsmodels converges in 17 iterations and 38.8 CPU seconds.
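A rough way to sanity-check that the two unpenalized problems coincide is to compare the converged coefficients (a sketch, reusing X_train/y_train from above; fit_intercept=False so both fit the same design without an intercept):

import numpy as np

sm_res = sm.Logit(y_train, X_train).fit(method='lbfgs', maxiter=200, disp=False)
sk_res = sklearn.linear_model.LogisticRegression(penalty='none', solver='lbfgs',
                                                 fit_intercept=False, max_iter=200).fit(X_train, y_train)

# If both minimize the same unpenalized negative log-likelihood, the estimates
# should agree up to optimizer tolerance.
print(np.max(np.abs(sm_res.params - sk_res.coef_.ravel())))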

Environment: AWS g4dn instance, AWS Deep Learning AMI Version 29.0

Expected Output

I would expect the run time to be similar when using the same solver and same data.

Output of import statsmodels.api as sm; sm.show_versions()

INSTALLED VERSIONS

Python: 3.6.10.final.0
OS: Linux 5.3.0-1017-aws #18~18.04.1-Ubuntu SMP Wed Apr 8 15:12:16 UTC 2020 x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8

statsmodels

Installed: 0.11.0 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/statsmodels)

Required Dependencies

cython: 0.29.15 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/Cython)
numpy: 1.18.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numpy)
scipy: 1.4.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/scipy)
pandas: 1.0.3 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandas)
dateutil: 2.8.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/dateutil)
patsy: 0.5.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/patsy)

Optional Dependencies

matplotlib: 3.1.3 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/matplotlib)
backend: module://ipykernel.pylab.backend_inline
cvxopt: 1.2.0 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/cvxopt)
joblib: 0.14.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/joblib)

Developer Tools

IPython: 7.13.0 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/IPython)
jinja2: 2.11.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/jinja2)
sphinx: 2.4.4 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sphinx)
pygments: 2.6.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pygments)
pytest: 5.4.1 (/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pytest)
virtualenv: Not installed

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 20 (14 by maintainers)

Most upvoted comments

I checked with the profiler. In the last case (nobs=20000, k=2000), most of the time for setting up the model is spent in np.linalg.svd.

e.g.:

import numpy as np

%time s = np.linalg.matrix_rank(X_train)
# Wall time: 4.41 s
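Something along these lines reproduces that profiling (a sketch; the stats file name is arbitrary):

import cProfile
import pstats

# Profile only the model construction; as noted above, np.linalg.svd (via the
# rank check) dominates the setup time.
cProfile.run("Logit(y_train, X_train)", "logit_setup.prof")
pstats.Stats("logit_setup.prof").sort_stats("cumtime").print_stats(15)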

I don’t have a very recent sklearn.

It looks like statsmodels is spending a lot of time setting up the model.

The fit itself is a bit slower per iteration in statsmodels, but converges faster.

import sklearn.linear_model
from statsmodels.discrete.discrete_model import Logit
import sklearn.datasets

k = 1000
X_train, y_train = sklearn.datasets.make_classification(n_samples=30000, n_features=k, n_informative=k, n_redundant=0, n_repeated=0)

%time model = Logit(y_train, X_train)
# Wall time: 4.19 s
%time model.fit(method='lbfgs', pgtol=0.0001, disp=True, skip_hessian=True)
# Wall time: 956 ms

%time model = sklearn.linear_model.LogisticRegression(solver='lbfgs', penalty='none', verbose=1).fit(X_train, y_train)
# Wall time: 1.03 s

Reversing the order (scikit-learn first) and adding a constant column to the statsmodels X:

k = 2000
X_train, y_train = sklearn.datasets.make_classification(n_samples=20000, n_features=k, n_informative=k, n_redundant=0, n_repeated=0)

%time model = sklearn.linear_model.LogisticRegression(max_iter=100, solver='lbfgs', penalty='none', verbose=1).fit(X_train, y_train)
# Wall time: 1.4 s

X_train[:, 0] = 1
%time model = Logit(y_train, X_train, hasconst=True)
# Wall time: 4.12 s
%time model.fit(maxiter=100, method='lbfgs', pgtol=0.0001, disp=True, skip_hessian=True)
# Wall time: 1.44 s
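As a possibly useful aside: more recent statsmodels releases document a check_rank argument on the discrete models that skips this rank check at construction time. A minimal sketch, assuming a version whose Logit signature supports it:

from statsmodels.discrete.discrete_model import Logit

# check_rank=False skips the SVD-based rank check during model setup, which is
# where most of the construction time above goes (only available in versions
# that document the check_rank argument).
model = Logit(y_train, X_train, check_rank=False)
res = model.fit(method='lbfgs', pgtol=0.0001, disp=True, skip_hessian=True)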