scikit-learn: LogisticRegression memory consumption goes crazy on 0.22+
Describe the bug
Starting with scikit-learn 0.22, LogisticRegression.fit consumes far more RAM than it did on 0.21.3 for the same workload.
Steps/Code to Reproduce
import io

import pandas as pd
import requests
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Download the Stack Overflow posts/tags dataset (.csv file).
url = "https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv"
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))

# Keep labelled rows, subsample half of them, and shuffle deterministically.
df = df[pd.notnull(df['tags'])]
df = df.sample(frac=0.5, random_state=99).reset_index(drop=True)
df = shuffle(df, random_state=22)
df = df.reset_index(drop=True)

# Encode the string tags as integer class labels.
df['class_label'] = df['tags'].factorize()[0]

df_train, df_test = train_test_split(df, test_size=0.2, random_state=40)
X_train = df_train["post"].tolist()
X_test = df_test["post"].tolist()
y_train = df_train["class_label"].tolist()
y_test = df_test["class_label"].tolist()

# Binary bag-of-ngrams features (1- to 3-grams) -> a very wide sparse matrix.
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 3), stop_words='english', binary=True)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

# Memory blows up during this fit on 0.22+ (the default solver in 0.22 is lbfgs).
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors, y_train)
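To quantify the growth without watching top, the sketch below samples the resident set size in a background thread while the fit runs. It assumes psutil is installed and reuses logreg, train_vectors and y_train from the reproduction script above; the helper name watch_rss is just for illustration.

import threading
import time

import psutil  # assumption: psutil is available in the environment

def watch_rss(stop_event, interval=1.0):
    """Print this process's resident set size every `interval` seconds."""
    proc = psutil.Process()
    while not stop_event.is_set():
        rss_gb = proc.memory_info().rss / 1024 ** 3
        print(f"RSS: {rss_gb:.2f} GB")
        time.sleep(interval)

stop = threading.Event()
monitor = threading.Thread(target=watch_rss, args=(stop,), daemon=True)
monitor.start()
logreg.fit(train_vectors, y_train)  # the call whose memory use regresses on 0.22+
stop.set()
monitor.join()

If the report above holds, the printed RSS should stay around 1 GB on 0.21.3 and climb past 9 GB on 0.22+.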
Expected Results
Memory consumption on 0.22+ should stay roughly the same as on 0.21.3.
Actual Results
0.22+ behavior (tried 0.22.0, 0.22.1, 0.22.2.post1):
When run inside a container with limited memory (1-2 GB), the code crashes (killed by the OOM killer).
Locally, top -o mem shows memory consumption growing to 9 GB and continuing to increase.
0.21.3 behavior:
Everything works fine within a 1 GB container.
Locally, top -o mem never shows memory consumption above 1 GB.
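In a stripped-down container there may be no top or psutil available, so a stdlib-only sketch like the one below can record the peak usage instead; it assumes Linux, where ru_maxrss is reported in kilobytes, and it reuses train_vectors and y_train from the reproduction script. The numbers quoted above come from top, not from this snippet.

import resource

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors, y_train)

# On Linux, ru_maxrss is the peak resident set size of this process in kilobytes.
peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2
print(f"peak RSS: {peak_gb:.2f} GB")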
Versions
System:
    python: 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
    executable: /usr/bin/python3
    machine: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
    pip: 19.3.1
    setuptools: 44.0.0
    sklearn: 0.22.1
    numpy: 1.18.1
    scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
    matplotlib: 3.1.2
    joblib: 0.14.1
Built with OpenMP: True
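The report above appears to match the layout produced by sklearn.show_versions(); running the snippet below under each scikit-learn version being compared collects the same information.

# Prints the System / Python dependencies / OpenMP report shown above.
import sklearn
sklearn.show_versions()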
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 32 (26 by maintainers)
If this memory consumption is a problem in the lbfgs solver of scipy, should we open an issue upstream in scipy?
@rth I’ll look into this!
and
I consider it a solved issue.
Imperfectly done in https://github.com/scipy/scipy/issues/19396.
To me reverting to OVR seems pretty drastic, and maybe we should see if we can fix the issue? We could also try using another off-the-shelf scipy solver? I assume the Fortran code requires the bounds array even though there are no bounds? Has anyone looked at the ‘trust-ncg’ or ‘trust-krylov’ algorithms? Are they feasible to use here? liblinear implements a trust-region Newton, right?
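For anyone hitting this in the meantime, one possible mitigation (an assumption based on the discussion above, not a confirmed fix) is to sidestep the multinomial lbfgs path entirely by requesting one-vs-rest with the liblinear solver, which implements the trust-region Newton method mentioned in the comment above:

from sklearn.linear_model import LogisticRegression

# Same data as the reproduction script; only the solver/multi_class choice changes.
# liblinear fits one binary problem per class (OVR) and does not go through
# scipy's lbfgs, so it should be unaffected if the regression is specific to lbfgs.
logreg = LogisticRegression(solver='liblinear', multi_class='ovr', C=1e5)
logreg.fit(train_vectors, y_train)

Whether scipy's 'trust-ncg' or 'trust-krylov' methods are viable drop-in replacements is a separate question: both require a Hessian or Hessian-vector product to be wired up on the scikit-learn side, whereas lbfgs only needs the gradient.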