scikit-learn: LogisticRegression memory consumption goes crazy on 0.22+

Describe the bug

LogisticRegression started to consume excessive amounts of RAM on 0.22 and later.

Steps/Code to Reproduce

import io

import requests
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle


url = "https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv" # .csv file location
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))

df = df[pd.notnull(df['tags'])]
df = df.sample(frac=0.5, random_state=99).reset_index(drop=True)
df = shuffle(df, random_state=22)
df = df.reset_index(drop=True)
df['class_label'] = df['tags'].factorize()[0]


df_train, df_test = train_test_split(df, test_size=0.2, random_state=40)

X_train = df_train["post"].tolist()
X_test = df_test["post"].tolist()
y_train = df_train["class_label"].tolist()
y_test = df_test["class_label"].tolist()

vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 3), stop_words='english', binary=True)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors, y_train)

Expected Results

Memory consumption on 0.22+ should stay roughly the same as on 0.21.3.

Actual Results

0.22+ behavior (tried 0.22.0, 0.22.1, 0.22.2.post1):

If run inside a container with limited memory (1-2 GB), the process is killed by the OOM killer.

Locally, top -o mem shows memory consumption growing to 9 GB and continuing to increase.

0.21.3 behavior:

Everything works fine within a 1 GB container.

Locally, top -o mem never shows memory consumption exceeding 1 GB.
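
A more precise way to quantify this than watching top is to wrap the fit call with memory_profiler, the same tool behind the %memit measurements quoted below (a sketch; it assumes memory_profiler is installed and reuses train_vectors / y_train from the reproduction script):

from memory_profiler import memory_usage

logreg = LogisticRegression(n_jobs=1, C=1e5)
# memory_usage calls logreg.fit(train_vectors, y_train) and samples the
# process memory while it runs; max_usage=True returns only the peak value.
peak = memory_usage((logreg.fit, (train_vectors, y_train)), interval=0.5, max_usage=True)
print("peak memory during fit (MiB):", peak)

Running this once per scikit-learn version gives directly comparable numbers for the 0.21.3 vs 0.22+ behaviour described above.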

Versions

System:
    python: 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
executable: /usr/bin/python3
   machine: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
       pip: 19.3.1
setuptools: 44.0.0
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 32 (26 by maintainers)

Most upvoted comments

If this memory consumption is a problem in the lbfgs solver of scipy, should we open an issue upstream in scipy?

@rth I’ll look into this!

%memit LogisticRegression(multi_class='multinomial', solver="newton-cg").fit(X, y)
peak memory: 458.65 MiB, increment: 284.30 MiB

and

%memit LogisticRegression(multi_class='multinomial', solver="saga").fit(X, y)
peak memory: 244.97 MiB, increment: 73.31 MiB

I consider this a solved issue.
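
Based on those measurements, a possible workaround on the affected 0.22.x versions is to select one of the lower-memory solvers explicitly instead of the default lbfgs (a sketch reusing the reproduction script's data; only the solver argument differs from the original call):

# Same model as in the reproduction, but with the saga solver, which the
# %memit numbers above show peaking at a fraction of the memory;
# newton-cg is the other option measured above.
logreg = LogisticRegression(n_jobs=1, C=1e5, solver='saga')
logreg.fit(train_vectors, y_train)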

To me reverting to OVR seems pretty drastic, and maybe we should see if we can fix the issue? We could also try using another off-the-shelf scipy solver? I assume the fortran code requires the bounds array even though there are no bounds? Has anyone looked at the ‘trust-ncg’ or ‘trust-krylov’ algorithms? Are they feasible to use here? liblinear implements a trust-region newton, right?
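
For anyone who wants to compare the memory behaviour of those scipy solvers directly, a toy probe like the following can help (a sketch using a synthetic quadratic objective rather than scikit-learn's actual multinomial loss; it assumes scipy >= 1.0 and memory_profiler are installed):

import numpy as np
from scipy.optimize import minimize
from memory_profiler import memory_usage

n = 500_000                 # mimic a very wide (n-gram sized) parameter vector
x0 = np.zeros(n)

def f(x):                   # simple convex objective: 0.5*||x||^2 - sum(x)
    return 0.5 * np.dot(x, x) - x.sum()

def grad(x):                # gradient of f
    return x - 1.0

def hessp(x, p):            # Hessian is the identity here, so H @ p == p
    return p

for method, extra in [
    ("L-BFGS-B", {}),
    ("trust-ncg", {"hessp": hessp}),
    ("trust-krylov", {"hessp": hessp}),
]:
    # memory_usage runs minimize(f, x0, method=..., jac=grad, ...) and
    # reports the peak memory observed while it runs.
    peak = memory_usage(
        (minimize, (f, x0), dict(method=method, jac=grad, **extra)),
        max_usage=True,
    )
    print(method, "peak memory (MiB):", peak)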