scikit-learn: LogisticRegression memory consumption goes crazy on 0.22+

Describe the bug

LogisticRegression started to consume excessive amounts of RAM on 0.22 and later.

Steps/Code to Reproduce

import io

import requests
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle


url = "https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv" # .csv file location
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))

df = df[pd.notnull(df['tags'])]
df = df.sample(frac=0.5, random_state=99).reset_index(drop=True)
df = shuffle(df, random_state=22)
df = df.reset_index(drop=True)
df['class_label'] = df['tags'].factorize()[0]


df_train, df_test = train_test_split(df, test_size=0.2, random_state=40)

X_train = df_train["post"].tolist()
X_test = df_test["post"].tolist()
y_train = df_train["class_label"].tolist()
y_test = df_test["class_label"].tolist()

vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 3), stop_words='english', binary=True)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors, y_train)

Expected Results

Memory consumption on 0.22+ should stay roughly the same as on 0.21.3.

Actual Results

0.22+ behavior (tried 0.22.0, 0.22.1, 0.22.2.post1):

If run inside a container with limited memory (1-2 GB), the process is killed by the OOM killer.

Locally, top -o mem shows memory consumption growing to 9 GB and continuing to increase.

0.21.3 behavior:

Everything works fine within a 1 GB container.

Locally, top -o mem never shows memory consumption exceeding 1 GB.
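
A more precise way to quantify this than watching top is to wrap the fit call with memory_profiler, the same tool behind the %memit measurements quoted below (a sketch; it assumes memory_profiler is installed and reuses train_vectors / y_train from the reproduction script):

from memory_profiler import memory_usage

logreg = LogisticRegression(n_jobs=1, C=1e5)
# memory_usage calls logreg.fit(train_vectors, y_train) and samples the
# process memory while it runs; max_usage=True returns only the peak value.
peak = memory_usage((logreg.fit, (train_vectors, y_train)), interval=0.5, max_usage=True)
print("peak memory during fit (MiB):", peak)

Running this once per scikit-learn version gives directly comparable numbers for the 0.21.3 vs 0.22+ behaviour described above.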

Versions

System:
    python: 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
executable: /usr/bin/python3
   machine: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
       pip: 19.3.1
setuptools: 44.0.0
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 32 (26 by maintainers)

Most upvoted comments

If this memory consumption is a problem in the lbfgs solver of scipy, should we open an issue upstream in scipy?

@rth I’ll look into this!

%memit LogisticRegression(multi_class='multinomial', solver="newton-cg").fit(X, y)
peak memory: 458.65 MiB, increment: 284.30 MiB

and

%memit LogisticRegression(multi_class='multinomial', solver="saga").fit(X, y)
peak memory: 244.97 MiB, increment: 73.31 MiB

I consider this a solved issue.
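
Based on those measurements, a possible workaround on the affected 0.22.x versions is to select one of the lower-memory solvers explicitly instead of the default lbfgs (a sketch reusing the reproduction script's data; only the solver argument differs from the original call):

# Same model as in the reproduction, but with the saga solver, which the
# %memit numbers above show peaking at a fraction of the memory;
# newton-cg is the other option measured above.
logreg = LogisticRegression(n_jobs=1, C=1e5, solver='saga')
logreg.fit(train_vectors, y_train)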

To me reverting to OVR seems pretty drastic, and maybe we should see if we can fix the issue? We could also try using another off-the-shelf scipy solver? I assume the fortran code requires the bounds array even though there are no bounds? Has anyone looked at the ‘trust-ncg’ or ‘trust-krylov’ algorithms? Are they feasible to use here? liblinear implements a trust-region newton, right?
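
For anyone who wants to compare the memory behaviour of those scipy solvers directly, a toy probe like the following can help (a sketch using a synthetic quadratic objective rather than scikit-learn's actual multinomial loss; it assumes scipy >= 1.0 and memory_profiler are installed):

import numpy as np
from scipy.optimize import minimize
from memory_profiler import memory_usage

n = 500_000                 # mimic a very wide (n-gram sized) parameter vector
x0 = np.zeros(n)

def f(x):                   # simple convex objective: 0.5*||x||^2 - sum(x)
    return 0.5 * np.dot(x, x) - x.sum()

def grad(x):                # gradient of f
    return x - 1.0

def hessp(x, p):            # Hessian is the identity here, so H @ p == p
    return p

for method, extra in [
    ("L-BFGS-B", {}),
    ("trust-ncg", {"hessp": hessp}),
    ("trust-krylov", {"hessp": hessp}),
]:
    # memory_usage runs minimize(f, x0, method=..., jac=grad, ...) and
    # reports the peak memory observed while it runs.
    peak = memory_usage(
        (minimize, (f, x0), dict(method=method, jac=grad, **extra)),
        max_usage=True,
    )
    print(method, "peak memory (MiB):", peak)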