scikit-learn: Error when using PCA on MNIST dataset

Description

I tried to use PCA with the default configuration on the MNIST dataset, and it gave a memory allocation error. The code below worked well in the previous version of scikit-learn (0.17). For this dataset the PCA class now chooses the randomized SVD solver, whereas the previous version effectively behaved like svd_solver='full'. If I pass svd_solver='full' explicitly, it works fine, but when I leave PCA to pick the best solver, it chooses the randomized solver and the code fails with a memory allocation error when it calls linalg.lu() from SciPy.
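The workaround mentioned above can be sketched as follows. This is a minimal illustration rather than the reporter's code: the random array is an assumed stand-in for the MNIST training matrix so that the snippet runs without downloading data, and the only relevant part is the explicit svd_solver argument.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the MNIST training matrix (assumption):
# 1000 samples with 784 features, the same feature count as MNIST.
rng = np.random.RandomState(0)
X = rng.rand(1000, 784)

# Request the exact solver explicitly instead of leaving the
# choice to the default svd_solver='auto'.
pca = PCA(n_components=100, svd_solver='full')
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
```

Passing svd_solver='full' trades speed for the exact decomposition, so on much larger inputs it may be noticeably slower than the randomized solver.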

Steps/Code to Reproduce

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA

mnist = fetch_mldata('MNIST original', data_home='./data')
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.3, random_state=0)

pca = PCA(n_components=100)
pca.fit_transform(X_train)

Error

Python(17010,0x7fffa552c3c0) malloc: *** mach_vm_map(size=18446744066138652672) failed (error code=3) *** error: can’t allocate region *** set a breakpoint in malloc_error_break to debug
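A side observation on the number in the message, offered as a reading aid rather than a confirmed diagnosis: the requested size is just below 2**64, which is what a negative value looks like when it is reinterpreted as an unsigned 64-bit integer. The short check below does that conversion; the variable names are illustrative only.

```python
# Size reported by malloc in the error message above.
requested = 18446744066138652672

# Reinterpret the unsigned 64-bit value as a signed 64-bit integer.
signed = requested - 2**64 if requested >= 2**63 else requested

print(signed)  # -7570898944
```

A negative size of this kind would be consistent with an integer overflow somewhere in the allocation path, but nothing in this thread establishes where it occurs.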

Versions

Darwin-16.0.0-x86_64-i386-64bit
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 37 (24 by maintainers)

Most upvoted comments

@amueller TruncatedSVD(algorithm='arpack', n_components=150) fits the model without any error.
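For readers who want to try the ARPACK route described in the comment above, here is a small sketch. The matrix shape, density, and component count are assumptions chosen to keep the run short; they are not the sizes used in this thread.

```python
import scipy.sparse
from sklearn.decomposition import TruncatedSVD

# Small random sparse matrix (assumed sizes, much smaller than
# the matrices discussed in this issue).
X = scipy.sparse.random(1000, 2000, density=0.01,
                        format='csr', random_state=0)

# Use the ARPACK-based solver instead of the default
# algorithm='randomized'.
svd = TruncatedSVD(n_components=20, algorithm='arpack')
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
```

With algorithm='arpack', n_components must be strictly smaller than the number of features.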

I have the same problem (the malloc error). I pass a large (size > 10M) sparse matrix to TruncatedSVD. With the defaults, c = TruncatedSVD() followed by bow = c.fit(X_train_counts) works fine, but when I set n_components=30 on the same large sparse matrix, the error occurs. The problem appeared after updating to sklearn 0.18; it was working fine with the same dataset before the update (0.17 or 0.16, I think).

To replicate:

from sklearn.decomposition import TruncatedSVD
import scipy.sparse

r = scipy.sparse.rand(10000, 110000)
c = TruncatedSVD(n_components=150)
c.fit(r)

Replication works fine on win64 with Python 3.5.2, numpy 1.11.2, scipy 0.18.1, sklearn 0.18.

MacBookPro12, 8 GB, Python 3.5.1, numpy 1.11.2, scipy 0.18.1, sklearn 0.18.

I tried to run this program on a virtual machine running Ubuntu (1 GB RAM), and it executed without any error. The memory allocation error occurs only when I run it on Mac OS, even though that machine has 8 GB of RAM.