scikit-learn: Error when using PCA on MNIST dataset
Description
I tried to use PCA with the default configuration on the MNIST dataset, and it gave a memory allocation error. The code below worked well in the previous version of scikit-learn (0.17). The PCA class chooses the randomized SVD solver for this dataset, whereas in the previous version it was effectively svd_solver='full'. If I explicitly pass svd_solver='full', it works fine, but when I leave it to pick the best solver, it chooses the randomized SVD solver and the code fails with a memory allocation error when it tries to compute linalg.lu() using SciPy.
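The workaround described above (forcing the full solver instead of relying on the default) can be sketched as follows. This is a minimal illustration, not a fix: the random matrix `X` is a small stand-in for MNIST so that the snippet runs without downloading data, and its shape is an assumption chosen only so that the input is larger than 500x500 with n_components well below the smallest dimension, which is the case the 0.18 documentation describes as being routed to the randomized solver under svd_solver='auto'.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the MNIST training matrix (assumption: any dense
# array larger than 500x500 is enough to illustrate the solver choice).
rng = np.random.RandomState(0)
X = rng.rand(600, 550)

# Default configuration: the solver is picked automatically from X.shape
# and n_components (this is the configuration that failed in the report).
pca_auto = PCA(n_components=100)

# Workaround from the report: request the exact LAPACK-based solver explicitly.
pca_full = PCA(n_components=100, svd_solver='full')
X_reduced = pca_full.fit_transform(X)

print(pca_auto.svd_solver, pca_full.svd_solver)
print(X_reduced.shape)
```

The explicit solver costs more time on large inputs, but it avoids the randomized code path that calls linalg.lu().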
Steps/Code to Reproduce
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA
mnist = fetch_mldata('MNIST original', data_home='./data')
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.3, random_state=0)
pca = PCA(n_components=100)
pca.fit_transform(X_train)
Error
Python(17010,0x7fffa552c3c0) malloc: *** mach_vm_map(size=18446744066138652672) failed (error code=3) *** error: can’t allocate region *** set a breakpoint in malloc_error_break to debug
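One observation about the size in the message above: it is just below 2**64, which would be consistent with a negative byte count being reinterpreted as an unsigned 64-bit integer. This is only arithmetic on the reported number, not a confirmed diagnosis of the root cause. A quick check:

```python
# Size reported by malloc in the error message above.
reported_size = 18446744066138652672

# Reinterpret the unsigned 64-bit value as a signed 64-bit integer.
as_signed = reported_size - 2**64 if reported_size >= 2**63 else reported_size

print(as_signed)
```

If that reading is right, the requested size was a negative number of bytes rather than a genuinely huge allocation.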
Versions
Darwin-16.0.0-x86_64-i386-64bit [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] NumPy 1.11.2 SciPy 0.18.1 Scikit-Learn 0.18
About this issue
- State: closed
- Created 8 years ago
- Comments: 37 (24 by maintainers)
@amueller
TruncatedSVD(algorithm='arpack', n_components=150)
generates the fitting model with no error.
I have the same problem (malloc). When I pass a large (size > 10M) sparse numpy matrix to
c = TruncatedSVD()
bow = c.fit(X_train_counts)
it works fine, but when I set n_components=30 (with the same large sparse numpy matrix) the malloc error occurs. The problem appeared after updating to sklearn 0.18; it was working fine with the same dataset before the update (0.17 or 0.16, I think).
To replicate:
from sklearn.decomposition import TruncatedSVD
import scipy.sparse
r = scipy.sparse.rand(10000,110000)
c = TruncatedSVD(n_components=150)
c.fit(r)
Replication works fine on win64, py3.5.2, numpy.version.version: '1.11.2', scipy.version.version: '0.18.1', sklearn.version: '0.18'
MacBookPro12, 8 GB, Python 3.5.1, numpy.version.version: '1.11.2', scipy.version.version: '0.18.1', sklearn.version: '0.18'
I tried to run this program on a virtual machine with Ubuntu (1 GB RAM), and it executed without any error. The memory allocation error occurs only when I run it on macOS, even though that machine has 8 GB of RAM.
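For anyone who wants to compare the two TruncatedSVD code paths mentioned in this thread without allocating the full 10000 x 110000 matrix, here is a scaled-down sketch. The shape, density and n_components are assumptions chosen to keep the run small; it uses scipy.sparse.random instead of scipy.sparse.rand, and it will not necessarily reproduce the malloc failure, which was only reported on macOS.

```python
import scipy.sparse
from sklearn.decomposition import TruncatedSVD

# Scaled-down stand-in for the sparse matrix in the replication snippet
# (assumption: 1000 x 2000 at 1% density, picked only to keep this fast).
r = scipy.sparse.random(1000, 2000, density=0.01, format='csr')

# ARPACK-based solver, reported in this thread to fit without error.
svd_arpack = TruncatedSVD(n_components=20, algorithm='arpack')
Z_arpack = svd_arpack.fit_transform(r)

# Randomized solver, the TruncatedSVD default and the path that hit the error.
svd_randomized = TruncatedSVD(n_components=20, algorithm='randomized')
Z_randomized = svd_randomized.fit_transform(r)

print(Z_arpack.shape, Z_randomized.shape)
```

If the randomized variant fails on a given machine while the arpack variant succeeds, that points to the randomized range finder rather than to TruncatedSVD as a whole.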