scikit-learn: KMeans and memory overflow
Description
I am wondering whether clustering 250000 samples into 6000 clusters with k-means is simply too hard a problem to compute, because it kills even a server with 12 cores, 258 GB RAM and 60 GB swap.
Similar “questions”:
- python memory error for kmeans in scikit-learn
- Memory Error when fitting the data using sklearn package
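A back-of-the-envelope estimate (an assumption based on the Elkan code path visible in the traceback below, not a measured profile) suggests why this scale is problematic: Elkan's algorithm keeps a float64 bound matrix of shape (n_samples, n_clusters) per k-means run, and n_jobs=20 runs them in parallel.

```python
# Rough memory estimate for the Elkan bound matrix (assumption, not measured).
n_samples, n_clusters, n_jobs = 250_000, 6_000, 20
bytes_per_run = n_samples * n_clusters * 8   # one float64 (n_samples, n_clusters) matrix
gib_per_run = bytes_per_run / 2**30
print(round(gib_per_run, 1))        # GiB per run
print(round(gib_per_run * n_jobs))  # GiB across 20 workers
```

This comes to roughly 11.2 GiB per run and about 224 GiB across the 20 workers, which by itself is close to the machine's 258 GB of RAM before counting the data, centers and other allocations.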
Code to Reproduce
The use case is the following:
import numpy as np
from sklearn import cluster
locations = np.random.random((250000, 2)) * 5
kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
                       verbose=True, n_jobs=20, copy_x=False,
                       precompute_distances=False)
kmean.fit(locations)
print (kmean.cluster_centers_)
Actual Results
Iteration 35, inertia 156.384475435
center shift 7.768886e-03 within tolerance 2.084699e-04
Traceback (most recent call last):
File "test_kmeans.py", line 8, in <module>
kmean.fit(locations)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 889, in fit
return_n_iter=True)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 362, in k_means
for seed in seeds)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 768, in __call__
self.retrieve()
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 719, in retrieve
raise exception
sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/mnt/datagrid/personal/borovec/Dropbox/Workspace/Uplus_fraud-monitoring/test_kmeans.py in <module>()
3
4 locations = np.random.random((250000, 2)) * 5
5 kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
6 verbose=True, n_jobs=20, copy_x=False,
7 precompute_distances=False)
----> 8 kmean.fit(locations)
9 print (kmean.cluster_centers_)
10
11
12
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in fit(self=KMeans(algorithm='auto', copy_x=False, init='k-m...
random_state=None, tol=0.0001, verbose=True), X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), y=None)
884 X, n_clusters=self.n_clusters, init=self.init,
885 n_init=self.n_init, max_iter=self.max_iter, verbose=self.verbose,
886 precompute_distances=self.precompute_distances,
887 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
888 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 889 return_n_iter=True)
890 return self
891
892 def fit_predict(self, X, y=None):
893 """Compute cluster centers and predict cluster index for each sample.
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in k_means(X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), n_clusters=6000, init='k-means++', precompute_distances=False, n_init=10, max_iter=150, verbose=True, tol=0.00020846993669604294, random_state=<mtrand.RandomState object>, copy_x=False, n_jobs=20, algorithm='elkan', return_n_iter=True)
357 verbose=verbose, tol=tol,
358 precompute_distances=precompute_distances,
359 x_squared_norms=x_squared_norms,
360 # Change seed to ensure variety
361 random_state=seed)
--> 362 for seed in seeds)
seeds = array([ 968587040, 226617041, 2063896048, 6552... 393005117, 134324550, 14152465, 2054736812])
363 # Get results with the lowest inertia
364 labels, inertia, centers, n_iters = zip(*results)
365 best = np.argmin(inertia)
366 best_labels = labels[best]
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=20), iterable=<generator object <genexpr>>)
763 if pre_dispatch == "all" or n_jobs == 1:
764 # The iterable was consumed all at once by the above for loop.
765 # No need to wait for async callbacks to trigger to
766 # consumption.
767 self._iterating = False
--> 768 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=20)>
769 # Make sure that we get a last message telling us we are done
770 elapsed_time = time.time() - self._start_time
771 self._print('Done %3i out of %3i | elapsed: %s finished',
772 (len(self._output), len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
MemoryError Tue Oct 17 16:11:14 2017
PID: 18062 Python 2.7.9: /mnt/home.dokt/borovji3/vEnv/bin/python
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function _kmeans_single_elkan>
args = (memmap([[-1.86344999, 0.05621132],
[ 0....1.20243728],
[ 0.97877704, 1.24561138]]), 6000)
kwargs = {'init': 'k-means++', 'max_iter': 150, 'precompute_distances': False, 'random_state': 134324550, 'tol': 0.00020846993669604294, 'verbose': True, 'x_squared_norms': memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219])}
self.items = [(<function _kmeans_single_elkan>, (memmap([[-1.86344999, 0.05621132],
[ 0....1.20243728],
[ 0.97877704, 1.24561138]]), 6000), {'init': 'k-means++', 'max_iter': 150, 'precompute_distances': False, 'random_state': 134324550, 'tol': 0.00020846993669604294, 'verbose': True, 'x_squared_norms': memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219])})]
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in _kmeans_single_elkan(X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), n_clusters=6000, max_iter=150, init='k-means++', verbose=True, x_squared_norms=memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219]), random_state=<mtrand.RandomState object>, tol=0.00020846993669604294, precompute_distances=False)
394 x_squared_norms=x_squared_norms)
395 centers = np.ascontiguousarray(centers)
396 if verbose:
397 print('Initialization complete')
398 centers, labels, n_iter = k_means_elkan(X, n_clusters, centers, tol=tol,
--> 399 max_iter=max_iter, verbose=verbose)
max_iter = 150
verbose = True
400 inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
401 return labels, inertia, centers, n_iter
402
403
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/_k_means_elkan.so in sklearn.cluster._k_means_elkan.k_means_elkan (sklearn/cluster/_k_means_elkan.c:6961)()
225
226
227
228
229
--> 230
231
232
233
234
MemoryError:
___________________________________________________________________________
Versions
Python 2.7.9 (default, Mar 1 2015, 12:57:24) [GCC 4.9.2] on linux2
numpy==1.13.1
scipy==0.19.1
scikit-learn==0.18.1
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 20 (18 by maintainers)
Try:
algorithm="full"
I confirm that with #11950 I can run your script on my laptop without memory error.
@Borda I used k-means from OpenCV. It has no memory problems and is parallelized by default.
It seems to help with memory usage, but on the other hand it seems to freeze somewhere after initialization, with zero CPU usage, as if it were stuck on synchronization…
(If this solves the problem, please add it as a suggestion to the documentation.)