scikit-learn: KMeans IndexError when using >4 clusters
Description
Fitting a (20242848, 5) NumPy array with more than 4 clusters results in an IndexError.
Steps/Code to Reproduce
from osgeo import gdal  # import was missing from the snippet
from sklearn.cluster import KMeans
src = gdal.Open(fname)
img = src.ReadAsArray()
# reorder (bands, rows, cols) -> (rows, cols, bands), then flatten to (pixels, bands)
X = img.transpose(1, 2, 0).reshape(-1, img.shape[0])
print X.shape
(20242848, 5)
k_means = KMeans(n_clusters=4, n_jobs=-1, random_state=10)
k_means.fit(X)
labels = k_means.labels_
print labels.shape
(20242848,)
k_means = KMeans(n_clusters=5, n_jobs=-1, random_state=10)
k_means.fit(X)
labels = k_means.labels_
print labels.shape
---------------------------------------------------------------------------
JoblibIndexError Traceback (most recent call last)
<ipython-input-22-379757258e42> in <module>()
1 k_means = KMeans(n_clusters=5, n_jobs=-1, random_state=10)
----> 2 k_means.fit(X)
3 labels = k_means.labels_
4
5 print labels.shape
[SHORTENED]
IndexError: index 20242848 is out of bounds for axis 0 with size 20242848
___________________________________________________________________________
Expected Results
(20242848,)
Actual Results
IndexError: index 20242848 is out of bounds for axis 0 with size 20242848
Versions
Windows-7-6.1.7601-SP1
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
About this issue
- State: closed
- Created 8 years ago
- Comments: 25 (14 by maintainers)
The issue appears to persist in 0.19.1
Running on a float32 HDF5 data table (opened with PyTables) of shape (175782224, 5), I got the following error after several successful initializations:
Init 1/10 with method: k-means++
Inertia for init 1/10: 11.480348
Init 2/10 with method: k-means++
Inertia for init 2/10: 11.541335
Init 3/10 with method: k-means++
Inertia for init 3/10: 11.595790
Init 4/10 with method: k-means++
Inertia for init 4/10: 11.604136
Init 5/10 with method: k-means++
Inertia for init 5/10: 11.635662
Init 6/10 with method: k-means++
Inertia for init 6/10: 11.514906
Init 7/10 with method: k-means++
Traceback (most recent call last):
  File "", line 16, in <module>
  File "C:\Python27\ArcGISx6410.5\lib\site-packages\sklearn\cluster\k_means_.py", line 1418, in fit
    init_size=init_size)
  File "C:\Python27\ArcGISx6410.5\lib\site-packages\sklearn\cluster\k_means_.py", line 684, in _init_centroids
    x_squared_norms=x_squared_norms)
  File "C:\Python27\ArcGISx6410.5\lib\site-packages\sklearn\cluster\k_means_.py", line 113, in _k_init
    X[candidate_ids], X, Y_norm_squared=x_squared_norms, squared=True)
IndexError: index 100000 is out of bounds for axis 0 with size 100000
Trying to run MiniBatchKMeans() with the following settings:
MiniBatchKMeans(n_clusters=k_init, init='k-means++', max_iter=25, batch_size=10000, verbose=1, compute_labels=False, random_state=123, tol=0.0, max_no_improvement=100, init_size=100000, n_init=10, reassignment_ratio=0.01)
I checked, and the k_means_.py file in use does contain the stable_cumsum() implementation.
Any thoughts on how to resolve this issue?
I am new to posting here, so I apologize in advance for formatting mistakes.
Many thanks.
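One possible workaround to try, sketched here as an assumption based on the float32/cumsum precision discussion in this thread (not a confirmed fix), using a small random stand-in array rather than the full (175782224, 5) table:

```python
# Workaround sketch (assumption): cast the float32 data to float64
# before fitting, so internal cumulative sums keep full precision.
# Note the cast doubles memory usage, which matters at this scale.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(123)
X = rng.rand(2000, 5).astype(np.float32)  # stand-in for the HDF5 table

k_means = KMeans(n_clusters=5, n_init=10, random_state=123)
k_means.fit(X.astype(np.float64))  # cast up before fitting
print(k_means.labels_.shape)  # (2000,)
```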
Please open a new issue. One of the best ways to get good feedback is to provide a fully standalone snippet reproducing the problem.
This could be an issue with the cumsum implementation. Is your data float32 by any chance? Can you try running with master, where we recently fixed the precision in cumsum?
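To illustrate why float32 can matter here, a minimal sketch of the suspected mechanism (an assumption on my part, not sklearn's actual code: k-means++ seeding samples candidate indices roughly via a cumulative sum plus searchsorted):

```python
# With float32, rounding drift in the cumulative sum means a sampled
# value can land past the last entry; searchsorted then returns
# len(a), an out-of-bounds index -- the same shape of error as the
# tracebacks above.
import numpy as np

n = 100000
w = np.random.RandomState(0).rand(n).astype(np.float32)

c32 = np.cumsum(w)                     # accumulated in float32, drifts
c64 = np.cumsum(w.astype(np.float64))  # float64 stays close to the true total
print(abs(float(c32[-1]) - c64[-1]))   # the accumulated rounding error

# A value past c32[-1] maps to index n, invalid for an array of length n:
idx = np.searchsorted(c32, c32[-1] + 1.0)
print(idx)  # 100000, i.e. one past the last valid index
```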
If that doesn’t help, could you please report
Thanks!
If you need a quick fix, you can change init from 'k-means++' to 'random'. That will probably lead to worse results, though.
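The suggested quick fix as a standalone sketch, on small synthetic data standing in for the original array:

```python
# init='random' skips the k-means++ seeding path that triggers the
# IndexError; as noted above, results are usually somewhat worse.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(10)
X = rng.rand(1000, 5).astype(np.float32)

k_means = KMeans(n_clusters=5, init='random', n_init=10, random_state=10)
k_means.fit(X)
print(k_means.labels_.shape)  # (1000,)
```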