scikit-learn: KMeans IndexError when using >4 clusters

Description

Fitting a (20242848, 5) NumPy array with more than 4 clusters results in an IndexError.

Steps/Code to Reproduce

from osgeo import gdal  # GDAL reads the multi-band raster
from sklearn.cluster import KMeans

src = gdal.Open(fname)   # fname: path to a 5-band raster file
img = src.ReadAsArray()  # shape (bands, rows, cols)
X = img.transpose(1, 2, 0).reshape(-1, img.shape[0])  # (pixels, bands)
print(X.shape)

(20242848, 5)
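For context, the transpose/reshape above turns GDAL's (bands, rows, cols) array into a (pixels, bands) feature matrix. A tiny self-contained sketch of the same operation with made-up shapes:

import numpy as np

# (bands, rows, cols) -> (rows*cols, bands): one sample per pixel
img = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # 2 bands, 3x4 raster
X = img.transpose(1, 2, 0).reshape(-1, img.shape[0])
print(X.shape)  # (12, 2)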

k_means = KMeans(n_clusters=4, n_jobs=-1, random_state=10)
k_means.fit(X)
labels = k_means.labels_

print(labels.shape)

(20242848,)

k_means = KMeans(n_clusters=5, n_jobs=-1, random_state=10)
k_means.fit(X)
labels = k_means.labels_

print(labels.shape)
---------------------------------------------------------------------------
JoblibIndexError                          Traceback (most recent call last)
<ipython-input-22-379757258e42> in <module>()
      1 k_means = KMeans(n_clusters=5, n_jobs=-1, random_state=10)
----> 2 k_means.fit(X)
      3 labels = k_means.labels_
      4 
      5 print labels.shape

[SHORTENED]

IndexError: index 20242848 is out of bounds for axis 0 with size 20242848
___________________________________________________________________________

Expected Results

(20242848,)

Actual Results

IndexError: index 20242848 is out of bounds for axis 0 with size 20242848 (full traceback above)

Versions

Windows-7-6.1.7601-SP1
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
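For anyone without the original raster, a GDAL-free repro sketch using synthetic data (sizes mirror the report; since the failure appears to be precision-dependent, it may not trigger on every input):

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the raster-derived feature matrix
rng = np.random.RandomState(10)
X = rng.rand(20242848, 5).astype(np.float32)

k_means = KMeans(n_clusters=5, n_jobs=-1, random_state=10)
k_means.fit(X)  # raises IndexError in _k_init on affected versions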

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 25 (14 by maintainers)

Most upvoted comments

The issue appears to persist in 0.19.1

Running on a float32 HDF5 data table (opened with PyTables) of shape (175782224, 5), I got the following error after several successful initializations:

Init 1/10 with method: k-means++
Inertia for init 1/10: 11.480348
Init 2/10 with method: k-means++
Inertia for init 2/10: 11.541335
Init 3/10 with method: k-means++
Inertia for init 3/10: 11.595790
Init 4/10 with method: k-means++
Inertia for init 4/10: 11.604136
Init 5/10 with method: k-means++
Inertia for init 5/10: 11.635662
Init 6/10 with method: k-means++
Inertia for init 6/10: 11.514906
Init 7/10 with method: k-means++
Traceback (most recent call last):
  File "", line 16, in <module>
  File "C:\Python27\ArcGISx6410.5\lib\site-packages\sklearn\cluster\k_means_.py", line 1418, in fit
    init_size=init_size)
  File "C:\Python27\ArcGISx6410.5\lib\site-packages\sklearn\cluster\k_means_.py", line 684, in _init_centroids
    x_squared_norms=x_squared_norms)
  File "C:\Python27\ArcGISx6410.5\lib\site-packages\sklearn\cluster\k_means_.py", line 113, in _k_init
    X[candidate_ids], X, Y_norm_squared=x_squared_norms, squared=True)
IndexError: index 100000 is out of bounds for axis 0 with size 100000

Trying to run MiniBatchKMeans() with the following settings:

MiniBatchKMeans(n_clusters=k_init, init='k-means++', max_iter=25,
                batch_size=10000, verbose=1, compute_labels=False,
                random_state=123, tol=0.0, max_no_improvement=100,
                init_size=100000, n_init=10, reassignment_ratio=0.01)

I checked, and the installed k_means_.py is the version that uses stable_cumsum().

Any thoughts on how to resolve this issue?

I am new to posting here, so I apologize in advance for formatting mistakes.

Many thanks.

Please open a new issue. One of the best ways to get good feedback is to provide a fully standalone snippet reproducing the problem.

This could be an issue with the cumsum implementation. Is your data float32 by any chance? Can you try running with master, where we recently fixed the precision in cumsum?

If that doesn’t help, could you please report

print(np.sum(closest_dist_sq))
print(closest_dist_sq.cumsum()[-1])
print(rand_vals)

Thanks!
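To illustrate the float32 cumsum precision point above, a minimal sketch (not scikit-learn's code): on large arrays a running float32 sum drifts, so cumsum()[-1] can disagree with sum(), which is how a sampled value can map to an out-of-range index during k-means++ initialization. stable_cumsum in newer scikit-learn accumulates in float64 to avoid this.

import numpy as np

# float32 cumsum accumulates rounding error on large arrays,
# so the last cumulative value can differ from the true total.
d = np.full(10000000, 0.1, dtype=np.float32)
print(d.sum())                             # pairwise summation, more accurate
print(d.cumsum()[-1])                      # naive running float32 sum, drifts
print(np.cumsum(d, dtype=np.float64)[-1])  # accumulate in float64 instead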

FYI, if you need a quick fix, you can change init from 'k-means++' to 'random' (sketched below). That will probably lead to worse results, though.
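A minimal sketch of that workaround (parameter values are illustrative):

from sklearn.cluster import KMeans

# init='random' skips the k-means++ sampling path where the IndexError occurs
k_means = KMeans(n_clusters=5, init='random', n_init=10, random_state=10)
k_means.fit(X)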