scikit-learn: IndexError thrown during kmeans init
Description
IndexError thrown during kmeans init.
NumPy’s searchsorted function is being called here, and then the result is used to index into an array here. However, as per the searchsorted documentation:
If there is no suitable index, return either 0 or N (where N is the length of
a
).
It is possible that N can be returned, thus causing the later index to throw an IndexError.
Steps/Code to Reproduce
The below code should trigger it but often doesn’t. I’m unable to share the dataset I’m using that causes it on almost every call.
from sklearn.cluster import MiniBatchKMeans
import numpy as np
Xtr = np.random.rand(100000, 10)
for _ in range(10):
try:
km = MiniBatchKMeans(n_clusters=20000,
init_size=60000,
verbose=1).fit(Xtr)
except Exception as exp:
print exp
Expected Results
No error is thrown. With verbose mode on:
Init 1/3 with method: k-means++
Inertia for init 1/3: 0.001478
Init 2/3 with method: k-means++
Inertia for init 2/3: 0.002398
Init 3/3 with method: k-means++
Inertia for init 3/3: 0.001491
Actual Results
(Taken from internal code and not the example snippet)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-20-a8dbf5b11047> in <module>()
2 features = [8]#range(12, 6, -1)
3 np.random.seed(100)
----> 4 train_errs, valid_errs, aucs = knnResults(X_train, Y_train, k_vals, features)
5
6 mterr = train_errs.mean(axis=2)
<ipython-input-19-16ad5376cf30> in knnResults(X, Y, k_vals, features)
62 max_no_improvement=30,
63 reassignment_ratio=0.04,
---> 64 verbose=1).fit(Xtr)
65 #except Exception as exp:
66 # print exp
/usr/lib/python2.7/site-packages/sklearn/cluster/k_means_.pyc in fit(self, X, y)
1379 random_state=random_state,
1380 x_squared_norms=x_squared_norms,
-> 1381 init_size=init_size)
1382
1383 # Compute the label assignment on the init dataset
/usr/lib/python2.7/site-packages/sklearn/cluster/k_means_.pyc in _init_centroids(X, k, init, random_state, x_squared_norms, init_size)
679 if isinstance(init, string_types) and init == 'k-means++':
680 centers = _k_init(X, k, random_state=random_state,
--> 681 x_squared_norms=x_squared_norms)
682 elif isinstance(init, string_types) and init == 'random':
683 seeds = random_state.permutation(n_samples)[:k]
/usr/lib/python2.7/site-packages/sklearn/cluster/k_means_.pyc in _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials)
112 # Compute distances to center candidates
113 distance_to_candidates = euclidean_distances(
--> 114 X[candidate_ids], X, Y_norm_squared=x_squared_norms, squared=True)
115
116 # Decide which candidate is the best
IndexError: index 60000 is out of bounds for axis 0 with size 60000
Versions
>>> import platform; print(platform.platform())
Linux-4.9.6-1-ARCH-x86_64-with-glibc2.2.5
>>> import sys; print("Python", sys.version)
('Python', '2.7.13 (default, Dec 21 2016, 07:16:46) \n[GCC 6.2.1 20160830]')
>>> import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.12.0')
>>> import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.18.1')
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
('Scikit-Learn', '0.18.1')
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 20 (12 by maintainers)
Commits related to this issue
- FIX IndexError due to imprecision in KMeans++ Fixes #8583 I'm not sure how to test this. — committed to jnothman/scikit-learn by jnothman 6 years ago
The fix has been merged in master a few days ago. It’s not part of any release yet. It should be in both 0.20.4 and 0.21.3 incoming releases
I can see the source of the issue in the code.
It comes down to numerical instability:
We do:
This should be equivalent (apart from random order) to:
I think what’s happening here is that
rand_vals
can, due to numerical imprecision issues, slightly exceedstable_cumsum(closest_dist_sq)[-1]
(which should be== closest_dist_sq.sum()
). As a result,searchsorted
returnslen(closest_dist_sq)
, resulting in the error.We could change the random behaviour by using
random_state.choice
(but this might be avoided for backwards compatibility), or else we should simply clip the results of searchsorted.