scikit-learn: NearestNeighbors radius_neighbors memory leaking
Description
NearestNeighbors uses a large chunk of memory at run time and does not release it, even after calling del or assigning an empty array to the object variable. The memory is, however, released once the Python process terminates.
I noticed the bug while trying to fit a DBSCAN model. Looking deeper into the issue, I was able to reproduce the same memory leak with random data using the NearestNeighbors.radius_neighbors() method.
The memory leaked on a machine with 12 GB of RAM was 10%.
Steps/Code to Reproduce
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.random.rand(20000, 2) # Random data
neighbors_model = NearestNeighbors()
neighbors_model.fit(X)
neighborhoods = neighbors_model.radius_neighbors(X, 0.5, return_distance=False)
del neighborhoods
del neighbors_model
del X
while True:  # keep the process alive so its memory usage can be inspected
    pass
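To put a number on "not released", the resident set size (RSS) of the process can be recorded after each step of the script above. The following is a minimal sketch, not part of the original report: it assumes Linux (it reads /proc/self/statm), and the names rss_mib and rss_trace are hypothetical helpers. The workload shown is a stand-in buffer so the sketch runs without scikit-learn.

```python
import gc
import os


def rss_mib():
    """Return this process's resident set size in MiB (Linux only)."""
    # /proc/self/statm reports sizes in pages; the second field is resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20


def rss_trace(steps):
    """Run each (label, callable) step in order and record RSS after it."""
    trace = [("start", rss_mib())]
    for label, step in steps:
        step()
        gc.collect()  # rule out garbage that is merely awaiting collection
        trace.append((label, rss_mib()))
    return trace


# Stand-in workload (hypothetical): allocate and then drop a 64 MiB buffer.
holder = {}
trace = rss_trace([
    ("allocate", lambda: holder.update(buf=bytearray(64 * 2**20))),
    ("release", holder.clear),
])
for label, mib in trace:
    print("{:<10s}{:8.1f} MiB".format(label, mib))
```

To apply this to the report, the stand-in steps would be replaced by the fit, radius_neighbors and del statements, once per algorithm setting. An RSS that stays high after the final step shows that memory was not returned to the operating system; on its own it does not show where the memory is held, since the allocator may also retain freed memory.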
The algorithm settings that produced the leak were:
- auto
- ball_tree
- kd_tree
With algorithm='brute' the memory leak did not happen, but when the data size was increased by a factor of 10 the following error occurred:
Expected Results
MemoryError Traceback (most recent call last)
<ipython-input-14-df89794bb3b3> in <module>()
----> 1 neighborhoods = neighbors_model.radius_neighbors(X, 0.5, return_distance=False)
/usr/local/lib/python3.5/dist-packages/sklearn/neighbors/base.py in radius_neighbors(self, X, radius, return_distance)
588 if self.effective_metric_ == 'euclidean':
589 dist = pairwise_distances(X, self._fit_X, 'euclidean',
--> 590 n_jobs=self.n_jobs, squared=True)
591 radius *= radius
592 else:
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1245 func = partial(distance.cdist, metric=metric, **kwds)
1246
-> 1247 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1248
1249
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1088 if n_jobs == 1:
1089 # Special case to avoid picklability checks in delayed
-> 1090 return func(X, Y, **kwds)
1091
1092 # TODO: in some cases, backend='threading' may be appropriate
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
244 YY = row_norms(Y, squared=True)[np.newaxis, :]
245
--> 246 distances = safe_sparse_dot(X, Y.T, dense_output=True)
247 distances *= -2
248 distances += XX
/usr/local/lib/python3.5/dist-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
MemoryError:
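The MemoryError above is raised at np.dot(a, b), which materialises the full query-by-fitted distance matrix. A rough size check shows why this cannot fit in 12 GB. The sketch below assumes float64 entries (what np.random.rand produces) and assumes that "a factor of 10" means 200000 rows; dense_distance_matrix_bytes is a hypothetical helper, not a scikit-learn function.

```python
def dense_distance_matrix_bytes(n_queries, n_fitted, itemsize=8):
    """Bytes needed for a dense (n_queries x n_fitted) matrix of float64."""
    return n_queries * n_fitted * itemsize


GIB = 2**30

original = dense_distance_matrix_bytes(20000, 20000)
scaled = dense_distance_matrix_bytes(200000, 200000)

print("20000 points:  {:.2f} GiB".format(original / GIB))   # 2.98 GiB
print("200000 points: {:.2f} GiB".format(scaled / GIB))     # 298.02 GiB
```

Under these assumptions the brute-force path needs about 3 GiB for the original data, which fits, and about 298 GiB after the tenfold increase, which does not. The MemoryError is therefore expected behaviour for the dense computation rather than evidence of the leak.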
Versions
- Linux-4.13.0-39-generic-x86_64-with-Ubuntu-16.04-xenial
- Python 3.5.2 (default, Nov 23 2017, 16:37:01)
- [GCC 5.4.0 20160609]
- NumPy 1.14.3
- SciPy 1.0.1
- Scikit-Learn 0.19.1
About this issue
- State: closed
- Created 6 years ago
- Comments: 21 (19 by maintainers)
Sorry, I shouldn’t have let #11056 close this. I think that solved a separate issue.