scikit-learn: BallTree and KDTree query radius out of memory (optimisation or bug)
Describe the bug
The BallTree and KDTree have a query_radius method that returns all points within a given radius. I have noticed that when you query point by point in a loop there is no issue with memory allocation, but when you query with the entire array at once it saturates the memory. I am not sure whether this is a memory leak or whether it simply needs some form of optimization.
Steps/Code to Reproduce
Example:
import gc
import sklearn
from sklearn.neighbors import BallTree
import numpy as np
from memory_profiler import memory_usage
import psutil

def get_memory_usage():
    return memory_usage(timeout=0.2)[0]
vmem = psutil.virtual_memory()
print('sklearn:', sklearn.__version__)
print('Initial memory usage: %.2f MB' % get_memory_usage())
print('RAM available: %.2f GB' % (vmem.available/1e9))
N = 60000
X = np.random.rand(N, 2) # Random data
print('Allocated X: %.2f MB' % get_memory_usage())
print('X.nbytes == %.2f MB' % (X.nbytes / 1e6))
print('X min %.2f' % X.min())
print('X max %.2f' % X.max())
Now let's build the tree and try to find all neighbours:
from sklearn.neighbors import BallTree
tree = BallTree(X)

# query via all elements crashes memory
ind = tree.query_radius(X, r=0.5)
That will exhaust all your memory, increase or decrease N for testing.
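One way to keep peak memory bounded is to query in fixed-size chunks, or, if you only need per-point neighbour counts, to skip materialising the index arrays entirely with query_radius's count_only flag. A minimal sketch (the chunk size, dataset size, and variable names here are my own, not from the issue):

```python
# Sketch: query in fixed-size chunks so only one chunk's results are
# alive at a time; count_only=True avoids building the index arrays.
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 2))
tree = BallTree(X)

chunk = 1_000
counts = np.empty(len(X), dtype=np.int64)
for start in range(0, len(X), chunk):
    stop = min(start + chunk, len(X))
    # returns one count per query point instead of one index array per point
    counts[start:stop] = tree.query_radius(X[start:stop], r=0.5,
                                           count_only=True)
```

Every point is within r=0.5 of itself, so each count is at least 1.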
Expected Results
If I query via iteration it all works well:
# query by iteration over all elements
all = []
for i in range(X.shape[0]):
    ind = tree.query_radius(X[i:i+1], r=0.5)
    found = ind[0].shape[0]
    all.append(found)
    assert found > 0
print('Allocated with tree: %.2f MB' % get_memory_usage())
About 21 seconds on Colab and around 10 MB additional memory.
Actual Results
Out of memory on a Colab instance with 13 GB of RAM. Please adjust the N parameter in case you have a larger setup.
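A back-of-envelope estimate (my own calculation, not from the issue) suggests the full-array query really should need on the order of the whole 13 GB: with r=0.5 on the unit square, roughly half of all point pairs are neighbours, and each pair costs one int64 index.

```python
# For two uniform points in the unit square, the known distance CDF is
#   P(dist < d) = pi*d^2 - (8/3)*d^3 + d^4/2   (for 0 <= d <= 1)
import math

d = 0.5
N = 60_000
p = math.pi * d**2 - (8/3) * d**3 + d**4 / 2   # ~0.48
avg_neighbours = p * N                          # ~29000 per point
total_indices = p * N * N                       # (point, neighbour) pairs
gb = total_indices * 8 / 1e9                    # int64 indices -> bytes -> GB
print('avg neighbours per point: %.0f' % avg_neighbours)
print('estimated index memory: %.1f GB' % gb)
```

That lands at roughly 14 GB of index storage alone, which matches the crash on a 13 GB machine, so this looks like expected allocation rather than a leak.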
Versions
System: python: 3.7.11 (default, Jul 3 2021, 18:01:19) [GCC 7.5.0] executable: /usr/bin/python3 machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies: pip: 21.1.3 setuptools: 57.2.0 sklearn: 0.22.2.post1 numpy: 1.19.5 scipy: 1.4.1 Cython: 0.29.23 pandas: 1.1.5 matplotlib: 3.2.2 joblib: 1.0.1
Built with OpenMP: True
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 15 (8 by maintainers)
Not sure, because as you can see in this loop, the index of the next sample to consider is discovered dynamically. So computing the neighborhoods on the fly would no longer benefit from chunking contiguous data in parallel via threads.
Sure, when I come back from holiday. The algorithm is basically identical except that it does not keep all the ids; I can show how the memory consumption grows with increasing dataset size.
Hi @jjerphan, I wrote a proof of concept to show that it is possible, but as mentioned it requires working at the tree level because of the missing functionality. This class is somewhat inspired by the current implementation, but simplified to remove Cython for ease of replication.
Happy to help.
Because you throw away the results as you go instead of keeping them in memory: you only keep ind[0].shape[0] instead of keeping the list of all ind arrays.
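To make that comparison concrete, here is a small sketch (a smaller dataset and my own variable names) that keeps both the counts and the full index arrays in the same loop, then compares their footprints: the counts are O(N), while the arrays grow with the total number of neighbour pairs.

```python
# Keeping only the counts is cheap; keeping every index array is what
# blows up the memory for large N.
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((2_000, 2))
tree = BallTree(X)

counts = []   # one int per query point
arrays = []   # one index array per query point
for i in range(len(X)):
    ind = tree.query_radius(X[i:i + 1], r=0.5)
    counts.append(ind[0].shape[0])
    arrays.append(ind[0])

count_bytes = len(counts) * 8                 # counts as int64
array_bytes = sum(a.nbytes for a in arrays)   # all neighbour indices
print('counts: %d bytes, arrays: %d bytes' % (count_bytes, array_bytes))
```

With r=0.5 each point has on the order of a thousand neighbours here, so the arrays cost hundreds of times more than the counts.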