scikit-learn: BallTree and KDTree query_radius out of memory (optimisation or bug)

Describe the bug

BallTree and KDTree have a query_radius method that returns all points within a given radius of each query point. I have noticed that when I query point by point in a loop there is no issue with memory allocation, but when I query with the entire array in a single call it saturates the memory. I am not sure whether this is a memory leak or simply requires some form of optimisation.

Steps/Code to Reproduce

Example:

import gc

import sklearn
from sklearn.neighbors import BallTree
import numpy as np
from memory_profiler import memory_usage
import psutil

def get_memory_usage():
    return memory_usage(timeout=0.2)[0]

vmem = psutil.virtual_memory()
print('sklearn:', sklearn.__version__)
print('Initial memory usage: %.2f MB' % get_memory_usage())
print('RAM available: %.2f GB' % (vmem.available/1e9))

N = 60000
X = np.random.rand(N, 2)  # Random data

print('Allocated X: %.2f MB' % get_memory_usage())
print('X.nbytes == %.2f MB' % (X.nbytes / 1e6))
print('X min %.2f' % X.min())
print('X max %.2f' % X.max())

Now let’s try to find all neighbours:

from sklearn.neighbors import BallTree

# build the tree (construction was omitted from the snippet above)
tree = BallTree(X)

# querying all elements in a single call exhausts memory
ind = tree.query_radius(X, r=0.5)

That will exhaust all your memory; increase or decrease N for testing.
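One way to keep the one-shot call bounded (my own workaround, not from the report) is to query in fixed-size chunks and reduce each chunk before the next one is materialised:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((2000, 2))  # smaller N than the report so the sketch runs fast
tree = BallTree(X)

# Only one chunk's worth of neighbour-index arrays is alive at a time,
# so peak memory is bounded by the chunk size rather than by N.
chunk = 500
counts = np.empty(len(X), dtype=np.intp)
for start in range(0, len(X), chunk):
    ind = tree.query_radius(X[start:start + chunk], r=0.5)
    counts[start:start + chunk] = [a.shape[0] for a in ind]
```

The chunk size trades peak memory against per-call overhead; anything well below N keeps the allocation flat.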

Expected Results

If I query via iteration it all works well:

# query all elements, one at a time
counts = []
for i in range(X.shape[0]):
    ind = tree.query_radius(X[i:i+1], r=0.5)
    found = ind[0].shape[0]
    counts.append(found)
    assert found > 0
print('Allocated with tree: %.2f MB' % get_memory_usage())

This takes about 21 seconds on Colab and uses around 10 MB of additional memory.
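Note that when only the neighbour counts are needed (as in the loop above), query_radius can skip building the per-point index arrays entirely via its count_only flag:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((2000, 2))
tree = BallTree(X)

# count_only=True returns an array of neighbour counts instead of
# allocating one index array per query point.
counts = tree.query_radius(X, r=0.5, count_only=True)
```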

Actual Results

Out of memory on a Colab setup with 13 GB of RAM. Please adjust the N parameter if you have a larger setup.

Versions

System:
    python: 3.7.11 (default, Jul 3 2021, 18:01:19) [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
       pip: 21.1.3
setuptools: 57.2.0
   sklearn: 0.22.2.post1
     numpy: 1.19.5
     scipy: 1.4.1
    Cython: 0.29.23
    pandas: 1.1.5
matplotlib: 3.2.2
    joblib: 1.0.1

Built with OpenMP: True

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Yes, looking at dbscan_inner, we do not need to return the neighbors. I don’t know much about DBSCAN, but could the computations be streamed in pairwise distance reductions (cc @ogrisel)?

Not sure, because as you can see in this loop, the index of the next sample to consider is discovered dynamically. So computing the neighborhoods on the fly would no longer benefit from chunking contiguous data in parallel via threads.

Sure, when I come back from holiday. The algorithm is basically identical except for not keeping all the ids; I can show how the memory consumption grows with increasing dataset size.

Hi @jjerphan, I wrote a proof of concept to show that it is possible, but as mentioned it requires working at the tree level because of the missing functionality. The class is somewhat inspired by the current implementation but simplified to remove Cython for ease of replication.

Happy to help.

A question out of ignorance: why does my naive loop not reach 28 GB? I am iterating through all the elements, right?

Because you throw away the results as you go instead of keeping them in memory: you only keep ind[0].shape[0] instead of the full list of ind arrays.
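For scale, here is a back-of-the-envelope estimate (my numbers, ignoring edge effects) of what keeping every ind array would cost with N = 60000 and r = 0.5 on the unit square:

```python
import numpy as np

N = 60_000
r = 0.5
# A disc of radius r covers at most pi * r**2 of the unit square, so each
# point has at most about that fraction of all N points as neighbours.
frac = np.pi * r ** 2            # ~0.785, an upper bound ignoring edges
total_indices = N * N * frac     # neighbour indices stored across all points
gb = total_indices * 8 / 1e9     # 8 bytes per int64 index
print('%.1f GB' % gb)            # roughly 22 GB just for the indices
```

Edge effects bring the true figure down somewhat, but it stays in the tens of gigabytes, which is consistent with the one-shot call exhausting a 13 GB machine.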