nmslib: Incorrect distances returned for all-zero query

Querying with an all-zero vector causes NMSLib to report a distance of zero to each of its nearest neighbours, which is incorrect (see the example below). Is this related to #187? Is there a suggested workaround?

import numpy as np
from scipy.sparse import csr_matrix

# Training set (CSR sparse matrix)
X = csr_matrix(np.array([
    [4., 2., 3., 1., 0., 0., 0., 0., 0.],
    [2., 1., 0., 0., 3., 0., 1., 2., 1.],
    [4., 2., 0., 0., 3., 1., 0., 0., 0.]], dtype=np.float32))

# Query vector (CSR sparse matrix, all zeros)
r = csr_matrix(np.zeros((1, 9), dtype=np.float32))

# Train and query
import nmslib
index = nmslib.init(
    method='hnsw',
    space='cosinesimil_sparse_fast',
    data_type=nmslib.DataType.SPARSE_VECTOR,
    dtype=nmslib.DistType.FLOAT)
index.addDataPointBatch(X)
index.createIndex()
index.knnQueryBatch(r, k=3)
# Out:
# [(array([2, 1, 0], dtype=int32), array([0., 0., 0.], dtype=float32))]

# Note that distances are all 0, which is incorrect!
# Same result for dense training & query vectors.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 24 (13 by maintainers)

Most upvoted comments

@yurymalkov Sklearn returns a distance of 1 for an all-zero query vector (see example below). This is because they choose not to normalize the query vector if its norm is zero [1]. As a result, the inner product between the training vectors and query vector is zero, and the distance returned is 1. I think NaN is the correct result, but would still prefer receiving an “orthogonal” distance of 1 over a “perfect match” distance of 0.

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.array([
    [4., 2., 3., 1., 0., 0., 0., 0., 0.],
    [2., 1., 0., 0., 3., 0., 1., 2., 1.],
    [4., 2., 0., 0., 3., 1., 0., 0., 0.]], dtype=np.float32)
r = np.array([
    [0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=np.float32)

nn = NearestNeighbors(n_neighbors=3, algorithm='brute', metric='cosine')
nn.fit(X)
distances, indices = nn.kneighbors(r)
distances
# Prints: array([[1., 1., 1.]], dtype=float32)
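
The behaviour described in this comment can be reproduced in plain NumPy. The following is only a sketch of the idea (leave zero-norm rows unnormalized so that their inner product with anything is 0), not scikit-learn's actual implementation; the helper names `safe_normalize` and `cosine_distances` are made up for illustration.

```python
import numpy as np

def safe_normalize(a):
    """L2-normalize each row; rows with zero norm are left as all zeros."""
    norms = np.linalg.norm(a, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid 0/0; a zero row divided by 1 stays zero
    return a / norms

def cosine_distances(queries, data):
    """Cosine distance as 1 - (inner product of row-normalized inputs)."""
    return 1.0 - safe_normalize(queries) @ safe_normalize(data).T

X = np.array([
    [4., 2., 3., 1., 0., 0., 0., 0., 0.],
    [2., 1., 0., 0., 3., 0., 1., 2., 1.],
    [4., 2., 0., 0., 3., 1., 0., 0., 0.]], dtype=np.float32)
r = np.zeros((1, 9), dtype=np.float32)

d = cosine_distances(r, X)  # every entry is 1 - 0 = 1 for the all-zero query
```

With this convention an all-zero query is treated as "orthogonal" to everything, rather than as a perfect match.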

[1] https://github.com/scikit-learn/scikit-learn/blob/0.19.1/sklearn/preprocessing/data.py#L61
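
Until the library handles this case itself, one possible workaround is to post-process the query results on the caller's side. The sketch below assumes the results have the shape shown in the issue output (a list of `(ids, distances)` tuples, one per query row); `mask_zero_queries` and its `fill` parameter are hypothetical names, not part of NMSLib.

```python
import numpy as np

def mask_zero_queries(queries, results, fill=1.0):
    """Overwrite the distances of all-zero query rows with `fill` (hypothetical helper)."""
    # Row-wise sum of absolute values; works for dense arrays and SciPy sparse matrices.
    row_mass = np.asarray(abs(queries).sum(axis=1)).ravel()
    fixed = []
    for is_zero, (ids, dists) in zip(row_mass == 0, results):
        if is_zero:
            dists = np.full_like(dists, fill)  # e.g. 1.0 ("orthogonal") or np.nan
        fixed.append((ids, dists))
    return fixed

# Query and result copied from the issue; no NMSLib call is made here.
r = np.zeros((1, 9), dtype=np.float32)
results = [(np.array([2, 1, 0], dtype=np.int32),
            np.array([0., 0., 0.], dtype=np.float32))]

fixed = mask_zero_queries(r, results)           # distances become 1.0
fixed_nan = mask_zero_queries(r, results, fill=np.nan)  # or NaN, if preferred
```

The neighbour ids returned for an all-zero query carry no meaning either way, so a caller may prefer to drop such rows before querying instead.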