nmslib: Incorrect distances returned for all-zero query
For an all-zero query vector, NMSLib incorrectly reports a distance of zero to each of its nearest neighbours (see example below). Is this related to #187? Is there a suggested workaround?
# Training set (CSR sparse matrix)
X.todense()
# Out:
# matrix([[4., 2., 3., 1., 0., 0., 0., 0., 0.],
#         [2., 1., 0., 0., 3., 0., 1., 2., 1.],
#         [4., 2., 0., 0., 3., 1., 0., 0., 0.]], dtype=float32)
# Query vector (CSR sparse matrix)
r.todense()
# Out:
# matrix([[0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
# Train and query
import nmslib
index = nmslib.init(
    method='hnsw',
    space='cosinesimil_sparse_fast',
    data_type=nmslib.DataType.SPARSE_VECTOR,
    dtype=nmslib.DistType.FLOAT)
index.addDataPointBatch(X)
index.createIndex()
index.knnQueryBatch(r, k=3)
# Out:
# [(array([2, 1, 0], dtype=int32), array([0., 0., 0.], dtype=float32))]
# Note that distances are all 0, which is incorrect!
# Same result for dense training & query vectors.
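Until this is resolved, a caller-side stopgap might be to overwrite the distances for any all-zero query row after the search. The following is only a sketch under assumptions, not a maintainer-recommended fix: `patch_zero_query_distances`, `query_nnz`, and `fill` are hypothetical names, and the `results` list is a plain-Python stand-in for the `knnQueryBatch` output shown above.

```python
def patch_zero_query_distances(results, query_nnz, fill=1.0):
    """Replace distances for all-zero query rows with `fill`.

    results   -- list of (ids, distances) pairs, one per query row
    query_nnz -- number of non-zero entries in each query row
    fill      -- value reported instead of 0 (1.0 means "orthogonal";
                 float('nan') is another option)
    """
    patched = []
    for (ids, dists), nnz in zip(results, query_nnz):
        if nnz == 0:
            # Cosine distance is undefined for a zero vector, so do not
            # trust the library's value for this row.
            dists = [fill] * len(dists)
        patched.append((ids, dists))
    return patched


# Hypothetical stand-in for the output of index.knnQueryBatch(r, k=3)
results = [([2, 1, 0], [0.0, 0.0, 0.0])]
print(patch_zero_query_distances(results, query_nnz=[0]))
# [([2, 1, 0], [1.0, 1.0, 1.0])]
```

With a SciPy CSR query matrix, the per-row counts could be taken from `r.getnnz(axis=1)`, ideally after calling `r.eliminate_zeros()` so that explicitly stored zeros are not counted.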
About this issue
- State: closed
- Created 6 years ago
- Comments: 24 (13 by maintainers)
Commits related to this issue
- We make cosine similarity eq. 1 if one vector is zero #321 — committed to nmslib/nmslib by searchivarius 4 years ago
- Bugfixes for #321 and #333. — committed to nmslib/nmslib by searchivarius 4 years ago
- additional tests for #336 #333 #321 — committed to nmslib/nmslib by searchivarius 4 years ago
@yurymalkov Sklearn returns a distance of 1 for an all-zero query vector (see example below). This is because they choose not to normalize the query vector if its norm is zero [1]. As a result, the inner product between the training vectors and query vector is zero, and the distance returned is 1. I think NaN is the correct result, but would still prefer receiving an “orthogonal” distance of 1 over a “perfect match” distance of 0.
[1] https://github.com/scikit-learn/scikit-learn/blob/0.19.1/sklearn/preprocessing/data.py#L61
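The convention described above can be illustrated without scikit-learn. This is a minimal pure-Python sketch, not scikit-learn's actual code: `cosine_distance` and `unit` are hypothetical helpers that leave a zero-norm vector unnormalized, and the vectors are the training set and query from the original report.

```python
import math


def cosine_distance(u, v):
    """Cosine distance where a zero-norm vector is left unnormalized."""
    def unit(x):
        norm = math.sqrt(sum(a * a for a in x))
        if norm == 0.0:
            norm = 1.0  # assumption: skip normalization for all-zero vectors
        return [a / norm for a in x]

    u, v = unit(u), unit(v)
    # The inner product with an all-zero vector is 0, so the distance is 1.
    return 1.0 - sum(a * b for a, b in zip(u, v))


X = [[4, 2, 3, 1, 0, 0, 0, 0, 0],
     [2, 1, 0, 0, 3, 0, 1, 2, 1],
     [4, 2, 0, 0, 3, 1, 0, 0, 0]]
r = [0] * 9

print([cosine_distance(r, row) for row in X])  # [1.0, 1.0, 1.0]
```

Under this convention a zero query is reported as orthogonal to everything rather than as a perfect match. Returning NaN instead would only require replacing the `norm = 1.0` branch with an early `return float('nan')` from `cosine_distance`.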