usearch: Bug: Inconsistencies in Cosine Distance Calculation for Near-Zero Length Vectors

Describe the bug

I observed inconsistent cosine distances when working with vectors of very small magnitude. After adding such vectors to an index and running a search, the cosine distance between the query vector and these near-zero vectors varies unpredictably: the reported values are 0, 1, or infinity.

Environment Details:

  • Usearch Version: 2.8.14
  • Python Version: 3.10.12
  • OS: Ubuntu 22.04.3 LTS
  • CPU: Intel® Xeon® W-2235 CPU @ 3.80GHz

Steps to reproduce

  1. Create an index with 2-dimensional vectors and the 'cos' metric.
  2. Add vectors of the form [1.0e-x, 0.0] for x ranging from 10 to 29 to the index.
  3. Perform a search with the query vector [1.0, 0.0].

Code Example:

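import numpy as np
from usearch.index import Index, Matches

# 2-dimensional index using cosine distance, stored as float32
index = Index(ndim=2, metric='cos', dtype='f32')

# Add [1e-10, 0], [1e-11, 0], ..., [1e-29, 0]; the key is the exponent
kvs = {}
for k in range(10, 30):
    v = np.array([float("1.0e-%d" % k), 0.0])
    kvs[k] = v
    index.add(k, v)

# Every stored vector points in the same direction as the query
query = np.array([1.0, 0.0])

# Ask for more matches than there are entries so that all of them are returned
matches: Matches = index.search(query, 100)
for m in matches:
    print("distance %s to %s %s: %g" % (query, m.key, kvs[m.key], m.distance))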

Execution Example:

distance [1. 0.] to 10 [1.e-10 0.e+00]: 0
distance [1. 0.] to 11 [1.e-11 0.e+00]: 0
...
distance [1. 0.] to 29 [1.e-29 0.e+00]: 1
distance [1. 0.] to 19 [1.e-19 0.e+00]: inf
...
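
One possible explanation, which nothing in this report confirms, is that the squared norm of a stored vector leaves the normal float32 range. The sketch below uses only NumPy and assumes the f32 cosine kernel accumulates squared norms in single precision; `classify_squared_norm` is a hypothetical helper written for illustration, not part of USearch.

```python
import numpy as np

# Smallest positive *normal* float32 value (about 1.18e-38)
F32_MIN_NORMAL = np.finfo(np.float32).tiny


def classify_squared_norm(x: float) -> str:
    """Classify x*x, computed in float32, as 'normal', 'subnormal', or 'zero'.

    Hypothetical helper: it only mimics what a single-precision kernel
    would see for the vector [x, 0.0]; it does not call USearch.
    """
    v = np.float32(x)
    sq = v * v  # float32 * float32 stays in float32
    if sq == 0:
        return "zero"
    if sq < F32_MIN_NORMAL:
        return "subnormal"
    return "normal"


for k in range(10, 30):
    print("1e-%d squared in float32: %s" % (k, classify_squared_norm(10.0 ** -k)))
```

Under IEEE 754 single precision, the squared norm is subnormal for x from 19 to 22 and underflows to exactly zero from x = 23 onward. That lines up with key 19 reporting inf and key 29 reporting 1 in the output above, but whether this is the actual cause is an assumption.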

Expected behavior

Consistent and predictable cosine distance values regardless of vector magnitude. Every vector in the example is parallel to the query, so mathematically each distance is 0.

USearch version

2.8.14 (Python bindings)

Operating System

Ubuntu 22.04.3

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

kamiya@mbj.nifty.com

About this issue

  • State: open
  • Created 7 months ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

@lifthrasiir, there shouldn’t be compatibility or any other issues. This class was significantly refactored in recent releases, which is when the macro condition broke. Before that it definitely worked, and we have enough benchmark coverage in SimSIMD to suggest improvements over autovectorized code.

As for the fast-math settings, I agree that with SimSIMD back ON there shouldn’t be anything left to gain from that flag 🤗

Hi @tos-kamiya! Documenting the expected behavior is indeed important, and I’ll try to improve that. Introducing an additional runtime epsilon parameter to the distance functions is, however, infeasible: all of the metrics must have an identical signature. We just need to find the smallest constant that results in a non-zero square root and use it as the epsilon.
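
The idea of a fixed epsilon can be sketched in NumPy as follows. This is not the USearch implementation. The name `cosine_distance_f32`, the choice of `np.finfo(np.float32).tiny` as the constant, and the values returned for zero-length inputs are all assumptions made for illustration; the comment above does not specify them. The epsilon is a module-level constant, so the function keeps the same two-argument signature as any other metric.

```python
import numpy as np

# Assumed constant: the smallest positive normal float32. Squared norms below
# it are subnormal or zero, and the vector is treated as zero-length.
EPSILON = np.finfo(np.float32).tiny


def cosine_distance_f32(a, b) -> float:
    """Cosine distance in float32 with a fixed epsilon guard (illustrative)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    ab = np.sum(a * b)  # dot product, accumulated in float32
    a2 = np.sum(a * a)  # squared norms, accumulated in float32
    b2 = np.sum(b * b)
    a_is_zero = a2 < EPSILON
    b_is_zero = b2 < EPSILON
    if a_is_zero and b_is_zero:
        return 0.0  # assumed convention: two zero-length vectors coincide
    if a_is_zero or b_is_zero:
        return 1.0  # assumed convention: no direction to compare against
    # Both norms are normal floats here, so the division cannot produce inf
    distance = 1.0 - float(ab / (np.sqrt(a2) * np.sqrt(b2)))
    return max(0.0, distance)  # clamp tiny negative values caused by rounding


query = [1.0, 0.0]
for k in range(10, 30):
    print("1e-%d: %g" % (k, cosine_distance_f32(query, [10.0 ** -k, 0.0])))
```

With this guard, every vector in the reproduction gets either a distance near 0 or exactly 1, and never inf. The trade-off is that the switch from 0 to 1 happens at a fixed magnitude that users would need to see documented.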