usearch: Bug: Inconsistencies in Cosine Distance Calculation for Near-Zero Length Vectors
Describe the bug
I observed inconsistent cosine distance values when working with vectors of very small magnitude. After adding such vectors to the index and running a search, the cosine distances between the query vector and these near-zero vectors varied unpredictably, with values of 0, 1, or infinity.
Environment Details:
- Usearch Version: 2.8.14
- Python Version: 3.10.12
- OS: Ubuntu 22.04.3 LTS
- CPU: Intel® Xeon® W-2235 CPU @ 3.80GHz
Steps to reproduce
- Create an index with 2-dimensional vectors and ‘cos’ metric.
- Add vectors of the form [1.0e-x, 0.0] for x ranging from 10 to 29 to the index.
- Perform a search with the query vector [1.0, 0.0].
Code Example:
import ast
import numpy as np
from usearch.index import Index, Matches

index = Index(ndim=2, metric='cos', dtype='f32')

# Add vectors [1.0e-k, 0.0] for k = 10..29, keyed by k
kvs = {}
for k in range(10, 30):
    v = np.array([ast.literal_eval("1.0e-%d" % k), 0.0])
    kvs[k] = v
    index.add(k, v)

# Search with a unit vector pointing in the same direction
query = np.array([1.0, 0.0])
matches: Matches = index.search(query, 100)
for m in matches:
    print("distance %s to %s %s: %g" % (query, m.key, kvs[m.key], m.distance))
Execution Example:
distance [1. 0.] to 10 [1.e-10 0.e+00]: 0
distance [1. 0.] to 11 [1.e-11 0.e+00]: 0
...
distance [1. 0.] to 29 [1.e-29 0.e+00]: 1
distance [1. 0.] to 19 [1.e-19 0.e+00]: inf
...
Expected behavior
Consistent and predictable cosine distance values for vectors, regardless of their magnitude.
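For reference, a minimal NumPy sketch of the value one would expect here: every stored vector points in the same direction as the query, so the cosine distance computed in float64 is 0 for all of them. The second part inspects the squared magnitudes in float32, which may be one contributor to the inconsistent results (an assumption, not something confirmed in this report). The names cosine_distance_f64, reference, and squared_f32 are illustrative and not part of USearch.

```python
import numpy as np

def cosine_distance_f64(a, b):
    """Reference cosine distance computed in float64."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = [1.0, 0.0]

# Expected distances: all vectors are parallel to the query.
reference = {k: cosine_distance_f64(query, [10.0 ** -k, 0.0]) for k in range(10, 30)}

# Squared first components in float32: these shrink towards the
# bottom of the float32 range and eventually underflow to zero.
tiny = np.finfo(np.float32).tiny  # smallest positive normal float32
squared_f32 = {k: np.float32(10.0 ** -k) * np.float32(10.0 ** -k) for k in range(10, 30)}

for k in range(10, 30):
    sq = squared_f32[k]
    label = "zero" if sq == 0 else ("below normal range" if sq < tiny else "normal")
    print("k=%d reference=%g squared_f32=%g (%s)" % (k, reference[k], sq, label))
```

Whether a given kernel actually hits these ranges depends on how it accumulates and normalizes, so this is only a way to see where float32 runs out of precision for the inputs above.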
USearch version
2.8.14 (Python bindings)
Operating System
Ubuntu 22.04.3
Hardware architecture
x86
Which interface are you using?
Python bindings
Contact Details
kamiya@mbj.nifty.com
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: open
- Created 7 months ago
- Comments: 16 (8 by maintainers)
Commits related to this issue
- Fix: SimSIMD dispatch Related to #320 — committed to unum-cloud/usearch by ashvardanian 5 months ago
- Build: Released 2.8.16 [skip ci] ## [2.8.16](https://github.com/unum-cloud/usearch/compare/v2.8.15...v2.8.16) (2024-01-24) ### Docs * Downloads numbers ([13cc624](https://github.com/unum-cloud/usea... — committed to unum-cloud/usearch by semantic-release-bot 5 months ago
@lifthrasiir, there shouldn’t be compatibility or any other issues. This class was significantly refactored in the last releases, when the macro condition was broken. Prior to this, it definitely worked, and we have enough benchmark coverage in SimSIMD to suggest improvements over autovectorized code.
As for fast-math settings, I agree that with SimSIMD back ON, there shouldn’t be anything left to gain from that flag 🤗
Hi @tos-kamiya! Documenting the expected behavior is indeed important, and I’ll try to improve there. Introducing an additional runtime epsilon parameter to distance functions is, however, infeasible: all of the metrics must have an identical signature. We just need to find the smallest constant that results in a non-zero square root and use it as an epsilon.
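The idea of a fixed epsilon can be sketched in NumPy as follows. This is not the USearch or SimSIMD implementation: the function name guarded_cosine_distance and the constant EPSILON are hypothetical, and the choice of the smallest positive normal float32 as the constant is an assumption made for illustration. The epsilon is a module-level constant, so the function keeps the plain two-argument signature.

```python
import numpy as np

# Hypothetical fixed constant: the smallest positive normal float32.
# A real implementation might pick a different value.
EPSILON = np.finfo(np.float32).tiny

def guarded_cosine_distance(a, b):
    """Cosine distance in float32 with squared norms clamped to EPSILON."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    ab = np.dot(a, b)
    # Clamp so the square roots below can never be zero.
    a2 = max(np.dot(a, a), EPSILON)
    b2 = max(np.dot(b, b), EPSILON)
    return float(1 - ab / (np.sqrt(a2) * np.sqrt(b2)))

query = [1.0, 0.0]
distances = {k: guarded_cosine_distance(query, [10.0 ** -k, 0.0]) for k in range(10, 30)}
for k, d in distances.items():
    print("k=%d distance=%g" % (k, d))
```

The clamp only guarantees a finite result. Once a squared norm falls below the constant, the distance drifts away from the true value of 0 towards 1, so documenting the supported magnitude range would still matter.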