lucene: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores
Currently, the VectorSimilarityFunction.DOT_PRODUCT function can return negative scores if the input vectors are not normalized. For reference, this is the method:
public float compare(float[] v1, float[] v2) {
  // assumes unit-length vectors; with non-normalized input, dotProduct can be < -1
  // and the returned score goes negative
  return (1 + dotProduct(v1, v2)) / 2;
}
While the method javadoc warns to normalize the vectors before use, I am wondering if we can get rid of this by mapping negative scores between 0 and 1 and positive scores between 1 and Float.MAX_VALUE with:
float dotProd = dotProduct(v1, v2);
if (dotProd < 0) {
  // map negative raw scores into (0, 1)
  return 1 / (1 + -1 * dotProd);
}
// map non-negative raw scores into [1, Float.MAX_VALUE]
return dotProd + 1;
and let the user worry about normalization
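For a quick sanity check of that mapping, here is a small self-contained sketch (toy values of my own, not part of any patch) showing that the scaling stays positive and monotonic, so ranking order is preserved:

```java
public class ScaleDotProductDemo {
  // negative raw dot products land in (0, 1), non-negative ones in [1, Float.MAX_VALUE]
  static float scale(float dotProd) {
    if (dotProd < 0) {
      return 1 / (1 + -1 * dotProd); // e.g. -3 -> 0.25
    }
    return dotProd + 1; // e.g. 0 -> 1, 5 -> 6
  }

  public static void main(String[] args) {
    for (float d : new float[] {-3f, -0.5f, 0f, 0.5f, 5f}) {
      System.out.println(d + " -> " + scale(d)); // always > 0, strictly increasing
    }
  }
}
```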
Related issue: https://github.com/opensearch-project/k-NN/issues/865
About this issue
- State: closed
- Created a year ago
- Comments: 84 (64 by maintainers)
Commits related to this issue
- GITHUB#12342 Add new maximum inner product vector similarity method (#12479) The current dot-product score scaling and similarity implementation assumes normalized vectors. This disregards informatio... — committed to apache/lucene by benwtrent 10 months ago
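For context, a minimal sketch of switching indexing over to the similarity those commits added (assuming the MAXIMUM_INNER_PRODUCT enum value available in recent Lucene releases; not a complete indexing example):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

class MipIndexingSketch {
  // Index a raw (non-normalized) embedding with the new similarity instead of DOT_PRODUCT,
  // so scores stay non-negative without requiring unit-length vectors.
  static Document embeddingDoc(float[] vector) {
    Document doc = new Document();
    doc.add(new KnnFloatVectorField(
        "embedding", vector, VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT));
    return doc;
  }
}
```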
What I was trying to convey is that it should be OK not to have an upper bound on the produced scores. I still think scores should always be non-negative.
Just to confirm the gamma numbers for the reversed case (as it's the only one where the transformed setup does better), I ran it again, but with up to 2000 neighbors explored. The latency & recall graphs stay steady.
Yes, insertion order. So which docs are indexed first.
Ordered is sorted by magnitude in ascending order, so the smallest magnitudes are indexed first.
Reverse is the opposite: sorted by magnitude in descending order (largest first).
@benwtrent thank you for re-running, much appreciated! The results look the same, but do we agree that after the transform is done, all three similarities / distances should be equivalent for the purpose of k-NN search? This is due to all documents being normalized to the same length (and query length is the same during the search, simply because it’s the very same query). The equivalence is more obvious when queries and documents all have unit length. However, I think it is still true assuming consistent normalization of documents.
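For readers skimming the thread: my reading of "the transform" being benchmarked here is the standard MIPS-to-cosine/L2 reduction (append an extra component to each document so all documents end up with the same length, and append a 0 to each query). A rough sketch under that assumption; the actual scripts used in these runs may differ:

```java
import java.util.Arrays;

class MipsToL2Transform {
  // Documents get one extra component sqrt(maxNorm^2 - ||d||^2), so every transformed
  // document has length maxNorm; queries get a 0 appended, so dot(q', d') == dot(q, d)
  // and inner-product ranking is preserved under cosine / Euclidean search.
  static float[][] transformDocs(float[][] docs) {
    float maxNorm = 0f;
    for (float[] d : docs) {
      maxNorm = Math.max(maxNorm, norm(d));
    }
    float[][] out = new float[docs.length][];
    for (int i = 0; i < docs.length; i++) {
      float[] d = docs[i];
      out[i] = Arrays.copyOf(d, d.length + 1);
      out[i][d.length] = (float) Math.sqrt(maxNorm * maxNorm - norm(d) * norm(d));
    }
    return out;
  }

  // e.g. the "0 at the 769th index" mentioned elsewhere in this thread for 768-dim queries
  static float[] transformQuery(float[] q) {
    return Arrays.copyOf(q, q.length + 1);
  }

  static float norm(float[] v) {
    double sum = 0;
    for (float x : v) {
      sum += x * x;
    }
    return (float) Math.sqrt(sum);
  }
}
```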
I found another dataset, Yandex Text-to-image: https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search
I tested against the first 500_000 values in the 1M dataset.
It utilizes inner-product, but looking at the magnitudes, they are all < 1 (ranging from 0.79 to 0.99), so this dataset might not be that useful 😕.
I am just looking for more realistic inner-product search data sets and found this one.
Yandex Image-to-Text
code for transforming yandex fbin
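The original script isn't reproduced here; below is a minimal sketch of reading an .fbin file, assuming the usual big-ann-benchmarks layout (int32 vector count, int32 dimension, then count × dimension little-endian float32 values). The actual transform code used above may look different:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class FbinReader {
  // Reads up to maxVectors vectors from an .fbin file:
  // [int32 numVectors][int32 dim][numVectors * dim * float32], little-endian.
  static float[][] read(String path, int maxVectors) throws IOException {
    try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
      byte[] header = new byte[8];
      in.readFully(header);
      ByteBuffer hb = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
      int numVectors = Math.min(hb.getInt(), maxVectors);
      int dim = hb.getInt();
      float[][] vectors = new float[numVectors][dim];
      byte[] row = new byte[dim * Float.BYTES];
      for (int i = 0; i < numVectors; i++) {
        in.readFully(row);
        ByteBuffer.wrap(row).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vectors[i]);
      }
      return vectors;
    }
  }
}
```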
@msokolov In our BEIR paper we talked about this: https://arxiv.org/abs/2104.08663
The issue with cosine similarity is that it just encodes the topic. Take the query "What is PyTorch?" and two docs:
A: "PyTorch is a framework"
B: "PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface."
A cosine similarity model would retrieve A (it is more on-topic). A dot-product model would retrieve B, as it is on-topic as well and contains more information on the topic (represented as a longer vector).
I.e. cossim(query, A) > cossim(query, B) but dotprod(query, A) < dotprod(query, B).
So dotproduct models can encode topic + quantity/quality of information, while cosine similarity models just encode topic match (which isn’t perfect for retrieval).
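A toy numeric illustration of that ordering flip (made-up two-dimensional vectors, not from the paper):

```java
// cosine prefers the short on-topic doc A, dot product prefers the longer on-topic doc B
public class CosVsDotDemo {
  public static void main(String[] args) {
    float[] q = {1f, 0f};   // query
    float[] a = {1f, 0.1f}; // short answer: small magnitude
    float[] b = {3f, 1f};   // long answer: larger magnitude
    System.out.println(cos(q, a) + " > " + cos(q, b)); // ~0.995 > ~0.949
    System.out.println(dot(q, a) + " < " + dot(q, b)); // 1.0 < 3.0
  }

  static float dot(float[] x, float[] y) {
    float sum = 0;
    for (int i = 0; i < x.length; i++) {
      sum += x[i] * y[i];
    }
    return sum;
  }

  static float cos(float[] x, float[] y) {
    return dot(x, y) / (float) (Math.sqrt(dot(x, x)) * Math.sqrt(dot(y, y)));
  }
}
```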
OK ran on two more datasets, I only ran over the first 100k documents in my data sets.
EDIT: found a minor bug, the 'baseline' queries were being overwritten by the 'transform' ones (the ones with just a 0 at the 769th index). I reran to verify whether the results were affected; they weren't. Mainly because to measure recall, we take the brute-force kNN from the query file and compare it to the approx-kNN. The data sets and the index built were not affected by this bug. If any more are found, I will happily re-run. Putting my M1 MacBook to work 😃. I think this exhausts our testing for Cohere; we need to find additional data if this is not considered enough.
EN and JA result plots
updating data creation (after download)
Useful parsing & plotting functions
To use them:
data downloading script
@benwtrent @jmazanec15 after fiddling for 10 minutes with Google docs, I made a plot with Chat GPT (maybe I should have asked Bard instead)? Anyways, see the attached notebook and the attached picture. There seems to be virtually no difference between the original and the transformed runs.
That said, in the high-recall area (latency >= 15), the original run is better (compare the two rightmost data points).
In terms of additional evidence required: I think we might want to do due diligence and run more extensive tests, e.g., on several datasets. @jmazanec15 publicizing results won't harm either, b/c we don't want people to do extra work when it is not needed. Moreover, if some clients do want to apply the transform for some reason, it can easily be done outside of the retrieval system; there's no need for the retrieval system to support it directly.
@searchivarius yes, a Pareto curve would be good. I think this is possible in luceneutil. I want to first just compare with the results Vespa got.
@benwtrent @jmazanec15
You need to look at the curves. The transform changes both recall and latency. The key question is: are we still on the same Pareto curve or not? Because if we are, getting a higher recall is merely a matter of choosing a larger M or ef, and you do not need to support the transform in Lucene.
Update: I passed -forceMerge to KnnGraphTester and confirmed recall was again 0.991, confirming results above.
@benwtrent I uncommented these 2 lines: https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java#L699-L702 and set max buffer to 1000000.
This is the index as well:
Thank you for the deep information @searchivarius .
eagerly awaiting your results @jmazanec15 😃
thank you @jmazanec15 : there’s also an unpublished paper (I can share the preprint privately) where we benchmarked HNSW for maximum inner product search on 3 datasets and it was just fine (for this paper I did try the reduction to the cosine similarity and I also got poorer outcomes). In my thesis, I benchmarked SW-graph (which is pretty much HNSW when it comes to peculiarities of handling the inner product search) using an inner-product like similarity (fusion of BM25 and MODEL1 scores) and it was fine. See the black asterisk run in Figure 3.2.
Moreover, HNSW and SW-graph were tested with non-metric similarities (see again my thesis and references therein) as well as in Yury Malkov's HNSW paper. These methods established SOTA results as well. There is also an extract from the thesis (published separately) that focuses specifically on search with non-metric similarities. Again, things just work.
One may wonder why, right? I think for real datasets the quirky distances don’t deviate from the Euclidean distances all that much so the minimal set of geometric properties required for graph based retrieval is preserved (and no I don’t think the triangle inequality is required).
Specifically, for inner product search the outcomes are pretty close (in many cases) to the outcomes where the inner product search is replaced with cosine similarity (which is equivalent to L2 search). Why? Because with real embeddings the magnitude of the vectors doesn't change all that much.
That said, there are of course degenerate cases (I know one, but embedding models don’t produce such weirdness) where HNSW won’t work with MIPS (or rather recall will be low). However, I am not aware of any realistic one. If you have some interesting examples of real datasets where direct application of HNSW/SW-graph fails, I would love to have a look.
REGARDING THE SCORE SIGN: dot-product scores need not be normalized, but the sign can be changed when the result is returned to the user.
@jmazanec15 have you done any digging into the dot-product scaling and if it provides good recall in the MAX-INNER-PRODUCT search use-case?
https://blog.vespa.ai/announcing-maximum-inner-product-search/ && https://towardsdatascience.com/maximum-inner-product-search-using-nearest-neighbor-search-algorithms-c125d24777ef
Implies there might be some weirdness with HNSW and raw MIP. I am honestly not 100% sure if Lucene has this issue specifically with HNSW.
A key observation in MIP is that a vector is no longer closest to itself; instead, it would be much closer to 2*vector than to just vector. If you are utilizing hybrid search, negating WAND/MAX_SCORE will slow things down significantly.
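To spell out that observation with concrete numbers (my own toy example): with v = (1, 2, 3), dot(v, v) = 14 while dot(v, 2*v) = 28, so the scaled copy 2*v outranks v as v's own nearest neighbor under raw inner product.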
We should protect folks from shooting themselves in the foot.
I agree, there will be confusion. What do you think @uschindler & @msokolov ?
Being able to do non-normalized dot-product is an important aspect of recommendation engines and vector search as a whole. My imagination is too poor to come up with a better solution than adding a new similarity function that uses dot-product under the hood and scales differently.
I say we have that problem now. Vector scores and BM25 are nowhere near on the same scale. Folks need to adjust their boosts accordingly, regardless.
I agree, BWC is a big deal here. And I suggest we create a new similarity that just uses dot product under the hood. Call it maximum-inner-product.
Quoting a Slack conversation with @nreimers: