lucene: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

Currently, the VectorSimilarityFunction.DOT_PRODUCT function can return negative scores if the input vectors are not normalized. For reference, this is the method:

public float compare(float[] v1, float[] v2) {
  return (1 + dotProduct(v1, v2)) / 2;
}

While the method's javadoc warns that vectors should be normalized before use, I am wondering if we can get rid of this by mapping negative scores into (0, 1) and non-negative scores into [1, Float.MAX_VALUE) with:

public float compare(float[] v1, float[] v2) {
  float dotProd = dotProduct(v1, v2);
  if (dotProd < 0) {
    return 1 / (1 + -1 * dotProd);
  }
  return dotProd + 1;
}

and let the user worry about normalization
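For illustration, here is a small Python sketch (mine, not part of the proposal) that checks the piecewise mapping above stays positive and preserves ordering for a few raw dot-product values:

def scale_dot_product(dot_prod):
    # Negative products map into (0, 1), non-negative products into [1, inf).
    if dot_prod < 0:
        return 1 / (1 + -1 * dot_prod)
    return dot_prod + 1

samples = [-100.0, -1.0, -0.5, 0.0, 0.5, 1.0, 100.0]
scores = [scale_dot_product(d) for d in samples]
assert all(s > 0 for s in scores)                      # never negative
assert all(a < b for a, b in zip(scores, scores[1:]))  # order preserved
print(scores)  # [0.0099..., 0.5, 0.666..., 1.0, 1.5, 2.0, 101.0]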

Related issue: https://github.com/opensearch-project/k-NN/issues/865

Most upvoted comments

My point is that negative scores here are OK, as the optimization constraints traditionally required do not apply.

I meant to convey that it should be OK not to have an upper bound on the produced scores. I still think scores should always be non-negative.

Just to confirm the gamma numbers for the reversed case (as it's the only one where transformed does better), I ran it again, but with up to 2000 neighbors explored. The latency & recall graphs stay steady.

[recall and latency plots for the reversed case, up to 2000 neighbors explored]

Yes, insertion order, i.e. which docs are indexed first.

Ordered is sorted by magnitude in ascending order, so the smallest magnitudes are indexed first.

Reversed is the opposite: descending order by magnitude (largest first).

@benwtrent thank you for re-running, much appreciated! The results look the same, but do we agree that after the transform is done, all three similarities / distances should be equivalent for the purpose of k-NN search? This is due to all documents being normalized to the same length (and query length is the same during the search, simply because it’s the very same query). The equivalence is more obvious when queries and documents all have unit length. However, I think it is still true assuming consistent normalization of documents.
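To make the equivalence claim concrete, here is a small numpy sketch (my own, on synthetic data, using the same "append sqrt(max_norm^2 - ||d||^2)" transform as the scripts below) checking that dot-product, cosine, and Euclidean rankings coincide once documents are transformed and the query gets an extra 0 dimension:

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 16)) * rng.uniform(0.5, 2.0, size=(1000, 1))
query = rng.normal(size=16)

norms = np.linalg.norm(docs, axis=1)
max_norm = norms.max()
# Transformed docs all have length max_norm; the appended 0 keeps query dot products unchanged.
docs_t = np.concatenate([docs, np.sqrt(max_norm**2 - norms**2)[:, None]], axis=1)
query_t = np.concatenate([query, [0.0]])

dot_rank = np.argsort(-(docs_t @ query_t))
cos_rank = np.argsort(-(docs_t @ query_t) / (np.linalg.norm(docs_t, axis=1) * np.linalg.norm(query_t)))
l2_rank = np.argsort(np.linalg.norm(docs_t - query_t, axis=1))

assert np.array_equal(dot_rank, cos_rank)
assert np.array_equal(dot_rank, l2_rank)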

I found another dataset, Yandex Text-to-image: https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search

I tested against the first 500_000 values in the 1M dataset.

It utilizes inner product; looking at the magnitudes, they are all < 1 (ranging from 0.79 to 0.99), so this dataset might not be that useful 😕.

I am just looking for more realistic inner-product search data sets and found this one.

Yandex Text-to-Image

[Reversed, Ordered, and Random recall/latency plots]
code for transforming yandex fbin
import numpy as np


def read_fbin(filename, start_idx=0, chunk_size=None):
    """ Read *.fbin file that contains float32 vectors
    Args:
        :param filename (str): path to *.fbin file
        :param start_idx (int): start reading vectors from this index
        :param chunk_size (int): number of vectors to read.
                                 If None, read all vectors
    Returns:
        Array of float32 vectors (numpy.ndarray)
    """
    with open(filename, "rb") as f:
        nvecs, dim = np.fromfile(f, count=2, dtype=np.int32)
        nvecs = (nvecs - start_idx) if chunk_size is None else chunk_size
        arr = np.fromfile(f, count=nvecs * dim, dtype=np.float32,
                          offset=start_idx * 4 * dim)
    return arr.reshape(nvecs, dim)


def transform_queries(Q):
    n, _ = Q.shape
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)


def transform_docs(D, norms):
    n, d = D.shape
    max_norm = norms.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    # Append sqrt(max_norm^2 - ||d||^2) so every transformed doc ends up with length max_norm.
    transformed_data = np.concatenate([D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32)
    return transformed_data


def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"


def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, f"Shape does not map [{arr1.shape}] vs [{arr2.shape}]"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)

name = "yandex"
np_flat_ds = read_fbin("base.1M.fbin")[0:500_000]
queries = read_fbin("query.public.100k.fbin")[0:10_000]


transformed_queries = transform_queries(queries)
validate_dataset_match_upto_dim(transformed_queries, queries, 200)
with open(f"{name}-transform.test", "w") as out_f:
    transformed_queries.tofile(out_f)
with open(f"{name}.test", "w") as out_f:
    queries.tofile(out_f)

magnitudes = np.linalg.norm(np_flat_ds, axis=1)
indices = np.argsort(magnitudes)
transformed_np_flat_ds = transform_docs(np_flat_ds, magnitudes)
validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds, 200)
transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]
np_flat_ds_sorted = np_flat_ds[indices]
with open(f"{name}.random-transform.train", "w") as out_f:
    transformed_np_flat_ds.tofile(out_f)
with open(f"{name}.ordered-transform.train", "w") as out_f:
    transformed_np_flat_ds_sorted.tofile(out_f)
with open(f"{name}.reversed-transform.train", "w") as out_f:
    np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)

with open(f"{name}.random.train", "w") as out_f:
    np_flat_ds.tofile(out_f)
with open(f"{name}.reversed.train", "w") as out_f:
    np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
with open(f"{name}.ordered.train", "w") as out_f:
    np_flat_ds_sorted.tofile(out_f)

@msokolov In our BEIR paper we talked about this: https://arxiv.org/abs/2104.08663

The issue with cosine similarity is that it just encodes the topic. Take the query "What is PyTorch" and two docs:

A: PyTorch is a framework

B: PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface

A cosine similarity model would retrieve A (it is more on-topic). A dotproduct model would retrieve B, as it is on-topic as well and contains more information on the topic (represented as a longer vector).

I.e. cossim(query, A) > cossim(query, B) but dotprod(query, A) < dotprod(query, B).

So dotproduct models can encode topic + quantity/quality of information, while cosine similarity models just encode topic match (which isn’t perfect for retrieval).
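
A toy numeric example of that inequality (made-up vectors, just to illustrate the mechanism):

import numpy as np

query = np.array([1.0, 1.0, 0.0])
doc_a = np.array([1.0, 1.0, 0.0])   # short, on-topic doc: same direction as the query
doc_b = np.array([2.0, 1.5, 1.0])   # longer doc: same topic, larger magnitude

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(query, doc_a), cos(query, doc_b))        # 1.0  >  ~0.92
print(np.dot(query, doc_a), np.dot(query, doc_b))  # 2.0  <  3.5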

OK, I ran on two more datasets; I only ran over the first 100k documents in each data set.

EDIT: found a minor bug, the ‘baseline’ queries were being overwritten by the ‘transform’ ones (the ones with just a 0 at the 769th index). I reran to verify whether results were affected; they weren't. Mainly because to measure recall, we take the brute-force kNN from the query file and compare it to the approx-kNN. The data sets and the index built were not affected by this bug. If any more are found, I will happily re-run. Putting my M1 MacBook to work 😃.

I think this exhausts our testing for Cohere; we need to find additional data if this is not considered enough.

EN

[Reversed, Ordered, and Random recall/latency plots]

JA

[Reversed, Ordered, and Random recall/latency plots]
updated data creation script (after download)
import numpy as np
import pyarrow.parquet as pq

DATA_SETS =[
    {"name": "wiki768", "files": [
        "train-00000-of-00004-1a1932c9ca1c7152.parquet",
        "train-00001-of-00004-f4a4f5540ade14b4.parquet",
        "train-00002-of-00004-ff770df3ab420d14.parquet",
        "train-00003-of-00004-85b3dbbc960e92ec.parquet",
    ]},{
    "name": "wiki768en", "files": [
       "0-en.parquet",
            "1-en.parquet",
            "2-en.parquet",
            "3-en.parquet",
    ]},
    {"name": "wiki768ja", "files": [
        "0-ja.parquet",
        "1-ja.parquet",
        "2-ja.parquet",
        "3-ja.parquet",
    ]}
]


def transform_queries(Q):
    n, _ = Q.shape
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)


def transform_docs(D, norms):
    n, d = D.shape
    max_norm = norms.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    # Append sqrt(max_norm^2 - ||d||^2) so every transformed doc ends up with length max_norm.
    transformed_data = np.concatenate([D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32)
    return transformed_data


def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"


def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, f"Shape does not map [{arr1.shape}] vs [{arr2.shape}]"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)

for ds in DATA_SETS:
    name = ds["name"]
    tb1 = pq.read_table(ds["files"][0], columns=['emb'])
    tb2 = pq.read_table(ds["files"][1], columns=['emb'])
    tb3 = pq.read_table(ds["files"][2], columns=['emb'])
    tb4 = pq.read_table(ds["files"][3], columns=['emb'])
    np1 = tb1[0].to_numpy()
    np2 = tb2[0].to_numpy()
    np3 = tb3[0].to_numpy()
    np4 = tb4[0].to_numpy()

    np_total = np.concatenate((np1, np2, np3, np4))

    #Have to convert to a list here to get
    #the numpy ndarray's shape correct later
    #There's probably a better way...
    flat_ds = list()
    for vec in np_total:
        flat_ds.append(vec)
    np_flat_ds = np.array(flat_ds)
    row_count = np_flat_ds.shape[0]
    query_count = 10_000
    training_rows = row_count - query_count
    print(f"{name} num rows: {training_rows}")


    transformed_queries = transform_queries(np_flat_ds[training_rows:-1])
    validate_dataset_match_upto_dim(transformed_queries, np_flat_ds[training_rows:-1], 768)
    with open(f"{name}-transform.test", "w") as out_f:
        transformed_queries.tofile(out_f)
    with open(f"{name}.test", "w") as out_f:
        np_flat_ds[training_rows:-1].tofile(out_f)

    magnitudes = np.linalg.norm(np_flat_ds[0:training_rows], axis=1)
    indices = np.argsort(magnitudes)
    transformed_np_flat_ds = transform_docs(np_flat_ds[0:training_rows], magnitudes)
    validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds[0:training_rows], 768)
    transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]
    np_flat_ds_sorted = np_flat_ds[indices]
    with open(f"{name}.random-transform.train", "w") as out_f:
        transformed_np_flat_ds.tofile(out_f)
    with open(f"{name}.ordered-transform.train", "w") as out_f:
        transformed_np_flat_ds_sorted.tofile(out_f)
    with open(f"{name}.reversed-transform.train", "w") as out_f:
        np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)

    with open(f"{name}.random.train", "w") as out_f:
        np.flip(np_flat_ds[0:training_rows], axis=0).tofile(out_f)
    with open(f"{name}.reversed.train", "w") as out_f:
        np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
    with open(f"{name}.ordered.train", "w") as out_f:
        np_flat_ds_sorted.tofile(out_f)

Useful parsing & plotting functions

import re
import matplotlib.pyplot as plt


def parse_console_output(terminal_output):
    # Regular expression patterns to extract recall and latency values
    recall_pattern = r"(?:\n\d+\.\d+)"
    latency_pattern = r"([\t, ]\d+\.\d+\t\d)"

    recall_values = [float(match.strip()) for match in re.findall(recall_pattern, terminal_output)]
    latency_values = [float(match.split()[0]) for match in re.findall(latency_pattern, terminal_output)]
    return (recall_values, latency_values)


def plot_things(name, baseline_recall, baseline_latency, transformed_recall, transform_latency):
    # Plotting series one: transformed_recall vs transform_latency
    plt.plot(transform_latency, transformed_recall, marker='o', label='transformed')

    # Plotting series two: baseline_recall vs baseline_latency
    plt.plot(baseline_latency, baseline_recall, marker='o', label='original (baseline)')

    # Add labels and title
    plt.xlabel('Latency')
    plt.ylabel('Recall')
    plt.title(f"{name} Transformed vs Baseline recall & latency")
    plt.legend()

    # Show the plot
    plt.grid(True)
    plt.show()

To use them:


transformed_terminal_output = """
WARNING: Gnuplot module not present; will not make charts
recall	latency	nDoc	fanout	maxConn	beamWidth	visited	index ms
WARNING: Using incubator modules: jdk.incubator.vector
Jul 28, 2023 12:01:58 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
0.863	 0.32	100000	0	48	200	10	0	1.00	post-filter
...
"""

baseline_terminal_output = """
WARNING: Gnuplot module not present; will not make charts
recall	latency	nDoc	fanout	maxConn	beamWidth	visited	index ms
WARNING: Using incubator modules: jdk.incubator.vector
Jul 28, 2023 12:34:59 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
0.816	 0.23	100000	0	48	200	10	0	1.00	post-filter
...
"""
name = "WikiEN-Reversed"

transformed_recall, transform_latency = parse_console_output(transformed_terminal_output)
baseline_recall, baseline_latency = parse_console_output(baseline_terminal_output)

plot_things(name, baseline_recall, baseline_latency, transformed_recall, transform_latency)
data downloading script

#!/bin/sh


# Japanese
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/0.parquet -o 0-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/1.parquet -o 1-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/33.parquet -o 2-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/34.parquet -o 3-ja.parquet

# English
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/0.parquet -o 0-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/1.parquet -o 1-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/251.parquet -o 2-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/252.parquet -o 3-en.parquet

@benwtrent @jmazanec15 after fiddling for 10 minutes with Google Docs, I made a plot with ChatGPT (maybe I should have asked Bard instead?). Anyway, see the attached notebook and the attached picture. There seems to be virtually no difference between the original and the transformed runs.

That said, for the high-recall area (latency >= 15), the original run is better (compare the two rightmost data points).

In terms of additional evidence required: I think we might want to do due diligence and run more extensive tests, e.g., on several datasets. @jmazanec15 publicizing results won't harm either, b/c we don't want people to do extra work when extra work is not needed. Moreover, if some clients do want to apply the transform for some reason, it can easily be done outside of the retrieval system; there's no need for the retrieval system to support it directly.

Attachment: MIPS_baseline_vs_transformed.ipynb.gz

@searchivarius yes, a Pareto curve will be good. I think this is possible in luceneutil. I want to first just compare with the results Vespa got.

@benwtrent @jmazanec15

Recall seems improved for me. Latency increases in the transformed data. I bet part of this is also the overhead of dealing with CPU execution lanes in Panama, as it's no longer a “nice” number of dimensions.

You need to look at the curves. The transform changes both recall and latency. The key question is: are we still on the same Pareto curve or not? Because if we are, getting higher recall is merely a matter of choosing a larger M or ef, and you do not need to support the transform in Lucene.

Update: I passed -forceMerge to KnnGraphTester and confirmed recall was again 0.991, confirming results above.

@benwtrent I uncommented these 2 lines: https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java#L699-L702 and set max buffer to 1000000.

This is the index as well:

$ ls -la wiki768.train-48-200.index
total 1232744
drwxr-xr-x.  2 ec2-user ec2-user      16384 Jul 20 05:50 .
drwxr-xr-x. 12 ec2-user ec2-user      16384 Jul 19 23:26 ..
-rw-r--r--.  1 ec2-user ec2-user        159 Jul 20 05:50 _6.fdm
-rw-r--r--.  1 ec2-user ec2-user    1619007 Jul 20 05:50 _6.fdt
-rw-r--r--.  1 ec2-user ec2-user       1437 Jul 20 05:50 _6.fdx
-rw-r--r--.  1 ec2-user ec2-user        195 Jul 20 05:50 _6.fnm
-rw-r--r--.  1 ec2-user ec2-user        464 Jul 20 05:50 _6.si
-rw-r--r--.  1 ec2-user ec2-user 1228800100 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vec
-rw-r--r--.  1 ec2-user ec2-user       9625 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vem
-rw-r--r--.  1 ec2-user ec2-user   31838005 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vex
-rw-r--r--.  1 ec2-user ec2-user        154 Jul 20 05:50 segments_6
-rw-r--r--.  1 ec2-user ec2-user          0 Jul 19 22:29 write.lock

Thank you for the detailed information @searchivarius.

Eagerly awaiting your results @jmazanec15 😃

Thank you @jmazanec15: there's also an unpublished paper (I can share the preprint privately) where we benchmarked HNSW for maximum inner product search on 3 datasets and it was just fine (for this paper I did try the reduction to cosine similarity and also got poorer outcomes). In my thesis, I benchmarked SW-graph (which is pretty much HNSW when it comes to the peculiarities of handling inner product search) using an inner-product-like similarity (a fusion of BM25 and MODEL1 scores) and it was fine. See the black asterisk run in Figure 3.2.

Moreover, HNSW and SW-graph were tested with non-metric similarities (see again my thesis and references therein, as well as Yury Malkov's HNSW paper). These methods established SOTA results as well. There is also an extract from the thesis (published separately) that focuses specifically on search with non-metric similarities. Again, things just work.

One may wonder why, right? I think for real datasets the quirky distances don't deviate from the Euclidean distance all that much, so the minimal set of geometric properties required for graph-based retrieval is preserved (and no, I don't think the triangle inequality is required).

Specifically, for inner product search the outcomes are (in many cases) pretty close to the outcomes where the inner product search is replaced with cosine similarity (which is equivalent to L2 search). Why? Because with real embeddings the magnitude of vectors doesn't change all that much.

That said, there are of course degenerate cases (I know one, but embedding models don’t produce such weirdness) where HNSW won’t work with MIPS (or rather recall will be low). However, I am not aware of any realistic one. If you have some interesting examples of real datasets where direct application of HNSW/SW-graph fails, I would love to have a look.

REGARDING THE SCORE SIGN: dot-product scores need not be normalized, but the sign can be changed when the result is returned to the user.

@jmazanec15 have you done any digging into the dot-product scaling and if it provides good recall in the MAX-INNER-PRODUCT search use-case?

https://blog.vespa.ai/announcing-maximum-inner-product-search/ && https://towardsdatascience.com/maximum-inner-product-search-using-nearest-neighbor-search-algorithms-c125d24777ef

These imply there might be some weirdness with HNSW and raw MIP. I am honestly not 100% sure if Lucene has this issue specifically with HNSW.

A key observation in MIP is that a vector is no longer closest to itself, but instead it would be much closer to 2*vector than just vector.
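
A minimal check of that observation (my own snippet, not from the thread):

import numpy as np

v = np.array([0.3, -1.2, 0.7])
print(np.dot(v, v))      # ||v||^2 = 2.02
print(np.dot(v, 2 * v))  # 2 * ||v||^2 = 4.04, so 2*v "scores higher" than v itself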

And then switch the score scaling in Lucene 10 as a breaking change. Or is condoning negative scores under any circumstances a non-starter?

If you are utilizing hybrid search, negative scores would negate the WAND/MAX_SCORE optimizations and slow things down significantly.

We should protect folks from shooting themselves in the foot.

but would add a little confusion.

I agree, there will be confusion. What do you think @uschindler & @msokolov?

Being able to do non-normalized dot-product is an important aspect of recommendation engines and vector search as a whole. My imagination is too poor to come up with a better solution than adding a new similarity function that uses dot-product under the hood and scales differently.

To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction as it’s hard to know ahead of time whether the vector may completely dominate scores.

I say we have that problem now. Vector scores and BM25 are nowhere near on the same scale. Folks need to adjust their boosts accordingly, regardless.

And also backwards compatibility, as MikeS raised.

I agree, BWC is a big deal here. And I suggest we create a new similarity that just uses dot product under the hood. Call it maximum-inner-product.

As for producing a dot product that is scaled for use with arbitrary vectors, I don't see the point really. If what you want is to handle arbitrarily scaled vectors, EUCLIDEAN is a better choice.

Quoting a Slack conversation with @nreimers:

Wow, what a bad implementation by Elastic. Models with unnormalized vectors and dot product work better for search than models with normalized vectors / cosine similarity. Models with cosine similarity have the issue that they often retrieve noise when your dataset gets noisier… The best match for a query (e.g. "What is the capital of the US") with cosine similarity is the query itself, as cossim(query, query)=1. So when your corpus gets bigger and is not carefully cleaned, it contains many short documents that look like queries. These are preferentially retrieved by the model, so the user asks a question and gets as a response a doc that is a paraphrase of the query (e.g. query=“What is the capital of the US”, top-1 hit: “Capital of the US”). Dot product has the tendency to work better when your corpus gets larger / noisier.