langchain: Issue: Problems getting filtering to work with langchain, OpenSearchVectorSearch, similarity_search_with_score

Issue you’d like to raise.

I am using langchain in an AWS Lambda function to query an AWS OpenSearch Serverless vector index. I did not file this as a bug because I suspect it is a usage problem, but it might be a bug.

I have built an OpenSearch vector index, using AWS OpenSearch Serverless, and am using it to perform vector queries against document chunks. This is all working well. I would now like to add the ability to filter these searches by document (there may be hundreds or thousands of vector entries per document). Each entry includes metadata with a DocumentId field, which is a stringified GUID. I thought it would be simple to filter against this metadata attribute, but I cannot get it working.

The call I am making is:

    docs = vsearch.similarity_search_with_score(
        theQuery,
        index_name="vector-index2",
        vector_field = "vector_field",
        search_type="script_scoring",
        pre_filter=filter_dict,
        space_type="cosinesimil",
        engine="faiss",
        k = 3
    )

Typical unfiltered output:

    [{"doc": {"page_content": "seat is installed … facing", "metadata": {"DocumentId": "{8B5D5890-0000-C1E8-B0B6-678C8509665D}", "DocumentTitle": "2023-Toyota-RAV4-OM.pdf", "DocClassName": "Document"}}, "score": 1.8776696}]

For filter_dict, I have tried:

    filter_dict = {'DocumentId': theDocId}                                # gets an error
    filter_dict = {'bool': {'must': {'term': {'DocumentId': theDocId}}}}  # doesn't find anything
    filter_dict = {'bool': {'must': {'match': {'DocumentId': theDocId}}}} # doesn't find anything
    filter_dict = {'must': {'match': {'DocumentId': theDocId}}}           # gets an error
    filter_dict = {'match': {'DocumentId': theDocId}}                     # doesn't find anything

theDocId is a string set to an input parameter and looks like this: {8B5D5890-0000-C1E8-B0B6-678C8509665D}. Results are the same if I use a string constant (i.e. ‘{8B5D5890-0000-C1E8-B0B6-678C8509665D}’)

I get the same results if I use search_type="approximate_search" in conjunction with boolean_filter=filter_dict.

Any suggestions on how to get this to work?

Suggestion:

No response

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Reactions: 1
  • Comments: 16

Most upvoted comments

🤖

Hello,

From your code, it seems like you’re trying to use a method _approximate_search_query_with_boolean_filter that doesn’t exist in the OpenSearchVectorSearch class. This is likely the cause of the error message you’re seeing.

Instead, you should use the similarity_search_with_score method with a boolean_filter argument to filter your search results by the DocumentId field. Here’s how you can do it:

    # Assuming `vsearch` is an instance of `OpenSearchVectorSearch`
    # and `document_id` is the id you want to filter by
    boolean_filter = {"term": {"_id": document_id}}
    results = vsearch.similarity_search_with_score(theQuery, k=3, boolean_filter=boolean_filter)

In this example, boolean_filter is a dictionary that specifies a term filter on the _id field. This will filter the search results to only include documents with the specified document_id.

Please note that the _id field is used to store the DocumentId in the OpenSearch index. If you are using a different field to store the DocumentId, you should replace _id with the name of that field.

The similarity_search_with_score function returns a list of tuples, where each tuple contains a Document and its corresponding similarity score to the query. You can find more details about this function in the source code.
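
As a rough illustration of that return shape, the sketch below walks a list of (Document, score) tuples and keeps only the ones whose metadata carries a given DocumentId. The `Document` class here is a stand-in dataclass rather than the langchain class, `keep_document` is a hypothetical helper, and the results are made-up sample data. Filtering on the client like this only trims the k hits that came back; it is a sanity check, not a replacement for a server-side filter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Document:
    """Stand-in for langchain's Document (assumption: same two attributes)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def keep_document(
    results: List[Tuple[Document, float]], document_id: str
) -> List[Tuple[Document, float]]:
    """Keep only (doc, score) pairs whose metadata DocumentId equals document_id."""
    return [
        (doc, score)
        for doc, score in results
        if doc.metadata.get("DocumentId") == document_id
    ]


# Made-up sample results in the (Document, score) shape described above.
wanted = "{8B5D5890-0000-C1E8-B0B6-678C8509665D}"
results = [
    (Document("sample chunk A", {"DocumentId": wanted}), 1.87),
    (Document("sample chunk B", {"DocumentId": "{11111111-0000-2222-3333-444444444444}"}), 1.52),
]

for doc, score in keep_document(results, wanted):
    print(score, doc.metadata["DocumentId"], doc.page_content)
```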

I hope this helps! Let me know if you have any other questions.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

🤖

@baskaryan, could you please lend your expertise on this matter?



🤖

Hello,

Thank you for reaching out and providing a detailed explanation of your issue. Based on the information you’ve provided and the context of the LangChain repository, it seems like there might be a misunderstanding about the usage of the similarity_search_with_score function.

The similarity_search_with_score function in the OpenSearchVectorSearch class of LangChain does not directly support filtering by metadata fields. This function is designed to return documents and their scores that are most similar to a given query. The function does not take any arguments that would allow for filtering by metadata fields.

However, the OpenSearchVectorSearch class does support filtering by metadata fields in other methods. Specifically, the _approximate_search_query_with_boolean_filter and _approximate_search_query_with_efficient_filter methods allow for the inclusion of a boolean_filter and efficient_filter respectively. These filters can be used to filter results based on metadata fields.

Here’s an example of how you might use the _approximate_search_query_with_boolean_filter method:

    filter_dict = {'term': {'DocumentId': '{8B5D5890-0000-C1E8-B0B6-678C8509665D}'}}
    query_vector = ...  # Your query vector
    vsearch._approximate_search_query_with_boolean_filter(
        query_vector,
        boolean_filter=filter_dict,
        k=3,
        vector_field="vector_field"
    )

Please note that these methods are currently private (as indicated by the leading underscore), which means they are not intended to be used directly. However, you could potentially modify the similarity_search_with_score function to include a filter argument and use these methods internally.

I hope this helps! If you have any further questions, please don’t hesitate to ask.


🤖

Hello,

Thank you for reaching out and providing a detailed explanation of your issue. Based on your description, it seems like you’re trying to use the filter parameter in the similarity_search_with_score function to filter your search results by DocumentId. However, the filter parameter is not supported in the similarity_search_with_score function.

Instead, you should use the boolean_filter or pre_filter parameters, depending on the search_type you’re using. Here’s how you can do it:

  1. Create a dictionary for your filter. The dictionary should specify the field to filter on (in this case, DocumentId) and the value to filter for. For example:

         filter_dict = {"term": {"DocumentId": 123}}

  2. Pass this dictionary as the boolean_filter or pre_filter argument to the similarity_search_with_score function. For example:

         results = langchain_instance.similarity_search_with_score(query, k, boolean_filter=filter_dict)

     or

         results = langchain_instance.similarity_search_with_score(query, k, pre_filter=filter_dict)

Please note that the boolean_filter is used with the Approximate k-NN Search method, while the pre_filter is used with the Script Scoring Search and Painless Scripting Search methods. The choice between these methods depends on your specific use case and requirements.
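
A minimal sketch of that pairing, in case it helps keep the two keywords straight. `filter_kwargs` is a hypothetical helper, not part of langchain; the "painless_scripting" string and the `metadata.DocumentId` field path are assumptions that should be checked against the installed version and the index mapping.

```python
from typing import Any, Dict

# Assumed pairing of search_type with the keyword argument that carries the filter.
_FILTER_KWARG_BY_SEARCH_TYPE = {
    "approximate_search": "boolean_filter",
    "script_scoring": "pre_filter",
    "painless_scripting": "pre_filter",
}


def filter_kwargs(search_type: str, filter_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return search_type plus the filter under the keyword that search type reads."""
    try:
        kwarg = _FILTER_KWARG_BY_SEARCH_TYPE[search_type]
    except KeyError:
        raise ValueError(f"unsupported search_type: {search_type!r}") from None
    return {"search_type": search_type, kwarg: filter_dict}


# Assumption: metadata is nested under a "metadata" object in the index.
doc_filter = {"match": {"metadata.DocumentId": "{8B5D5890-0000-C1E8-B0B6-678C8509665D}"}}
print(filter_kwargs("script_scoring", doc_filter))
print(filter_kwargs("approximate_search", doc_filter))
```

The result could then be splatted into the call, for example `vsearch.similarity_search_with_score(theQuery, k=3, **filter_kwargs("script_scoring", doc_filter))`.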

Here is the supporting evidence from the provided context:

For Approximate k-NN Search with Boolean Filter:

    def _approximate_search_query_with_boolean_filter(
        query_vector: List[float],
        boolean_filter: Dict,
        k: int = 4,
        vector_field: str = "vector_field",
        subquery_clause: str = "must",
    ) -> Dict:
        """For Approximate k-NN Search, with Boolean Filter."""
        return {
            "size": k,
            "query": {
                "bool": {
                    "filter": boolean_filter,
                    subquery_clause: [
                        {"knn": {vector_field: {"vector": query_vector, "k": k}}}
                    ],
                }
            },
        }
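
To see where the filter ends up in the request, here is a small sketch that rebuilds the same body with a toy three-element vector and prints it. The function is a local copy under a different name (`approximate_query_with_boolean_filter`), the vector is made up, and the `metadata.DocumentId` field path is an assumption about how the metadata is stored in the index.

```python
import json
from typing import Any, Dict, List


def approximate_query_with_boolean_filter(
    query_vector: List[float],
    boolean_filter: Dict[str, Any],
    k: int = 4,
    vector_field: str = "vector_field",
    subquery_clause: str = "must",
) -> Dict[str, Any]:
    """Local copy of the query builder quoted above, for inspection only."""
    return {
        "size": k,
        "query": {
            "bool": {
                "filter": boolean_filter,
                subquery_clause: [
                    {"knn": {vector_field: {"vector": query_vector, "k": k}}}
                ],
            }
        },
    }


# Toy vector and assumed field path; real embeddings are much longer.
doc_filter = {"match": {"metadata.DocumentId": "{8B5D5890-0000-C1E8-B0B6-678C8509665D}"}}
body = approximate_query_with_boolean_filter([0.1, 0.2, 0.3], doc_filter, k=3)
print(json.dumps(body, indent=2))
```

The filter sits next to the knn clause inside one bool query. As far as I understand, that means the k nearest neighbours are found first and the filter is applied to them afterwards, so a filtered search can return fewer than k hits, or none, when the nearest chunks belong to other documents.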

For Script Scoring Search:

    def _default_script_query(
        query_vector: List[float],
        k: int = 4,
        space_type: str = "l2",
        pre_filter: Optional[Dict] = None,
        vector_field: str = "vector_field",
    ) -> Dict:
        """For Script Scoring Search, this is the default query."""

        if not pre_filter:
            pre_filter = MATCH_ALL_QUERY

        return {
            "size": k,
            "query": {
                "script_score": {
                    "query": pre_filter,
                    "script": {
                        "source": "knn_score",
                        "lang": "knn",
                        "params": {
                            "field": vector_field,
                            "query_value": query_vector,
                            "space_type": space_type,
                        },
                    },
                }
            },
        }

For Painless Scripting Search:

    def _default_painless_scripting_query(
        query_vector: List[float],
        k: int = 4,
        space_type: str = "l2Squared",
        pre_filter: Optional[Dict] = None,
        vector_field: str = "vector_field",
    ) -> Dict:
        """For Painless Scripting Search, this is the default query."""

        if not pre_filter:
            pre_filter = MATCH_ALL_QUERY

        source = __get_painless_scripting_source(space_type, vector_field=vector_field)
        return {
            "size": k,
            "query": {
                "script_score": {
                    "query": pre_filter,
                    "script": {
                        "source": source,
                        "params": {
                            "field": vector_field,
                            "query_value": query_vector,
                        },
                    },
                }
            },
        }

I hope this helps! If you have any other questions, feel free to ask.

Best, Dosu


🤖

Hello,

Thank you for reaching out and providing a detailed explanation of your issue. It seems like you’re trying to filter your search results based on the DocumentId field in the metadata attribute.

From your description, it appears that you’re trying to use the pre_filter parameter in the similarity_search_with_score function. However, based on the LangChain repository, the similarity_search_with_score function does not seem to have a pre_filter parameter.

Instead, you might want to use the filter parameter to filter your search results. Here’s an example of how you can do this:

    filter_dict = {'metadata.DocumentId': theDocId}

    docs = vsearch.similarity_search_with_score(
        theQuery,
        index_name="vector-index2",
        vector_field="vector_field",
        search_type="script_scoring",
        filter=filter_dict,
        space_type="cosinesimil",
        engine="faiss",
        k=3
    )

In this example, replace theDocId with the ID of the document you want to filter by, and replace theQuery with the query you want to search for. The similarity_search_with_score function will return a list of documents most similar to the query text and cosine distance in float for each, filtered by the specified DocumentId.

Please note that the filter dictionary should contain the field you want to filter by (in this case, ‘metadata.DocumentId’) and the value you want to filter for.

I hope this helps! If you have any other questions or if something is not clear, please let me know.


I finally got it working with the following format.

    boolean_filter = {
        "bool": {
            "should": [
                {"match": {"metadata1": "value1"}},
                {"match": {"metadata2": "value2"}},
            ]
        }
    }

I got a lot of information from this post: https://stackoverflow.com/questions/21933787/elasticsearch-not-returning-results-for-terms-query-against-string-property.

The difference in field type (analyzed vs. not analyzed) is quite tricky. Switching the word “term” to “match” helped a lot. You will probably still need to experiment with the clause (should/must/must_not), but hopefully this helps.
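
For anyone wondering why the switch matters, here is a rough sketch of the analyzed vs. not analyzed difference. `rough_standard_analyzer` is only an imitation of what a default text analyzer does (lowercase, split on anything that is not a letter or digit); the real analyzer has more rules, so treat this as an approximation. The second GUID is made up.

```python
import re
from typing import List


def rough_standard_analyzer(text: str) -> List[str]:
    """Rough imitation of a default text analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def term_would_match(stored: str, queried: str) -> bool:
    """term: the query value is compared verbatim against the stored tokens."""
    return queried in rough_standard_analyzer(stored)


def match_would_match(stored: str, queried: str) -> bool:
    """match: the query value is analyzed too; by default any shared token is a hit."""
    return bool(set(rough_standard_analyzer(stored)) & set(rough_standard_analyzer(queried)))


doc_id = "{8B5D5890-0000-C1E8-B0B6-678C8509665D}"
other_id = "{11111111-0000-2222-3333-444444444444}"  # made-up GUID for comparison

print(rough_standard_analyzer(doc_id))       # five lowercase tokens, no braces
print(term_would_match(doc_id, doc_id))      # False under this imitation
print(match_would_match(doc_id, doc_id))     # True
print(match_would_match(other_id, doc_id))   # True: both contain the token "0000"
```

The last line is worth noticing: with an analyzed field, a plain match can also let through other documents that share a GUID segment. If that turns out to be a problem, two things that may be worth trying are a match with "operator": "and", or a term filter on the field's .keyword subfield, assuming the index was created with the default dynamic mapping for strings.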