langchain: Issue: Problems getting filtering to work with langchain, OpenSearchVectorSearch, similarity_search_with_score
Issue you’d like to raise.
I am using langchain in an AWS Lambda function to query an AWS OpenSearch Serverless vector index. I did not file this as a bug, as I suspect it is a usage problem, but it might be a bug.
I have built an OpenSearch vector index, using AWS OpenSearch Serverless, and am using it to perform vector queries against document chunks. This is all working well. I would now like to add the ability to filter these searches by document (there may be hundreds or thousands of vector entries per document). There is metadata included with each entry, which includes a DocumentId field, a stringified GUID. I thought it would be simple to filter against this metadata attribute, but I cannot get it working.
The call I am making is:
docs = vsearch.similarity_search_with_score(
    theQuery,
    index_name="vector-index2",
    vector_field="vector_field",
    search_type="script_scoring",
    pre_filter=filter_dict,
    space_type="cosinesimil",
    engine="faiss",
    k=3,
)
Typical unfiltered output: [{"doc": {"page_content": "seat is installed … facing", "metadata": {"DocumentId": "{8B5D5890-0000-C1E8-B0B6-678C8509665D}", "DocumentTitle": "2023-Toyota-RAV4-OM.pdf", "DocClassName": "Document"}}, "score": 1.8776696}]
For filter_dict, I have tried:
filter_dict = {'DocumentId': theDocId}  <-- Gets an error
filter_dict = {'bool': {'must': {'term': {'DocumentId': theDocId}}}}  <-- doesn’t find anything
filter_dict = {'bool': {'must': {'match': {'DocumentId': theDocId}}}}  <-- doesn’t find anything
filter_dict = {'must': {'match': {'DocumentId': theDocId}}}  <-- Gets an error
filter_dict = {'match': {'DocumentId': theDocId}}  <-- doesn’t find anything
theDocId is a string set to an input parameter and looks like this: {8B5D5890-0000-C1E8-B0B6-678C8509665D}. Results are the same if I use a string constant (i.e. '{8B5D5890-0000-C1E8-B0B6-678C8509665D}').
Same results if I use search_type="approximate_search" in conjunction with boolean_filter=filter_dict.
Any suggestions on how to get this to work?
Suggestion:
No response
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Reactions: 1
- Comments: 16
🤖
Hello,
From your code, it seems like you’re trying to use a method _approximate_search_query_with_boolean_filter that doesn’t exist in the OpenSearchVectorSearch class. This is likely the cause of the error message you’re seeing.
Instead, you should use the similarity_search_with_score method with a boolean_filter argument to filter your search results by the DocumentId field. Here’s how you can do it:
In this example, boolean_filter is a dictionary that specifies a term filter on the _id field. This will filter the search results to only include documents with the specified document_id.
Please note that the _id field is used to store the DocumentId in the OpenSearch index. If you are using a different field to store the DocumentId, you should replace _id with the name of that field.
The similarity_search_with_score function returns a list of tuples, where each tuple contains a Document and its corresponding similarity score to the query. You can find more details about this function in the source code.
I hope this helps! Let me know if you have any other questions.
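A minimal sketch of a filtered call along these lines, for readers who want something concrete. It assumes the metadata is stored under a "metadata" object, as the unfiltered output in the question suggests, so the field path used here is metadata.DocumentId rather than _id. The helper names build_document_filter and search_one_document are hypothetical, and store stands for an already configured OpenSearchVectorSearch instance. Treat the field path and the choice of a match query as assumptions to check against your own index mapping.

```python
from typing import Any, Dict, List, Tuple


def build_document_filter(doc_id: str, field: str = "metadata.DocumentId") -> Dict[str, Any]:
    """Build a bool query that keeps only chunks belonging to one document.

    `field` is an assumption; change it to match your index mapping.
    operator="and" requires every token of the analyzed id to match.
    """
    return {
        "bool": {
            "must": [
                {"match": {field: {"query": doc_id, "operator": "and"}}}
            ]
        }
    }


def search_one_document(store: Any, query: str, doc_id: str, k: int = 3) -> List[Tuple[Any, float]]:
    """Run an approximate k-NN search restricted to a single document.

    `store` is expected to expose similarity_search_with_score, as
    OpenSearchVectorSearch does; the keyword arguments mirror the ones
    used in the question.
    """
    return store.similarity_search_with_score(
        query,
        k=k,
        search_type="approximate_search",
        boolean_filter=build_document_filter(doc_id),
        vector_field="vector_field",
    )
```

Usage would look like docs = search_one_document(vsearch, theQuery, theDocId). If the filter still returns nothing, inspecting the index mapping to see how DocumentId is actually typed is usually the quickest way to find the right field path.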
Sources
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
🤖
@baskaryan, could you please lend your expertise on this matter?
🤖
Hello,
Thank you for reaching out and providing a detailed explanation of your issue. Based on the information you’ve provided and the context of the LangChain repository, it seems like there might be a misunderstanding about the usage of the similarity_search_with_score function.
The similarity_search_with_score function in the OpenSearchVectorSearch class of LangChain does not directly support filtering by metadata fields. This function is designed to return documents and their scores that are most similar to a given query. The function does not take any arguments that would allow for filtering by metadata fields.
However, the OpenSearchVectorSearch class does support filtering by metadata fields in other methods. Specifically, the _approximate_search_query_with_boolean_filter and _approximate_search_query_with_efficient_filter methods allow for the inclusion of a boolean_filter and efficient_filter respectively. These filters can be used to filter results based on metadata fields.
Here’s an example of how you might use the _approximate_search_query_with_boolean_filter method:
Please note that these methods are currently private (as indicated by the leading underscore), which means they are not intended to be used directly. However, you could potentially modify the similarity_search_with_score function to include a filter argument and use these methods internally.
I hope this helps! If you have any further questions, please don’t hesitate to ask.
🤖
Hello,
Thank you for reaching out and providing a detailed explanation of your issue. Based on your description, it seems like you’re trying to use the filter parameter in the similarity_search_with_score function to filter your search results by DocumentId. However, the filter parameter is not supported in the similarity_search_with_score function.
Instead, you should use the boolean_filter or pre_filter parameters, depending on the search_type you’re using. Here’s how you can do it:
1. Define a filter dictionary that contains the field to filter by (in this case, DocumentId) and the value to filter for. For example:
2. Pass the filter dictionary as the boolean_filter or pre_filter argument to the similarity_search_with_score function. For example:
or
Please note that the boolean_filter is used with the Approximate k-NN Search method, while the pre_filter is used with the Script Scoring Search and Painless Scripting Search methods. The choice between these methods depends on your specific use case and requirements.
Here is the supporting evidence from the provided context:
For Approximate k-NN Search with Boolean Filter:
For Script Scoring Search:
For Painless Scripting Search:
I hope this helps! If you have any other questions, feel free to ask.
Best, Dosu
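The pairing described above (boolean_filter with approximate search, pre_filter with script scoring and painless scripting) can be captured in a small helper so that the filter is always passed under the keyword that matches the search type. This is only a sketch based on the description in this comment; filter_kwargs and FILTER_KWARG_BY_SEARCH_TYPE are hypothetical names, and the mapping should be checked against the langchain version you have installed.

```python
from typing import Any, Dict

# Which keyword argument carries the filter for each search_type,
# as described in the comment above (an assumption to verify locally).
FILTER_KWARG_BY_SEARCH_TYPE = {
    "approximate_search": "boolean_filter",
    "script_scoring": "pre_filter",
    "painless_scripting": "pre_filter",
}


def filter_kwargs(search_type: str, filter_query: Dict[str, Any]) -> Dict[str, Any]:
    """Return the keyword arguments that pair a search type with its filter."""
    try:
        filter_name = FILTER_KWARG_BY_SEARCH_TYPE[search_type]
    except KeyError:
        raise ValueError(f"unsupported search_type: {search_type!r}")
    return {"search_type": search_type, filter_name: filter_query}
```

A call would then look like vsearch.similarity_search_with_score(theQuery, k=3, **filter_kwargs("script_scoring", filter_dict)), which avoids silently passing pre_filter to a search type that ignores it.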
🤖
Hello,
Thank you for reaching out and providing a detailed explanation of your issue. It seems like you’re trying to filter your search results based on the DocumentId field in the metadata attribute.
From your description, it appears that you’re trying to use the pre_filter parameter in the similarity_search_with_score function. However, based on the LangChain repository, the similarity_search_with_score function does not seem to have a pre_filter parameter.
Instead, you might want to use the filter parameter to filter your search results. Here’s an example of how you can do this:
In this example, replace theDocId with the ID of the document you want to filter by, and replace theQuery with the query you want to search for. The similarity_search_with_score function will return a list of documents most similar to the query text and cosine distance in float for each, filtered by the specified DocumentId.
Please note that the filter dictionary should contain the field you want to filter by (in this case, 'metadata.DocumentId') and the value you want to filter for.
I hope this helps! If you have any other questions or if something is not clear, please let me know.
If you want to continue the conversation, start your reply with @dosu-bot.
I finally got it working with the following format.
boolean_filter = {"bool": {"should": [{"match": {"metadata1": "value1"}}, {"match": {"metadata2": "value2"}}]}}
I got a lot of information from this post: https://stackoverflow.com/questions/21933787/elasticsearch-not-returning-results-for-terms-query-against-string-property.
The difference in field type (analyzed vs. not analyzed) is quite tricky. I switched the word "term" to "match" and it helped a lot. You will probably still need to play with the format a little (should/must/must_not), but hopefully this helps.
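To illustrate why swapping "term" for "match" can matter on an analyzed text field, here is a rough, self-contained simulation. The rough_standard_analyze function is a hypothetical stand-in that only approximates what a default text analyzer does (lowercase, split on non-alphanumeric characters); it is not the real analyzer. The metadata.DocumentId and metadata.DocumentId.keyword field paths in the comments are assumptions about a dynamically mapped index, so check your own mapping.

```python
import re
from typing import List


def rough_standard_analyze(text: str) -> List[str]:
    """Approximate a default text analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


doc_id = "{8B5D5890-0000-C1E8-B0B6-678C8509665D}"

# What an analyzed text field would roughly store for this value.
indexed_tokens = rough_standard_analyze(doc_id)

# A term query compares the raw, unanalyzed value against the stored tokens,
# so the full GUID with braces and uppercase letters never matches.
term_matches = doc_id in indexed_tokens

# A match query analyzes the query string first, so its tokens line up
# with the stored tokens.
query_tokens = rough_standard_analyze(doc_id)
match_matches = all(token in indexed_tokens for token in query_tokens)

# Two filter shapes that follow from this (field paths are assumptions):
analyzed_filter = {"match": {"metadata.DocumentId": {"query": doc_id, "operator": "and"}}}
exact_filter = {"term": {"metadata.DocumentId.keyword": doc_id}}

print(indexed_tokens)  # ['8b5d5890', '0000', 'c1e8', 'b0b6', '678c8509665d']
print(term_matches)    # False
print(match_matches)   # True
```

One caveat with a plain match query is that, by default, it matches if any token matches, so two GUIDs that share a segment such as 0000 could both pass the filter. Setting operator to "and", or using a term query against a keyword sub-field if the mapping has one, is stricter.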