langchain: Issue: Problems getting filtering to work with langchain, OpenSearchVectorSearch, similarity_search_with_score
Issue you’d like to raise.
I am using LangChain in an AWS Lambda function to call an AWS OpenSearch Serverless vector index. I did not file this as a bug because I suspect it is a usage problem, but it might be a bug.
I have built an OpenSearch vector index using AWS OpenSearch Serverless and am using it to perform vector queries against document chunks. This is all working well. I would now like to filter these searches by document (there may be hundreds or thousands of vector entries per document). Each entry carries metadata, including a DocumentId field, which is a stringified GUID. I thought it would be simple to filter on this metadata attribute, but I cannot get it working.
The call I am making is:
docs = vsearch.similarity_search_with_score(
    theQuery,
    index_name="vector-index2",
    vector_field="vector_field",
    search_type="script_scoring",
    pre_filter=filter_dict,
    space_type="cosinesimil",
    engine="faiss",
    k=3,
)
Typical unfiltered output: [{"doc": {"page_content": "seat is installed … facing", "metadata": {"DocumentId": "{8B5D5890-0000-C1E8-B0B6-678C8509665D}", "DocumentTitle": "2023-Toyota-RAV4-OM.pdf", "DocClassName": "Document"}}, "score": 1.8776696}]
For filter_dict, I have tried:
filter_dict = {'DocumentId': theDocId}  <-- Gets an error
filter_dict = {'bool': {'must': {'term': {'DocumentId': theDocId}}}}  <-- doesn't find anything
filter_dict = {'bool': {'must': {'match': {'DocumentId': theDocId}}}}  <-- doesn't find anything
filter_dict = {'must': {'match': {'DocumentId': theDocId}}}  <-- Gets an error
filter_dict = {'match': {'DocumentId': theDocId}}  <-- doesn't find anything
theDocId is a string taken from an input parameter and looks like this: {8B5D5890-0000-C1E8-B0B6-678C8509665D}. The results are the same if I use a string constant (i.e. '{8B5D5890-0000-C1E8-B0B6-678C8509665D}').
I get the same results if I use search_type="approximate_search" in conjunction with boolean_filter=filter_dict.
Any suggestions on how to get this to work?
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Reactions: 1
- Comments: 16
🤖
Hello,
From your code, it seems like you’re trying to use a method, _approximate_search_query_with_boolean_filter, that doesn’t exist in the OpenSearchVectorSearch class. This is likely the cause of the error message you’re seeing.
Instead, you should use the similarity_search_with_score method with a boolean_filter argument to filter your search results by the DocumentId field. Here’s how you can do it:
In this example, boolean_filter is a dictionary that specifies a term filter on the _id field. This will filter the search results to only include documents with the specified document_id.
Please note that the _id field is used to store the DocumentId in the OpenSearch index. If you are using a different field to store the DocumentId, you should replace _id with the name of that field.
The similarity_search_with_score function returns a list of tuples, where each tuple contains a Document and its corresponding similarity score to the query. You can find more details about this function in the source code.
I hope this helps! Let me know if you have any other questions.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
🤖
@baskaryan, could you please lend your expertise on this matter?
🤖
Hello,
Thank you for reaching out and providing a detailed explanation of your issue. Based on the information you’ve provided and the context of the LangChain repository, it seems like there might be a misunderstanding about the usage of the similarity_search_with_score function.
The similarity_search_with_score function in the OpenSearchVectorSearch class of LangChain does not directly support filtering by metadata fields. This function is designed to return documents and their scores that are most similar to a given query. The function does not take any arguments that would allow for filtering by metadata fields.
However, the OpenSearchVectorSearch class does support filtering by metadata fields in other methods. Specifically, the _approximate_search_query_with_boolean_filter and _approximate_search_query_with_efficient_filter methods allow for the inclusion of a boolean_filter and efficient_filter respectively. These filters can be used to filter results based on metadata fields.
Here’s an example of how you might use the _approximate_search_query_with_boolean_filter method:
Please note that these methods are currently private (as indicated by the leading underscore), which means they are not intended to be used directly. However, you could potentially modify the similarity_search_with_score function to include a filter argument and use these methods internally.
I hope this helps! If you have any further questions, please don’t hesitate to ask.
🤖
Hello,
Thank you for reaching out and providing a detailed explanation of your issue. Based on your description, it seems like you’re trying to use the filter parameter in the similarity_search_with_score function to filter your search results by DocumentId. However, the filter parameter is not supported in the similarity_search_with_score function.
Instead, you should use the boolean_filter or pre_filter parameters, depending on the search_type you’re using. Here’s how you can do it:
- Define a filter that names the field to filter on (DocumentId) and the value to filter for. For example:
- Pass the filter as the boolean_filter or pre_filter argument to the similarity_search_with_score function. For example:
or
Please note that the boolean_filter is used with the Approximate k-NN Search method, while the pre_filter is used with the Script Scoring Search and Painless Scripting Search methods. The choice between these methods depends on your specific use case and requirements.
Here is the supporting evidence from the provided context:
For Approximate k-NN Search with Boolean Filter:
For Script Scoring Search:
For Painless Scripting Search:
I hope this helps! If you have any other questions, feel free to ask.
Best, Dosu
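The pairing described in this reply (boolean_filter with approximate search, pre_filter with script scoring and Painless scripting) can be sketched as a small helper that picks the keyword argument from the search type. filter_kwargs is a hypothetical helper, not part of LangChain. The strings "approximate_search" and "script_scoring" appear in the question; "painless_scripting" is assumed here to be the matching identifier for Painless Scripting Search.

```python
from typing import Any, Dict


def filter_kwargs(search_type: str, filter_query: Dict[str, Any]) -> Dict[str, Any]:
    """Return keyword arguments that attach filter_query under the
    parameter name paired with the given search type."""
    if search_type == "approximate_search":
        return {"search_type": search_type, "boolean_filter": filter_query}
    # "painless_scripting" is an assumed identifier; see the note above.
    if search_type in ("script_scoring", "painless_scripting"):
        return {"search_type": search_type, "pre_filter": filter_query}
    raise ValueError(f"Unsupported search_type: {search_type!r}")


# Illustrative query body; the field name is taken from the question.
query = {"bool": {"must": [{"match": {"DocumentId": "some-id"}}]}}

approx = filter_kwargs("approximate_search", query)
scored = filter_kwargs("script_scoring", query)
print(sorted(approx))
print(sorted(scored))

# With a live index the result would be splatted into the call, e.g.:
# docs = vsearch.similarity_search_with_score(theQuery, k=3, **scored)
```

Keeping the choice in one place avoids passing pre_filter with an approximate search (or the reverse) without noticing.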
🤖
Hello,
Thank you for reaching out and providing a detailed explanation of your issue. It seems like you’re trying to filter your search results based on the DocumentId field in the metadata attribute.
From your description, it appears that you’re trying to use the pre_filter parameter in the similarity_search_with_score function. However, based on the LangChain repository, the similarity_search_with_score function does not seem to have a pre_filter parameter.
Instead, you might want to use the filter parameter to filter your search results. Here’s an example of how you can do this:
In this example, replace theDocId with the ID of the document you want to filter by, and replace theQuery with the query you want to search for. The similarity_search_with_score function will return a list of documents most similar to the query text and cosine distance in float for each, filtered by the specified DocumentId.
Please note that the filter dictionary should contain the field you want to filter by (in this case, 'metadata.DocumentId') and the value you want to filter for.
I hope this helps! If you have any other questions or if something is not clear, please let me know.
I finally got it working with the following format:
boolean_filter = {"bool": {"should": [{"match": {"metadata1": "value1"}}, {"match": {"metadata2": "value2"}}]}}
I got a lot of information from this post: https://stackoverflow.com/questions/21933787/elasticsearch-not-returning-results-for-terms-query-against-string-property.
The difference in field type (analyzed vs. not analyzed) is quite tricky. I switched the word “term” to “match” and it helped a lot. You will probably still need to play with the format a little (should/must/must_not), but hopefully this helps.
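The analyzed vs. not analyzed distinction can be illustrated without a cluster. The sketch below is a rough simulation only: toy_analyze is a crude approximation of a default text analyzer (lowercase, split on non-alphanumeric characters), not OpenSearch's real implementation, and term_matches and match_matches are hypothetical helpers. It shows why a term query carrying the full GUID finds nothing on an analyzed field while a match query does. The field name metadata.DocumentId comes from this thread; the .keyword subfield is an assumption that only holds if the index mapping defines one (as default dynamic mapping typically does).

```python
import re
from typing import List


def toy_analyze(text: str) -> List[str]:
    """Crude stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def term_matches(indexed_tokens: List[str], query: str) -> bool:
    """A term query compares the query string as-is against the indexed tokens."""
    return query in indexed_tokens


def match_matches(indexed_tokens: List[str], query: str) -> bool:
    """A match query analyzes the query first, then looks for any resulting token."""
    return any(tok in indexed_tokens for tok in toy_analyze(query))


doc_id = "{8B5D5890-0000-C1E8-B0B6-678C8509665D}"
indexed = toy_analyze(doc_id)

print(indexed)
print("term:", term_matches(indexed, doc_id))
print("match:", match_matches(indexed, doc_id))

# Two filter shapes that follow from this (field names are assumptions):
match_filter = {"bool": {"must": [{"match": {"metadata.DocumentId": doc_id}}]}}
term_filter = {"bool": {"filter": [{"term": {"metadata.DocumentId.keyword": doc_id}}]}}
```

One caveat with the match form: it matches on any shared token by default, so two GUIDs that both contain a segment such as 0000 could match each other. If the mapping has an unanalyzed keyword subfield, an exact term filter against it is the stricter option.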