langchain: Kernel dies for SVMRetriever with Huggingface Embeddings Model

System Info

langchain==0.0.245 python==3.9

Who can help?

@hw

Reproduction

I am trying to use a HuggingFace embeddings model with SVMRetriever as follows:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import SVMRetriever

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = SVMRetriever.from_texts(
    ["foo", "bar", "world", "hello", "foo bar"],
    embedding,
    k=4,
    relevancy_threshold=0.25,
)

The kernel looks busy for a while and then dies when I run the above snippet.

Expected behavior

I expect this to run without crashing and, once it completes, to be able to retrieve similar documents with: result = retriever.get_relevant_documents("foo")

About this issue

State: closed
Created a year ago
Comments: 16

Most upvoted comments

🤖

Sorry, I don’t know the answer to that.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] on Aug 15, 2023

🤖

Sure, I can help with that. Here’s how you can use the SVMRetriever with your document embeddings:

from langchain.retrievers import SVMRetriever

# Assuming `embedding` is your embeddings model (e.g. a HuggingFaceEmbeddings instance)
# and `texts` is your list of original documents
svm_retriever = SVMRetriever.from_texts(texts=texts, embeddings=embedding)

Once you have the svm_retriever instance, you can use it to retrieve similar documents through the public get_relevant_documents method. The _get_relevant_documents method is private, which means it is not intended to be called directly from outside the class.

Here’s how you can use it:

query = "your query here"
similar_documents = svm_retriever.get_relevant_documents(query)

This will return a list of similar documents to your query.

Please note that the from_texts method of SVMRetriever takes two main parameters: texts and embeddings. The texts parameter is a list of your original documents, and the embeddings parameter is the embeddings model used to embed them (for example a HuggingFaceEmbeddings instance), not a list of precomputed vectors.

I hope this helps! Let me know if you have any other questions.

dosubot[bot] on Aug 15, 2023

🤖

Hello,

Thank you for reaching out with your question. Based on the context provided, it seems you want to use the HuggingFaceEmbeddings class in LangChain with the feature-extraction task without using the HuggingFaceHub API.

The HuggingFaceEmbeddings class in LangChain uses the sentence_transformers package to compute embeddings. This package is specifically designed for computing sentence or document embeddings and does not support all models and tasks available in the transformers package. In particular, it does not support the feature-extraction task.

However, you can still compute embeddings for your documents and queries using the HuggingFaceEmbeddings class and then pass these embeddings to the SVMRetriever method. Here is an example of how you can do this:

from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

texts = ["This is the first document.", "This is the second document."]
doc_embeddings = hf.embed_documents(texts)

query = "This is the query."
query_embedding = hf.embed_query(query)

Please note that the SVMRetriever method is not shown in the provided context, so you’ll need to refer to the LangChain documentation or source code for the exact usage.

Regarding the HuggingFaceHubEmbeddings class, it does support the feature-extraction task as shown in the VALID_TASKS tuple and the task parameter of the HuggingFaceHubEmbeddings class. However, this class uses the HuggingFaceHub API, which you mentioned you want to avoid.

I hope this helps! If you have any further questions, please don’t hesitate to ask.

Best, Dosu

dosubot[bot] on Aug 15, 2023

🤖

Hello,

Thank you for reaching out and providing detailed information about your issue. Based on the information you’ve provided, it seems like your kernel is dying when trying to use the HuggingFace Embeddings model with the SVMRetriever method in LangChain.

There could be several reasons for this:

Unsupported Model: The HuggingFace model you’re trying to use might not be supported. As per the LangChain code, only models that start with “sentence-transformers” are supported. If you’re using a different model, it might cause the kernel to crash.
Unsupported Task: The task you’re trying to perform might not be supported. The only valid task as per the LangChain code is “feature-extraction”. If you’re trying to perform a different task, it might cause the kernel to crash.
Missing Dependencies: The HuggingFace Hub Python package or the sentence_transformers python package might not be installed. If either of these packages is not installed, an ImportError will be raised, which might cause the kernel to crash.
Missing API Token: The HuggingFace API token might not be set. If the API token is not set, the InferenceApi cannot be initialized, which might cause the kernel to crash.

To troubleshoot this issue, I would recommend the following steps:

Check the Model: Ensure that the model you’re using starts with “sentence-transformers”. If not, try using a model that does.
Check the Task: Ensure that the task you’re trying to perform is “feature-extraction”. If not, try performing this task instead.
Check Dependencies: Ensure that the HuggingFace Hub Python package and the sentence_transformers python package are installed. You can install them using pip:
```
pip install huggingface_hub sentence_transformers
```
Check API Token: Ensure that the HuggingFace API token is set. You can set it as an environment variable HUGGINGFACEHUB_API_TOKEN.
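
The dependency and token checks above can be scripted. The following is a minimal sketch, not part of LangChain: `check_environment` is a hypothetical helper name, and it only reports whether each package can be found and whether the environment variable is set. Note that the token is only relevant to the Hub-backed classes; a locally run sentence-transformers model does not need it.

```python
import importlib.util
import os


def check_environment(
    packages=("huggingface_hub", "sentence_transformers"),
    token_var="HUGGINGFACEHUB_API_TOKEN",
):
    """Hypothetical helper: report which packages are importable and
    whether the token environment variable is set (non-empty)."""
    # find_spec returns None when a top-level package cannot be located
    report = {name: importlib.util.find_spec(name) is not None for name in packages}
    report[token_var] = bool(os.environ.get(token_var))
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```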

If you’ve checked all of these and the issue still persists, it would be helpful if you could provide more information about your environment, such as the operating system you’re using, the amount of RAM available, and any error messages you’re seeing when the kernel dies.
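
One way to collect an error message when a kernel dies silently is to run the failing snippet in a separate Python process with the standard-library faulthandler enabled. The sketch below assumes nothing about LangChain; `run_isolated` is a hypothetical helper name, and the snippet to run is passed in as a string.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0):
    """Hypothetical helper: run `code` in a fresh interpreter so that a
    crash kills the child process rather than the notebook kernel."""
    proc = subprocess.run(
        # -X faulthandler dumps the Python traceback to stderr on a fatal signal
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    returncode, out, err = run_isolated("print('ok')")
    print(returncode, out.strip(), err.strip())
```

On POSIX systems, a negative return code means the child was terminated by that signal number (for example -11 for a segmentation fault), and the captured stderr then carries whatever traceback faulthandler was able to write.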

I hope this helps! Let me know if you have any other questions or if there’s anything else I can do to assist you.

Best, Dosu

dosubot[bot] on Aug 14, 2023

I faced the same issue using HuggingFaceEmbeddings and SVMRetriever. In my case the issue was with create_index in svm.py.

The ThreadPoolExecutor was somehow failing on executor.map(embeddings.embed_query, contexts), so I wrote my own version without a thread pool (I know, I know, I have yet to implement multithreading in it).

The idea is to build the index yourself, so you can just do:

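import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import SVMRetriever

embeddings = HuggingFaceEmbeddings()

def create_svm_index(docs, embeddings):
    # Embed each document sequentially instead of through a thread pool
    result = []
    for doc in docs:
        result.append(embeddings.embed_query(doc))
    return np.array(result)

# `docs` is your list of document strings
index = create_svm_index(docs, embeddings)
svm_retriever = SVMRetriever(index=index, texts=docs, embeddings=embeddings)

# instead of:
# svm_retriever = SVMRetriever.from_texts(texts=docs, embeddings=embeddings)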
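
To make the difference between the two index-building strategies concrete, here is a small self-contained sketch. It is an illustration only: `ToyEmbeddings` is a hypothetical stand-in that returns toy vectors, not a LangChain class, and the function names are made up for this example. The thread-pool variant mirrors the executor.map(embeddings.embed_query, contexts) pattern mentioned above, the sequential variant mirrors the workaround, and the batched variant assumes the model also exposes embed_documents, as HuggingFaceEmbeddings does.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in for an embeddings model (not a LangChain class)."""

    def embed_query(self, text: str) -> List[float]:
        # Deterministic toy vector: length, vowel count, space count
        vowels = sum(ch in "aeiou" for ch in text)
        return [float(len(text)), float(vowels), float(text.count(" "))]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(t) for t in texts]


def index_with_thread_pool(texts, embeddings):
    # One embed_query call per text, dispatched to worker threads
    with ThreadPoolExecutor() as executor:
        return list(executor.map(embeddings.embed_query, texts))


def index_sequential(texts, embeddings):
    # Same calls, made one after another on the calling thread
    return [embeddings.embed_query(t) for t in texts]


def index_batched(texts, embeddings):
    # A single call that lets the model embed all texts together
    return embeddings.embed_documents(texts)


if __name__ == "__main__":
    docs = ["foo", "bar", "world", "hello", "foo bar"]
    emb = ToyEmbeddings()
    assert index_with_thread_pool(docs, emb) == index_sequential(docs, emb)
    assert index_sequential(docs, emb) == index_batched(docs, emb)
```

All three variants produce the same vectors in the same order, so either the sequential or the batched result could be wrapped in np.array and passed as index= when the thread pool is the part that fails. Whether the thread pool is actually the cause on a given machine is not established by this thread.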

Abhinay1997 on Dec 5, 2023

@AjinkyaBankar Got it. Not sure what the issue is then 🤷‍♂️ The only suspect lines in the SVMRetriever code are in this coroutine code

devstein on Aug 17, 2023