langchain: Timeout when running Hugging Face LLMs for ConversationalRetrievalChain
I am getting the following error when trying to query from a ConversationalRetrievalChain using Hugging Face:
ValueError: Error raised by inference API: Model stabilityai/stablelm-tuned-alpha-3b time out
I can query the model with a simple LLMChain, but querying on documents seems to be the issue. Any ideas?
This is the code:
from langchain import HuggingFaceHub
from langchain.chains import ConversationalRetrievalChain

llm = HuggingFaceHub(repo_id="stabilityai/stablelm-tuned-alpha-3b", model_kwargs={"temperature": 0, "max_length": 64})
qa2 = ConversationalRetrievalChain.from_llm(
    llm, vectordb.as_retriever(search_kwargs={"k": 3}), return_source_documents=True)
chat_history = []
query = "How do I load users from a thumb drive."
result = qa2({"question": query, "chat_history": chat_history})

vectordb comes from Chroma.from_documents (using OpenAI embeddings on a custom PDF).
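For context, a minimal sketch of how `vectordb` is presumably built from that description; the loader choice and the PDF path are placeholders, not part of the original post:

```python
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load the custom PDF (path is hypothetical) and embed it with OpenAI embeddings
docs = PyPDFLoader("custom.pdf").load()
vectordb = Chroma.from_documents(docs, OpenAIEmbeddings())
```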
About this issue
- State: open
- Created a year ago
- Comments: 46 (5 by maintainers)
Commits related to this issue
- Update Hugging Face Hub notebook (#7236) Description: `flan-t5-xl` hangs, updated to `flan-t5-xxl`. Tested all stabilityai LLMs- all hang so removed from tutorial. Temperature > 0 to prevent uninte... — committed to langchain-ai/langchain by HashemAlsaket a year ago
Use this one: `google/flan-t5-xxl`, with a temperature above 0 (such as 0.5, 0.7, or 1).
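Applied to the snippet from the question, that suggestion might look like the sketch below; `vectordb` is the Chroma store from the original post, and the exact temperature value is illustrative:

```python
from langchain import HuggingFaceHub
from langchain.chains import ConversationalRetrievalChain

# Suggested workaround: a hosted model that responds, plus a non-zero temperature
llm = HuggingFaceHub(repo_id="google/flan-t5-xxl",
                     model_kwargs={"temperature": 0.7, "max_length": 64})
qa2 = ConversationalRetrievalChain.from_llm(
    llm, vectordb.as_retriever(search_kwargs={"k": 3}), return_source_documents=True)
```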
Hi everyone, maintainer of the `huggingface_hub` library here 👋

Langchain's current implementation relies on InferenceAPI. This is a free service from Hugging Face to help folks quickly test and prototype things using ML models hosted on the Hub. It is not specific to LLMs and is able to run a large variety of models from `transformers`, `diffusers`, `sentence-transformers`, … While this service is free for everyone, different rate limits apply to unregistered, registered, and PRO users. However, the execution time is supposed to be the same no matter the plan (i.e. either you get an HTTP 429 rate-limit error or the request is processed correctly).

That being said, InferenceAPI is NOT meant to be a production-ready product. As said before, its primary goal is to allow users to quickly test models. Only models of a "small-enough" size can be hosted on demand in the InferenceAPI infrastructure. In the context of LLMs, this constraint makes most models unavailable. We still support a manually curated list of first-class open-source LLMs (`bigcode/santacoder`, `OpenAssistant/oasst-sft-1-pythia-12b`, `tiiuae/falcon-7b-instruct`, `timdettmers/guanaco-33b-merged`, to name a few). The list of running LLMs can be found here. In the backend, LLMs are running on our text-generation-inference framework.

If one needs a production-ready environment, they would need to run on our Inference Endpoints solution. This is a paid service to run models (including LLMs) on a dedicated auto-scaling infrastructure managed by Hugging Face. Any model from the Hub can be loaded.

Now that the difference between InferenceAPI and Inference Endpoints is clearer, let's dig into this issue more specifically. Langchain depends on the `InferenceApi` client from `huggingface_hub`. This client will soon be deprecated in favor of `InferenceClient`. The newer client can handle InferenceAPI, Inference Endpoints, or even AWS SageMaker solutions. Its implementation is meant to be easier to use for end users, especially with automatic parsing of inputs/outputs for each task.

What I would suggest is to rewrite the `HuggingFaceHub(LLM)` class in langchain to use the new `text-generation` task. A migration guide can be found here. The advantages for the end user would be better handling of timeouts, better handling of inputs, and the possibility to switch from InferenceAPI to Inference Endpoints (or any URL-based solution) depending on the use case (prototyping vs. production-ready).

If anyone wants to pick up this topic, please let me know. I'd be happy to answer questions and review a PR if requested! 🤗
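As a rough illustration of that suggestion, a direct text-generation call through `InferenceClient` might look like the sketch below; the model id, token, prompt, and parameter values are placeholders, not the proposed langchain implementation:

```python
from huggingface_hub import InferenceClient

# Sketch only: call the text-generation task through the newer client.
# The client accepts an explicit timeout, which the legacy langchain path does not expose.
client = InferenceClient(model="tiiuae/falcon-7b-instruct", token="hf_...", timeout=120)
text = client.text_generation(
    "Summarize what Inference Endpoints are in one sentence.",
    max_new_tokens=64,
    temperature=0.7,
)
print(text)
```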
I also had similar errors with various flan LLMs. Adding `max_length` sorted out the timeout errors.
Hi! I have this problem too and I was able to solve it by increasing the chunk size on the model: llm = HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 0, "max_length": 512})
With chunk_size = 5000 in the VectorstoreIndexCreator() the model can run.
Depending on the LLM this length limit is different, so it is important to take care with these sizes. I recommend using other LLMs, or adjusting the chunk size to get 'stabilityai/stablelm-tuned-alpha-3b' to run.
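A minimal sketch of setting that chunk size via a text splitter passed to `VectorstoreIndexCreator`; the loader, splitter choice, and file path are assumptions for illustration:

```python
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter

# Larger chunks mean fewer, bigger documents get stuffed into the prompt;
# tune chunk_size to what the chosen LLM's length limit can handle.
index = VectorstoreIndexCreator(
    text_splitter=CharacterTextSplitter(chunk_size=5000, chunk_overlap=0)
).from_loaders([PyPDFLoader("custom.pdf")])
```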
You need to upgrade your Hugging Face account to the Pro version to host the large model for inference.
`google/flan-t5-base` works for the free account.
@MatousEibich The best way to test the 70B Llama 2 without setting anything up is to go to https://hf.co/chat/ and select the `meta-llama/Llama-2-70b-chat-hf` model 😃
EDIT: Llama 2 70B is now available to all PRO members through the API! You can integrate it with the `InferenceClient` like this:
I don't think this is related to `ConversationalRetrievalChain`, as I'm seeing the same issue when directly prompting a `HuggingFaceHub` object. For instance, when I prompt the model directly (see the sketch below), it hangs for a while and then returns the timeout error. I'll ping Hugging Face's API directly to check whether the issue comes from Hugging Face or `HuggingFaceHub`.
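A minimal sketch of that kind of direct call; the repo_id and prompt are illustrative:

```python
from langchain import HuggingFaceHub

# Any model that times out on the Inference API reproduces the same hang
llm = HuggingFaceHub(repo_id="stabilityai/stablelm-tuned-alpha-3b",
                     model_kwargs={"temperature": 0, "max_length": 64})
print(llm("What is the capital of France?"))
```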
ValueError: Error raised by inference API: Model *** time out
I got this error message with many models from HF. Is there any workaround for this issue? `langchain.llms.HuggingFaceHub()` doesn't seem to have a timeout param you can set.
This one worked out really fine! Thx a lot!!
@angel-luis Not familiar with the codebase here, but I'd be happy to review a PR if someone wants to fix it 😃 I described the main steps to follow from the Hugging Face side in this comment.
`google/flan-t5-xxl` is working for me, although I also had to set temperature to a minimum of 0.1 to get it to work.
llm = HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature": 0.1, "max_length": 64})
I have the same issue for a variety of models with various temperatures & max_length. I wonder if it has to do with HuggingFace limiting API calls from a single IP?
Yes. I got the same error:
ValueError: Error raised by inference API: Model google/flan-t5-xl time out
Tried other LLMs as well, same error.