langchain: Timeout when running Hugging Face LLMs for ConversationalRetrievalChain
I am getting the following error when trying to query from a ConversationalRetrievalChain using Hugging Face:
ValueError: Error raised by inference API: Model stabilityai/stablelm-tuned-alpha-3b time out
I can query the model with a simple LLMChain, but querying on documents seems to be the issue. Any ideas?
This is the code:
from langchain import HuggingFaceHub
from langchain.chains import ConversationalRetrievalChain

llm = HuggingFaceHub(repo_id="stabilityai/stablelm-tuned-alpha-3b", model_kwargs={"temperature": 0, "max_length": 64})
qa2 = ConversationalRetrievalChain.from_llm(
    llm, vectordb.as_retriever(search_kwargs={"k": 3}), return_source_documents=True)
chat_history = []
query = "How do I load users from a thumb drive."
result = qa2({"question": query, "chat_history": chat_history})

vectordb comes from Chroma.from_documents (using OpenAI embeddings on a custom PDF).
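For context, a minimal sketch of how `vectordb` is presumably built from that description; the loader choice and the PDF path are placeholders, not part of the original post:

```python
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load the custom PDF (path is hypothetical) and embed it with OpenAI embeddings
docs = PyPDFLoader("custom.pdf").load()
vectordb = Chroma.from_documents(docs, OpenAIEmbeddings())
```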
About this issue
- State: open
- Created a year ago
- Comments: 46 (5 by maintainers)
Commits related to this issue
- Update Hugging Face Hub notebook (#7236) Description: `flan-t5-xl` hangs, updated to `flan-t5-xxl`. Tested all stabilityai LLMs- all hang so removed from tutorial. Temperature > 0 to prevent uninte... — committed to langchain-ai/langchain by HashemAlsaket a year ago
Use this one: `google/flan-t5-xxl`, with a temperature above 0 (such as 0.5, 0.7, or 1).
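Applied to the snippet from the question, that suggestion might look like the sketch below; `vectordb` is the Chroma store from the original post, and the exact temperature value is illustrative:

```python
from langchain import HuggingFaceHub
from langchain.chains import ConversationalRetrievalChain

# Suggested workaround: a hosted model that responds, plus a non-zero temperature
llm = HuggingFaceHub(repo_id="google/flan-t5-xxl",
                     model_kwargs={"temperature": 0.7, "max_length": 64})
qa2 = ConversationalRetrievalChain.from_llm(
    llm, vectordb.as_retriever(search_kwargs={"k": 3}), return_source_documents=True)
```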
Hi everyone, maintainer of the `huggingface_hub` library here 👋

Langchain's current implementation relies on InferenceAPI. This is a free service from Hugging Face to help folks quickly test and prototype things using ML models hosted on the Hub. It is not specific to LLMs and is able to run a large variety of models from `transformers`, `diffusers`, `sentence-transformers`, … While this service is free for everyone, different rate limits apply to unregistered, registered, and PRO users. However, the execution time is supposed to be the same no matter the plan (i.e. either you get an HTTP 429 rate-limit error or the request is processed correctly).

That being said, InferenceAPI is NOT meant to be a production-ready product. As said before, its primary goal is to allow users to quickly test models. Only models of a "small-enough" size can be hosted on demand in the InferenceAPI infrastructure. In the context of LLMs, this constraint makes most models unavailable. We still support a manually curated list of first-class open-source LLMs (`bigcode/santacoder`, `OpenAssistant/oasst-sft-1-pythia-12b`, `tiiuae/falcon-7b-instruct`, `timdettmers/guanaco-33b-merged`, to name a few). The list of running LLMs can be found here. In the backend, LLMs are running on our text-generation-inference framework.

If one needs a production-ready environment, they would need to run on our Inference Endpoints solution. This is a paid service to run models (including LLMs) on a dedicated auto-scaling infrastructure managed by Hugging Face. Any model from the Hub can be loaded.

Now that the difference between InferenceAPI and Inference Endpoints is clearer, let's dig into this issue more specifically. Langchain depends on the `InferenceApi` client from `huggingface_hub`. This client will soon be deprecated in favor of `InferenceClient`. The newer client can handle InferenceAPI, Inference Endpoints, or even AWS SageMaker solutions. Its implementation is meant to be easier to use for end users, especially with automatic parsing of inputs/outputs for each task.

What I would suggest is to rewrite the `HuggingFaceHub(LLM)` class in langchain to use the new `text-generation` task. A migration guide can be found here. The advantages for the end user would be better handling of timeouts, better handling of inputs, and the possibility to switch from InferenceAPI to Inference Endpoints (or any URL-based solution) depending on the use case (prototyping vs. production-ready).

If anyone wants to pick up this topic, please let me know. I'd be happy to answer questions and review a PR if requested! 🤗
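As a rough illustration of that suggestion, a direct text-generation call through `InferenceClient` might look like the sketch below; the model id, token, prompt, and parameter values are placeholders, not the proposed langchain implementation:

```python
from huggingface_hub import InferenceClient

# Sketch only: call the text-generation task through the newer client.
# The client accepts an explicit timeout, which the legacy langchain path does not expose.
client = InferenceClient(model="tiiuae/falcon-7b-instruct", token="hf_...", timeout=120)
text = client.text_generation(
    "Summarize what Inference Endpoints are in one sentence.",
    max_new_tokens=64,
    temperature=0.7,
)
print(text)
```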
I also had similar errors with various flan LLMs. Adding `max_length` sorted out the timeout errors.
Hi! I have this problem too and I was able to solve it by increasing the chunk size on the model: llm = HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 0, "max_length": 512})
With chunk_size = 5000 in the VectorstoreIndexCreator() the model can run.
Depending on the LLM this length limit is different, so it is important to take care with these sizes. I recommend using other LLMs, or adjusting the chunk size to get 'stabilityai/stablelm-tuned-alpha-3b' to run.
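A minimal sketch of setting that chunk size via a text splitter passed to `VectorstoreIndexCreator`; the loader, splitter choice, and file path are assumptions for illustration:

```python
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter

# Larger chunks mean fewer, bigger documents get stuffed into the prompt;
# tune chunk_size to what the chosen LLM's length limit can handle.
index = VectorstoreIndexCreator(
    text_splitter=CharacterTextSplitter(chunk_size=5000, chunk_overlap=0)
).from_loaders([PyPDFLoader("custom.pdf")])
```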
You need to upgrade your Hugging Face account to the Pro version to host the large model for inference.
`google/flan-t5-base` works for the free account.
@MatousEibich The best way to test the 70B Llama 2 without setting anything up is to go to https://hf.co/chat/ and select the `meta-llama/Llama-2-70b-chat-hf` model 😃
EDIT: Llama 2 70B is now available to all PRO members through the API! You can integrate it with the `InferenceClient` like this:
I don't think this is related to `ConversationalRetrievalChain`, as I'm seeing the same issue when directly prompting a `HuggingFaceHub` object. For instance, when I prompt the model directly (see the sketch below), it hangs for a while and then returns the timeout error. I'll ping Hugging Face's API directly to check whether the issue comes from Hugging Face or `HuggingFaceHub`.
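A minimal sketch of that kind of direct call; the repo_id and prompt are illustrative:

```python
from langchain import HuggingFaceHub

# Any model that times out on the Inference API reproduces the same hang
llm = HuggingFaceHub(repo_id="stabilityai/stablelm-tuned-alpha-3b",
                     model_kwargs={"temperature": 0, "max_length": 64})
print(llm("What is the capital of France?"))
```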
ValueError: Error raised by inference API: Model *** time out
I got this error message with many models from HF. Is there any workaround for this issue? `langchain.llms.HuggingFaceHub()` doesn't seem to have a timeout param you can set.
This one worked out really fine! Thx a lot!!
@angel-luis Not familiar with the codebase here, but I'd be happy to review a PR if someone wants to fix it 😃 I described the main steps to follow from the Hugging Face side in this comment.
`google/flan-t5-xxl` is working for me, although I also had to set temperature to a minimum of 0.1 to get it to work.
llm = HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature": 0.1, "max_length": 64})
I have the same issue for a variety of models with various temperatures & max_length. I wonder if it has to do with HuggingFace limiting API calls from a single IP?
Yes. I got the same error:
ValueError: Error raised by inference API: Model google/flan-t5-xl time out
Tried other LLMs as well, same error.