langchain: Unable to run llama.cpp or GPT4All demos

I’m attempting to run both demos linked today but am running into issues. I’ve already migrated my GPT4All model.

When I run the llama.cpp demo, all of my CPU cores are pegged at 100% for a minute or so, and then it just exits without an error code or any output.

When I run the GPT4All demo I get the following error:

Traceback (most recent call last):
  File "/home/zetaphor/Code/langchain-demo/gpt4alldemo.py", line 12, in <module>
    llm = GPT4All(model_path="models/gpt4all-lora-quantized-new.bin")
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1102, in pydantic.main.validate_model
  File "/home/zetaphor/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain/llms/gpt4all.py", line 132, in validate_environment
    ggml_model=values["model"],
KeyError: 'model'

About this issue

  • State: closed
  • Created a year ago
  • Comments: 23 (6 by maintainers)

Most upvoted comments

Try changing the model_path param to model.
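
For example, a minimal sketch of the working call on the langchain version in this thread, using the model filename from the traceback above:

from langchain.llms import GPT4All

# The wrapper expects the path under "model"; passing it as "model_path"
# leaves "model" unset, which is what produces the KeyError above.
llm = GPT4All(model="models/gpt4all-lora-quantized-new.bin")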

I tried the following on Colab, but the last line never finishes… Does anyone have a clue?

from langchain.llms import GPT4All
from langchain import PromptTemplate, LLMChain

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm = GPT4All(model="{path_to_ggml}")
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)

@hershkoy absolutely, all you have to do is change the following two lines

First, update the llm import near the top of your file:

from langchain.llms import LlamaCpp

and then change where you instantiate the class:

llm = LlamaCpp(model_path="gpt4all-converted.bin", n_ctx=2048)

Let me know if that still gives you an error on your system

On the suggestion of someone in Discord I’m able to get output using the llama.cpp model; it looks like what we needed to do here was increase the token context to a much higher value. However, I am definitely seeing reduced performance compared to what I experience when running inference directly through llama.cpp.

import os
from langchain.memory import ConversationTokenBufferMemory
from langchain.llms.llamacpp import LlamaCpp
from langchain.agents import initialize_agent, AgentType


# The key change: raise n_ctx well above the 512-token default
custom_llm = LlamaCpp(model_path="models/ggml-vicuna-13b-4bit.bin", verbose=True,
                      n_threads=4, n_ctx=5000, temperature=0.05, repeat_penalty=1.22)
tools = []

memories = {}

question = "What is the meaning of life?"
unique_id = os.urandom(16).hex()
if unique_id not in memories:
    memories[unique_id] = ConversationTokenBufferMemory(
        memory_key="chat_history", llm=custom_llm, return_messages=True)
memory = memories[unique_id]
agent = initialize_agent(tools, llm=custom_llm, agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
                         verbose=True, memory=memory, max_retries=1)

response = agent.run(input=question)

print(response)

Specifically for llama.cpp, I think #2404 (comment) points to the issue being in the CallbackManager.

It’s likely that, due to the async nature of the callback manager, the “main” program exits before the chain returns.

To test this I added a sleep loop, but it also seems that the callback manager either isn’t being used with run or is faulty for this LLM wrapper.
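
Roughly the kind of sleep-loop test described above (just a sketch, reusing the llm_chain and question defined in the Colab snippet earlier in this thread):

import time

result = llm_chain.run(question)
print("run() returned:", repr(result))

# Keep the main program alive for a while to see whether any
# output ever arrives after run() returns.
for _ in range(60):
    time.sleep(1)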

This shows that streaming should be used (the streaming property is True by default): https://github.com/hwchase17/langchain/blob/master/langchain/llms/llamacpp.py#L217-L228

The CallbackManager should be in play as well; it would be worth stepping through this with a debugger: https://github.com/hwchase17/langchain/blob/master/langchain/llms/llamacpp.py#L271-L273

You don’t happen to have a minimal example at hand of LlamaCpp with the output streamed word by word? I only get the output streamed to the console but can’t write it into an object, and I don’t understand how to access the stream.
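
Not a full answer, but here is a minimal sketch of one way to collect the streamed tokens into a Python object. It assumes the callback_manager-style API from the llamacpp.py lines linked above (newer langchain versions take callbacks=[...] instead) and the gpt4all-converted.bin path used earlier in this thread:

from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp


class TokenCollector(StreamingStdOutCallbackHandler):
    """Collects each streamed token in addition to echoing it to stdout."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def on_llm_new_token(self, token, **kwargs):
        self.tokens.append(token)                   # keep the token
        super().on_llm_new_token(token, **kwargs)   # still print it live


collector = TokenCollector()
llm = LlamaCpp(
    model_path="gpt4all-converted.bin",
    n_ctx=2048,
    streaming=True,
    callback_manager=CallbackManager([collector]),
    verbose=True,
)

llm("Question: Name the planets in the solar system. Answer:")
full_text = "".join(collector.tokens)   # the whole streamed output as one string
print(full_text)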

@Zetaphor Correct, llama.cpp has set the default token context window at 512 for performance, which is also the default n_ctx value in langchain. You can set it at 2048 max, but this will slow down inference.

Just FYI, the slowdown in performance is a bug. It’s being investigated in ggerganov/llama.cpp#603. Inference should NOT slow down with increased context.

Hi @Zetaphor are you referring to this Llama demo?

I’m the author of the llama-cpp-python library, I’d be happy to help.

Can you give me an idea of what kind of processor you’re running and the length of your prompt?

Because llama.cpp is running inference on the CPU, it can take a while to process the initial prompt, and there are still some performance issues with certain CPU architectures. To rule that out, can you try running the same prompt through the examples in llama.cpp?
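
And since the langchain wrapper in this thread is built on the llama-cpp-python bindings, here is a hedged sketch of running the same prompt through llama-cpp-python directly, without langchain, as an alternative to the C++ examples (model path and thread count taken from the agent snippet earlier in this thread):

from llama_cpp import Llama

# Same model and thread count as the agent snippet above; n_ctx left at 2048
llm = Llama(model_path="models/ggml-vicuna-13b-4bit.bin", n_ctx=2048, n_threads=4)

out = llm("Question: What is the meaning of life? Answer:", max_tokens=128)
print(out["choices"][0]["text"])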