langchain: Unable to run llama.cpp or GPT4All demos
I’m attempting to run both demos linked today but am running into issues. I’ve already migrated my GPT4All model.
When I run the llama.cpp demo all of my CPU cores are pegged at 100% for a minute or so and then it just exits without an error code or output.
When I run the GPT4All demo I get the following error:
```
Traceback (most recent call last):
  File "/home/zetaphor/Code/langchain-demo/gpt4alldemo.py", line 12, in <module>
    llm = GPT4All(model_path="models/gpt4all-lora-quantized-new.bin")
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1102, in pydantic.main.validate_model
  File "/home/zetaphor/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain/llms/gpt4all.py", line 132, in validate_environment
    ggml_model=values["model"],
KeyError: 'model'
```
Try changing the `model_path` param to `model`.
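A minimal sketch of that change, using the model file from the traceback above:

```python
from langchain.llms import GPT4All

# Pass the weights path via `model` rather than `model_path`
llm = GPT4All(model="models/gpt4all-lora-quantized-new.bin")
```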
I tried the following on Colab, but the last line never finishes… Does anyone have a clue?
@hershkoy Absolutely, all you have to do is change the following two lines. First update the llm import near the top of your file, and then change where you instantiate the class (see the sketch below). Let me know if that still gives you an error on your system.
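A sketch of what those two changes might look like, assuming for illustration that the switch is from the GPT4All wrapper to LlamaCpp; the class names and model path here are assumptions, since the original snippet isn't shown:

```python
# Update the llm import near the top of the file
from langchain.llms import LlamaCpp  # previously: from langchain.llms import GPT4All

# ...and then change where the class is instantiated
llm = LlamaCpp(model_path="models/ggml-model-q4_0.bin")  # placeholder model path
```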
On the suggestion of someone in Discord I’m able to get output using the llama.cpp model; it looks like the thing we needed to do here was to increase the token context to a much higher value. However, I am definitely seeing reduced performance compared to what I experience when just running inference through llama.cpp.
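For reference, a hedged sketch of raising the context window on langchain's LlamaCpp wrapper; the model path is a placeholder, and 2048 is the maximum mentioned later in the thread:

```python
from langchain.llms import LlamaCpp

# Increase the token context window from the 512-token default
llm = LlamaCpp(
    model_path="models/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=2048,
)
```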
You don’t happen to have a minimal example at hand of using LlamaCpp and streaming the output, word for word? I am only getting the output streamed to the console, but I cannot write it into an object and I don’t understand how to access the stream.
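One possible approach, sketched under the assumption of an older langchain release where LLMs take a `callback_manager` (newer versions pass a `callbacks` list instead): subclass the stdout streaming handler and collect tokens into an object rather than printing them. All names other than the langchain classes are made up for illustration.

```python
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

class TokenCollector(StreamingStdOutCallbackHandler):
    """Collects streamed tokens into a list instead of writing them to stdout."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called once per generated token while the model is streaming
        self.tokens.append(token)

collector = TokenCollector()
llm = LlamaCpp(
    model_path="models/ggml-model-q4_0.bin",  # placeholder path
    callback_manager=CallbackManager([collector]),
    verbose=True,
)

llm("Q: Name the planets in the solar system. A:")
print("".join(collector.tokens))  # full text, assembled token by token
```

The call itself still returns the full completion; the handler just makes each token available as it arrives, so you can store or forward the stream however you like.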
Just FYI, the slowdown in performance is a bug. It’s being investigated in ggerganov/llama.cpp#603. Inference should NOT slow down with increased context.
@Zetaphor Correct, llama.cpp has set the default token context window at 512 for performance, which is also the default `n_ctx` value in langchain. You can set it to 2048 max, but this will slow down inference.

Hi @Zetaphor, are you referring to this Llama demo?
I’m the author of the `llama-cpp-python` library, and I’d be happy to help. Can you give me an idea of what kind of processor you’re running and the length of your prompt?
Because llama.cpp is running inference on the CPU, it can take a while to process the initial prompt, and there are still some performance issues with certain CPU architectures. To rule that out, can you try running the same prompt through the examples in llama.cpp?
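As an alternative to the C++ examples, a quick way to rule langchain out is to run the same prompt directly through the llama-cpp-python bindings; the model path and prompt below are placeholders:

```python
from llama_cpp import Llama

# Load the same GGML model that langchain's LlamaCpp wrapper points at
llm = Llama(model_path="models/ggml-model-q4_0.bin")  # placeholder path

# Run the same prompt used with langchain and compare timing and output
output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])
```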