llama-cpp-python: Incredibly slow response time
Hello. I am still new to llama-cpp and I was wondering whether it is normal for it to take an incredibly long time to respond to my prompt.
FYI, I assume it runs on my CPU; here are my specs:
- I have 16.0 GB of RAM
- I am using an AMD Ryzen 7 1700X Eight-Core Processor rated at 3.40 GHz
- Just in case, my GPU is a NVIDIA GeForce RTX 2070 SUPER.
Everything else seems to work fine and the model loads correctly (or at least it seems to). I did a first test using the code showcased in the README.md:
from llama_cpp import Llama
llm = Llama(model_path="models/7B/...")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)
The output it returned was what I expected (even though Uranus, Neptune and Pluto were missing), but the total time is extremely long: 1124707.08 ms, roughly 18 minutes.
I wrote this second script to try to narrow down what could be causing the insanely long response time, but I don't know what is going on:
from llama_cpp import Llama
import time

print("Model loading")
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin")

while True:
    prompt = input("Prompt> ")
    start_time = time.time()
    prompt = f"Q: {prompt} A: "
    print("Your prompt:", prompt, "Start time:", start_time)
    output = llm(prompt, max_tokens=1, stop=["Q:", "\n"], echo=True)
    print("Output:", output)
    print("End time:", time.time())
    print("--- Prompt reply duration: %s seconds ---" % (time.time() - start_time))
I may have done things wrong since I am still new to all of this, but does anyone have an idea of how I could speed up the process? I searched for solutions on Google, GitHub and various forums, but nothing seems to work.
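One generic thing to check in this situation (not confirmed as the cause in this thread) is whether llama-cpp-python is actually using all of the CPU's physical cores. A minimal sketch, where n_threads=8 is only an assumption matching the eight-core Ryzen above:

from llama_cpp import Llama

# Sketch only: n_threads=8 assumes the eight physical cores of the
# Ryzen 7 1700X mentioned above; adjust it for your own machine.
llm = Llama(
    model_path="./model/ggml-model-q4_0_new.bin",
    n_ctx=512,      # same context size as shown in the load log below
    n_threads=8,    # explicit thread count instead of the library default
    verbose=True,   # keep llama.cpp's own timing printout enabled
)

output = llm("Q: Name the planets in the solar system? A: ",
             max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output["choices"][0]["text"])

With verbose left on, llama.cpp should print its own timing summary (prompt evaluation vs. generation time) after each call, which gives a better idea of where the time goes than timing the whole call from Python.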
PS: For those interested in the CLI output when it loads the model:
llama_model_load: loading model from './model/ggml-model-q4_0_new.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './model/ggml-model-q4_0_new.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
I apologize in advance if my English doesn't always make sense; it is not my native language. Thanks in advance for the help, regards. 👋
About this issue
- State: closed
- Created a year ago
- Comments: 35 (6 by maintainers)
For what it's worth, I am also getting much slower generations compared to interacting with llama.cpp natively in the terminal.
It might just have been a regression in the base llama.cpp library, but who knows.
The difference seems to be minimal now.
Look at your memory and disk utilization. If your RAM is anywhere near 100%, you're going to have bad performance if the system is using swap to compensate for a lack of memory. If your disk utilization goes up as you run the executable, that's a sign that bad things are happening (a quick way to watch this is sketched below).
If it takes minutes to compute a response then something is definitely wrong with how things are running.
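A minimal way to watch for that while a prompt is being generated, run in a second terminal; this is just a sketch and assumes the third-party psutil package (pip install psutil), which is not part of llama-cpp-python:

import time

import psutil  # third-party package, assumed to be installed

# Print overall RAM and swap usage once per second so you can see
# whether the machine starts swapping while the model is generating.
while True:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM {vm.percent:.0f}% used, swap {sw.percent:.0f}% used")
    time.sleep(1)

If swap usage climbs while the model is answering, that would fit the kind of slowdown described above.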
It seems that way to me as well. It now takes just a few seconds instead of the 18 minutes I had before. I don't know what changed, but I am glad it got fixed. Thanks!
The latest version fixed my issues.