llama-cpp-python: Incredibly slow response time

Hello. I am still new to llama-cpp-python, and I was wondering whether it is normal for it to take an incredibly long time to respond to a prompt.

FYI, I am assuming it runs on my CPU (see the thread-count sketch after the list); here are my specs:

  • I have 16.0 GB of RAM
  • I am using an AMD Ryzen 7 1700X Eight-Core Processor rated at 3.40 GHz
  • Just in case, my GPU is an NVIDIA GeForce RTX 2070 SUPER.
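
One thing I could not easily tell is whether the library was actually using all of my cores. This is only a sketch of what I mean (n_threads is a Llama constructor parameter; the value 8 is just a guess matching the 1700X's physical core count):

from llama_cpp import Llama

# Pin the worker thread count explicitly; 8 matches the Ryzen 7 1700X's
# physical core count (hyper-threads rarely help for this workload).
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin", n_threads=8)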

Everything else seems to work fine; the model loads correctly (or at least it seems to). I did a first test using the code shown in the README.md:

from llama_cpp import Llama
llm = Llama(model_path="models/7B/...")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)

which returned this:

The output is what I expected (even though Uranus, Neptune and Pluto were missing), but the total time is extremely long: 1124707.08 ms, about 18 minutes.

I wrote this second script to try to narrow down what could be causing the insanely long response time, but I don't know what is going on.

from llama_cpp import Llama
import time

print("Model loading")
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin")

while True:
    prompt = input("Prompt> ")
    start_time = time.time()

    prompt = f"Q: {prompt} A: "
    print("Your prompt:", prompt, "Start time:", start_time)

    # max_tokens=1 keeps the measurement to (almost) pure prompt evaluation
    output = llm(prompt, max_tokens=1, stop=["Q:", "\n"], echo=True)
    end_time = time.time()  # capture once so both prints agree

    print("Output:", output)
    print("End time:", end_time)
    print("--- Prompt reply duration: %s seconds ---" % (end_time - start_time))

I may have done things wrong, since I am still new to all of this, but does anyone have an idea of how I could speed up the process? I searched for solutions on Google, GitHub and various forums, but nothing seems to work.

PS: For those interested, here is the CLI output when it loads the model:

llama_model_load: loading model from './model/ggml-model-q4_0_new.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './model/ggml-model-q4_0_new.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
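
One thing I notice in that last line is BLAS = 0, so my build has no BLAS backend, which (as far as I understand) mainly affects prompt processing speed. If it helps anyone reproduce this check, I believe the low-level binding exposes llama.cpp's system-info call; just a sketch, assuming llama_cpp.llama_print_system_info is available and returns bytes:

import llama_cpp

# Should return the same "AVX = 1 | ... | BLAS = 0 | ..." feature line
# that llama.cpp logs at load time.
print(llama_cpp.llama_print_system_info().decode())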

I apologize in advance if my English doesn't always make sense; it is not my native language. Thanks in advance for the help. Regards 👋

About this issue

  • State: closed
  • Created a year ago
  • Comments: 35 (6 by maintainers)

Most upvoted comments

For what it's worth, I am also getting much slower generation than when interacting with llama.cpp natively in the terminal.

It might have just been a regression in the base llama.cpp library, but who knows.

The difference seems to be minimal now.

Update: plain llama.cpp took almost 2.5 hours to respond to my prompt, while the Python binding took 18 minutes, so I am starting to think the problem is not the library but my computer. If any of you have an idea as to why it is so slow on my PC, I'd love to hear it.

Look at your memory and disk utilization. If your RAM is anywhere near 100%, you are going to have bad performance, because the OS will use swap to compensate for the lack of memory. If your disk utilization goes up as you run the executable, that is a sign the model is being paged to and from disk.

If it takes minutes to compute a response, then something is definitely wrong with how things are running.
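
If you want to check this from the same Python process, something like the sketch below can confirm whether the model is spilling into swap; it assumes the third-party psutil package is installed:

from llama_cpp import Llama
import psutil

def report_memory(label: str) -> None:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"{label}: RAM {vm.percent}% used, swap {sw.percent}% used")

report_memory("Before load")
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin")
report_memory("After load")

# If swap usage climbs here, generation will crawl: the weights are being
# paged in from disk on every pass over the layers.
llm("Q: Name the planets in the solar system? A: ", max_tokens=16)
report_memory("After one completion")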

It seems that way to me as well: it now takes just a few seconds instead of the 18 minutes I had before. I don't know what changed, but I am glad it got fixed. Thanks!

Latest version fixed my issues
