llama.cpp: Slower Response on large context size

Hi folks, this is not really an issue; I'm looking for suggestions, or maybe a discussion. I'm feeding the model a large input and offloading layers to the GPU. Here is my system output:

llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 9807.49 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 7660 MB
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 2 / 2 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 324, n_keep = 0
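
(For anyone trying to reproduce this: a command along these lines should give an equivalent setup. The model path and prompt file are placeholders, and the flags just mirror the parameters logged above.)

./main -m ./models/13B/ggml-model-q4_0.bin -ngl 40 -c 2048 -b 512 -n 324 -t 2 -f prompt.txt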

And here is the timing output:

llama_print_timings:        load time = 22627.53 ms
llama_print_timings:      sample time =   154.86 ms /   259 runs   (    0.60 ms per token)
llama_print_timings: prompt eval time = 21530.18 ms /  1024 tokens (   21.03 ms per token)
llama_print_timings:        eval time = 125551.93 ms /   258 runs   (  486.64 ms per token)
llama_print_timings:       total time = 159984.71 ms

It takes around 2 min 30 sec to complete the output. Is there any way to get it under, let's say, 1 min or 30 sec?
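
To put numbers on that, from the timings above (all figures from the log, just rearranged):

prompt eval: 21530 ms / 1024 tokens ≈ 21 ms per token
generation: 125552 ms / 258 tokens ≈ 487 ms per token
eval + generation ≈ 147 s

So getting under 60 s would need roughly a 2.5x overall speedup, and under 30 s roughly 5x.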

Thanks 😃

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16 (1 by maintainers)

Most upvoted comments

Could you share your parameters?

"here is my system output:"

the actual system line is missing 😃

@iamirulofficial

The overall time taken by the process, including eval + completion, is around 2 min 36 sec. I wanted to discuss how to get that under 30 to 45 sec (eval + gen).

You still didn’t answer the questions I asked.

If you’re just asking how to make it go faster in general, and not implying that you think there’s a problem or unexpected behavior in GGML or llama.cpp, then you basically have three options:

  1. Upgrade your hardware.
  2. Pay Mr. GG or other developers to work on optimizing stuff so it runs faster.
  3. Wait and see if the normal rate of progress/improvement (which is actually really fast in this project) hits your target eventually.

For the last two, though, you’re probably not going to see it go from 2.5 minutes to 30 seconds. I mean, you’re already running it fully on the GPU, and people have already done a lot of work to optimize performance. Expecting a 4x performance increase from software changes alone is probably pretty unrealistic.