llama.cpp: Speculative Decoding is slower than expected on A100
Thanks for the great project! I am benchmarking the performance of llama.cpp with speculative decoding.

- Model setting: draft model llama-160m, target model llama-7b.

When I benchmark on a Mac M1 chip, the results look great: speculative decoding increases the speed from ~12 tokens/s to ~16 tokens/s. However, the performance is not as good on an A100. Concretely, the speeds of the draft and target models are:
draft:

```
llama_print_timings: load time = 65.11 ms
llama_print_timings: sample time = 524.95 ms / 1 runs ( 524.95 ms per token, 1.90 tokens per second)
llama_print_timings: prompt eval time = 8.59 ms / 94 tokens ( 0.09 ms per token, 10946.78 tokens per second)
llama_print_timings: eval time = 322.80 ms / 216 runs ( 1.49 ms per token, 669.15 tokens per second)
llama_print_timings: total time = 2924.72 ms
```

target:

```
llama_print_timings: load time = 1144.77 ms
llama_print_timings: sample time = 4.02 ms / 259 runs ( 0.02 ms per token, 64411.84 tokens per second)
llama_print_timings: prompt eval time = 1939.02 ms / 351 tokens ( 5.52 ms per token, 181.02 tokens per second)
llama_print_timings: eval time = 13.19 ms / 1 runs ( 13.19 ms per token, 75.82 tokens per second)
llama_print_timings: total time = 2999.59 ms
```
I am using greedy decoding and have disabled all the heuristics (fixed `n_draft`, always propose `n_draft` tokens, no early stopping). My execution command is:
```bash
./build/bin/speculative \
    -ngl 1000 \
    -ngld 100 \
    -m /data/model/llama-7b/ggml-model-f16.gguf \
    -md /data/model/lama-160m/ggml-model-f16.gguf \
    -p "${prompt}" \
    -e --temp "-1" -n 256 -s 1 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5
```
When the token acceptance rate is 0.44, speculative decoding is actually slower (notice 50 tokens/s < 75 tokens/s):
```
encoded 94 tokens in 0.076 seconds, speed: 1231.914 t/s
decoded 108 tokens in 2.145 seconds, speed: 50.341 t/s

n_draft   = 5
n_predict = 108
n_drafted = 165
n_accept  = 74
accept    = 44.848%
```
However, based on the original speculative decoding paper, the expected speedup should be:

speedup = (1 - alpha^(gamma+1)) / [(1 - alpha) * (c * gamma + 1)]

where alpha is the token acceptance rate, gamma is the number of tokens proposed each step, and c is the ratio between the execution times of the draft and target models. In the example above, c is roughly 76/669 = 0.11.
Plugging in the numbers above, the expected speedup should be:

(1-0.44^6)/[(1-0.44)*(0.11*0.44+1)] = 1.69x

However, the benchmarking results show that it's actually 50/76 = 0.66x.
To debug this, I set the token acceptance rate to 100% by removing the `id==draft_id[i_dft]` check here. After doing this, I observe that the speed is ~90 tokens/s, which is a 90/76 = 1.18x speedup. However, this is much smaller than the calculation with the formula above (using 0.99 as the token acceptance rate instead of 1):

(1-0.99^6)/[(1-0.99)*(0.11*0.99+1)] = 5.27x
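For reference, a minimal Python sketch that evaluates this formula for given alpha, gamma, and c:

```python
# Expected speedup from the speculative decoding paper:
#   speedup = (1 - alpha^(gamma+1)) / ((1 - alpha) * (c * gamma + 1))
# alpha = token acceptance rate, gamma = draft length, c = draft/target time ratio.
# Note: gamma, not alpha, multiplies c in the denominator (see the first reply below).

def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (c * gamma + 1))

# Using the measurements above (c ~ 76/669 ~ 0.11, gamma = 5):
print(expected_speedup(0.44, 5, 0.11))  # ~1.14
print(expected_speedup(0.99, 5, 0.11))  # ~3.78
```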
I wonder which part of the speculative decoding implementation might cause the big overhead; any comments are highly appreciated! Thanks!
About this issue
- State: closed
- Created 8 months ago
- Reactions: 2
- Comments: 17 (10 by maintainers)
I might be missing something, but I think there is an error in the numbers plugged into the equation. In the first case, for example, it should be:

(1-0.44^6)/[(1-0.44)*(0.11*5+1)] = 1.14x

because `gamma` in the denominator is 5, not 0.44.

I did some more testing on a V100 16GB GPU using the same models. Here is my script to determine the theoretical speed-up according to the paper:
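A minimal sketch of this kind of estimate (the exact script may differ), assuming each step drafts `g` tokens at the draft generation speed `sd_tg`, verifies them in a single batch at the target's `st_pp` speed for batch size `g`, and keeps essentially the whole draft (the extra token sampled by the target on each verification is ignored):

```python
# Sketch: theoretical average speed in speculative mode and speed-up vs. the target alone.
# sd_tg : draft model token-generation speed (t/s)
# st_tg : target model token-generation speed (t/s)
# st_pp : target model batch ("prompt") processing speed at batch size g (t/s)
# g     : draft size
# Assumes (near-)full acceptance, so each step yields ~g tokens.

def theoretical_speedup(sd_tg: float, st_tg: float, st_pp: float, g: int) -> float:
    t_step = g / sd_tg + g / st_pp  # time per step: drafting + batched verification
    s_avg  = g / t_step             # average speculative speed (t/s)
    return s_avg / st_tg            # speed-up over regular target-only decoding

# Hypothetical example values, for illustration only:
# theoretical_speedup(sd_tg=600.0, st_tg=75.0, st_pp=1200.0, g=8)  # ~5.3 with these made-up numbers
```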
In `case 0` I run the following:

In `case 1` I run this:

I am using the branch in #3624; with `-np 1` it should be equivalent to `master`. I have applied the following patch to simulate a high acceptance rate (similar to your changes):

To determine `sd_tg`, `st_tg` and `st_pp` I run the following benchmark commands:

For `sd_tg` I pick the result from the first bench. For `st_tg` I pick the `tg 128` result from the second bench. For `st_pp` I pick the respective `pp g` value based on the value of `g`.

My understanding is that `s_avg` is the average speed in speculative mode, where we take into account both the speed of drafting and the speed of evaluating the drafted batch on the target model.

The theoretical results this way are as follows:

While the observed are:
Thank you for the detailed report - very useful information!

~~If you add the `-nommq` CLI arg, do the numbers improve?~~ (Edit: nvm, `-nommq` does not make a difference for F16 models)

I'll try to do the same test today and see if I can find the bottleneck.
Hi @ggerganov

I did some tests with the speculative example, and some quantizations appear to be fine when running 72% of the model in VRAM.

Running this setup in the speculative example with a 70B Q3_K_S model, I get a 1.3x speedup on all chat formats, offloading 57 layers to a 3090, with `top-k 1` and all layers of the 1.5T TinyLlama base Q4_K_M draft model: my 5.4 t/s goes up to 7.1 t/s. This is the same speedup factor I am getting with pure CPU speculative sampling with the model (1.3x, where 1.5 t/s goes to 2 t/s).

It's a general speedup and shouldn't be limited to coding examples; it works for instruct/chat formats.

I also tried exllamav2's speculative sampling examples and sampling parameters. These give a speedup of mostly 1.5x on the chat responses. When running the Quicksort code example, I get 2-3x. (EDIT: the total may be mixed with prompt processing)

- https://github.com/turboderp/exui with `-mode raw`
- https://github.com/turboderp/exllamav2/blob/master/examples/chat.py

I also played with the 7B fp16 Medusa model via their command-line interface, with default settings. This gave a consistent speedup of mostly 2x on the chat responses, compared to the original transformers.
@ggerganov Got it. I would like to see if I can help - this is an amazing project. Could you point me in the right direction in terms of source code?
With #3749 now merged, the batched decoding performance for F16 models has been significantly improved.
A few speculative decoding tests on A100 from today achieve 2-3x speed-up using Codellama 34B Target + Codellama 7B Q4_0 Draft:
https://twitter.com/ggerganov/status/1716727296269193702
Here are some examples:
@LiuXiaoxuanPKU Let us know if you attempt more A100 experiments, and make sure to use the latest version of `llama.cpp` to get the best performance. Hope the numbers match the expectation better now.

Yes, my assumption might be incorrect. At least in `llama.cpp`, the time `TN` to verify `N` tokens in a batch is currently not the same as the time `T1` for 1 token. Typically, we have `T1 < TN < N*T1`, except for very small values of `N` (e.g. ~2, 3).

Yes, `st_pp` is the prompt processing speed with a certain batch size, also known as the prefill phase. We assume that when we verify a draft of size `gamma`, the speed is the same as the prompt processing speed with batch size `gamma`.

If you perform sequential rejection with a per-token acceptance probability `p`, your total acceptance rate will not be equal to `p`. Here is a script to compute the acceptance rate for a given draft size `N` and acceptance probability `p`:
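A minimal sketch of such a script (the original may differ), using the fact that with sequential rejection token `i` of the draft survives only if all previous tokens were accepted:

```python
# Sketch: total acceptance rate for sequential rejection with per-token probability p.
# Token i of a draft of size N is accepted only if tokens 1..i-1 were also accepted,
# so it contributes p**i accepted tokens in expectation.

def total_acceptance_rate(N: int, p: float) -> float:
    expected_accepted = sum(p ** i for i in range(1, N + 1))
    return expected_accepted / N

print(total_acceptance_rate(8, 0.95))  # ~0.80
```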
So for a draft size of 8, you can use `p = 0.95` to get ~80% total acceptance rate.

Here is the diff on `master`:

Alternatively, you can do what I did in my previous comment - simply accept the first `0.8*N` tokens unconditionally:

Hi,

When the token acceptance rate is ~60%, I'm still confused about the performance on A100…
For the formula (1 - alpha^(gamma+1)) / [(1 - alpha) * (c * gamma + 1)], the paper makes the assumption that, for the target model, the time to verify `gamma+1` tokens is the same as the time to generate a single token. Therefore, `c` is the ratio between the time for a single run of the draft model and the time for a single run of the target model. I think your way of calculating it ("My understanding is that s_avg is the average speed in speculative mode where we take into account the speed both for drafting and evaluating the drafted batch on the target model.") will underestimate the speedup a bit, but I guess it's good for now.

I want to confirm that my understanding of the variables is correct: `st_tg` is the speed of the target model in the generation phase, and `st_pp` is the speed of the target model in the prompt phase.

Could you demonstrate how you set the token acceptance rate to 80%? I tried generating random values between 0 and 1 and accepting the token when the random value is < 0.8 (still modifying the condition here), but it does not strictly fix the token acceptance rate to 0.8.
Thanks for the correction - yeah, I plugged in the wrong numbers; your calculation is correct.

I will also try to benchmark on V100 today/tomorrow and will let you know the numbers. Thanks for the detailed experiment!