llama.cpp: Speculative Decoding is slower than expected on A100

Thanks for the great project! I am benchmarking the performance of llama.cpp with speculative decoding.

  • Model setting: draft model llama-160m, target model llama-7b. When I benchmark on a Mac M1 chip, the results look great: speculative decoding increases the speed from ~12 tokens/s to ~16 tokens/s. However, the performance is not as good on A100. Concretely, the speeds of the draft and target models are:
draft:

llama_print_timings:        load time =      65.11 ms
llama_print_timings:      sample time =     524.95 ms /     1 runs   (  524.95 ms per token,     1.90 tokens per second)
llama_print_timings: prompt eval time =       8.59 ms /    94 tokens (    0.09 ms per token, 10946.78 tokens per second)
llama_print_timings:        eval time =     322.80 ms /   216 runs   (    1.49 ms per token,   669.15 tokens per second)
llama_print_timings:       total time =    2924.72 ms

target:

llama_print_timings:        load time =    1144.77 ms
llama_print_timings:      sample time =       4.02 ms /   259 runs   (    0.02 ms per token, 64411.84 tokens per second)
llama_print_timings: prompt eval time =    1939.02 ms /   351 tokens (    5.52 ms per token,   181.02 tokens per second)
llama_print_timings:        eval time =      13.19 ms /     1 runs   (   13.19 ms per token,    75.82 tokens per second)
llama_print_timings:       total time =    2999.59 ms

I am using greedy decoding and disabling all the heuristics (fixed n_draft, always propose n_draft tokens, and no early stopping). My execution command is:

./build/bin/speculative \
-ngl 1000 \
-ngld 100 \
-m /data/model/llama-7b/ggml-model-f16.gguf \
-md /data/model/lama-160m/ggml-model-f16.gguf \
-p "${prompt}" \
-e --temp "-1" -n 256 -s 1 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5

When the token acceptance rate is 0.44, speculative decoding is actually slower (note 50 tokens/s < 75 tokens/s):

encoded   94 tokens in    0.076 seconds, speed: 1231.914 t/s
decoded  108 tokens in    2.145 seconds, speed:   50.341 t/s

n_draft   = 5
n_predict = 108
n_drafted = 165
n_accept  = 74
accept    = 44.848%

However, based on the original speculative decoding paper, the expected speedup is (1 - alpha^(gamma+1)) / [(1 - alpha) * (c*gamma + 1)], where alpha is the token acceptance rate, gamma is the number of tokens proposed each step, and c is the ratio between the execution times of the draft and target models. In the example above, c is roughly 76/669 = 0.11. Plugging in the numbers above, the expected speedup should be (1-0.44^6)/[(1-0.44)*(0.11*0.44+1)] = 1.69x. However, the benchmarking results show that it is actually 50/76 = 0.66x.

To debug this, I set the token acceptance rate to 100% by removing the id == draft_id[i_dft] check here. After doing this, I observe that the speed is ~90 tokens/s, which gives a 90/76 = 1.18x speedup. However, this is much smaller than the calculation with the formula above (I use 0.99 as the token acceptance rate instead of 1): (1-0.99^6)/[(1-0.99)*(0.11*0.99+1)] = 5.27x.

I wonder which part of the speculative decoding implementation might cause such a big overhead; any comments are highly appreciated! Thanks!


Most upvoted comments

I might be missing something, but I think there is an error in the numbers substituted into the equation. In the first case, for example, it should be:

(1-0.44^6)/[(1-0.44)*(0.11*5+1)]= 1.14x

because gamma in the denominator is 5, not 0.44.
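
To make the two substitutions explicit, a quick check in Python (just a sketch, using the alpha = 0.44, c = 0.11, gamma = 5 values from the report above):

alpha, c, gamma = 0.44, 0.11, 5

# substitution from the original post (gamma replaced by alpha in the denominator)
wrong = (1 - alpha**(gamma + 1)) / ((1 - alpha) * (c * alpha + 1))

# corrected substitution (gamma = 5 in the denominator)
right = (1 - alpha**(gamma + 1)) / ((1 - alpha) * (c * gamma + 1))

print(round(wrong, 2), round(right, 2))  # 1.69 1.14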

I did some more testing on a V100 16GB GPU using the same models. Here is my script to determine the theoretical speed-up according to the paper:

import numpy as np

# case 0, alpha = 0.8, gamma = 5
sd_tg = 619.4 # draft model text-generation speed (t/s)
st_tg = 52.5  # target model text-generation speed (t/s)
st_pp = 121.7 # target model prompt-processing speed at batch size 5 (pp 5)

s_avg = (sd_tg + st_pp)/2

a = 0.80
c = st_tg/s_avg
g = 5

speed = (1 - a**(g + 1))/((1 - a)*(c*g + 1))

print(speed)

# case 1, alpha = 0.875, gamma = 8
sd_tg = 625   # draft model text-generation speed (t/s)
st_tg = 52.5  # target model text-generation speed (t/s)
st_pp = 194.6 # target model prompt-processing speed at batch size 8 (pp 8)

s_avg = (sd_tg + st_pp)/2

a = 0.875
c = st_tg/s_avg
g = 8

speed = (1 - a**(g + 1))/((1 - a)*(c*g + 1))

print(speed)

In case 0 I run the following:

make -j && ./bin/speculative -ngl 1000 -ngld 100 -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -md /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p "${prompt}" -e --temp -1 -n 256 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5 -np 1

encoded   58 tokens in    0.111 seconds, speed:  521.414 t/s
decoded  261 tokens in    2.967 seconds, speed:   87.954 t/s

n_draft   = 5
n_predict = 261
n_drafted = 260
n_accept  = 208
accept    = 80.000%

draft:

llama_print_timings:        load time =     125.62 ms
llama_print_timings:      sample time =       9.58 ms /   260 runs   (    0.04 ms per token, 27128.55 tokens per second)
llama_print_timings: prompt eval time =      10.95 ms /    58 tokens (    0.19 ms per token,  5296.80 tokens per second)
llama_print_timings:        eval time =     480.14 ms /   261 runs   (    1.84 ms per token,   543.59 tokens per second)
llama_print_timings:       total time =    3079.20 ms

target:

llama_print_timings:        load time =    2827.23 ms
llama_print_timings:      sample time =       9.84 ms /   261 runs   (    0.04 ms per token, 26513.61 tokens per second)
llama_print_timings: prompt eval time =    2387.92 ms /   369 tokens (    6.47 ms per token,   154.53 tokens per second)
llama_print_timings:        eval time =      19.31 ms /     1 runs   (   19.31 ms per token,    51.80 tokens per second)
llama_print_timings:       total time =    3219.78 ms

In case 1 I run this:

make -j && ./bin/speculative -ngl 1000 -ngld 100 -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -md /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p "${prompt}" -e --temp -1 -n 256 --repeat-last-n 0 --repeat-penalty 1.0 --draft 8 -np 1

encoded   58 tokens in    0.110 seconds, speed:  525.272 t/s
decoded  257 tokens in    2.083 seconds, speed:  123.394 t/s

n_draft   = 8
n_predict = 257
n_drafted = 256
n_accept  = 224
accept    = 87.500%

draft:

llama_print_timings:        load time =     125.64 ms
llama_print_timings:      sample time =       9.42 ms /   256 runs   (    0.04 ms per token, 27190.65 tokens per second)
llama_print_timings: prompt eval time =      10.97 ms /    58 tokens (    0.19 ms per token,  5286.66 tokens per second)
llama_print_timings:        eval time =     446.25 ms /   257 runs   (    1.74 ms per token,   575.91 tokens per second)
llama_print_timings:       total time =    2193.53 ms

target:

llama_print_timings:        load time =    2770.13 ms
llama_print_timings:      sample time =      11.02 ms /   257 runs   (    0.04 ms per token, 23323.35 tokens per second)
llama_print_timings: prompt eval time =    1543.14 ms /   345 tokens (    4.47 ms per token,   223.57 tokens per second)
llama_print_timings:        eval time =      19.56 ms /     1 runs   (   19.56 ms per token,    51.12 tokens per second)
llama_print_timings:       total time =    2334.47 ms

I am using the branch in #3624; with -np 1 it should be equivalent to master. I have applied the following patch to simulate a high acceptance rate (similar to your changes):

--- a/examples/speculative/speculative.cpp
+++ b/examples/speculative/speculative.cpp
@@ -174,7 +174,7 @@ int main(int argc, char ** argv) {
                         continue;
                     }
 
-                    if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+                    if (i_dft < (int) drafts[s].tokens.size() - 1) {
                         LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
 
                         s_keep = s;
@@ -273,11 +273,11 @@ int main(int argc, char ** argv) {
                 }
 
                 // TODO: make this configurable
-                if (cur_p[0].p < 0.4) {
-                    LOG("stopping drafting for seq %3d, probability too low: %.3f < 2*%.3f\n", s, cur_p[0].p, cur_p[1].p);
-                    drafts[s].drafting = false;
-                    continue;
-                }
+                //if (cur_p[0].p < 0.4) {
+                //    LOG("stopping drafting for seq %3d, probability too low: %.3f < 2*%.3f\n", s, cur_p[0].p, cur_p[1].p);
+                //    drafts[s].drafting = false;
+                //    continue;
+                //}
 
                 std::vector<int> sa(1, s);

To determine sd_tg, st_tg and st_pp I run the following benchmark commands:

# draft model text-generation bench
./bin/llama-bench -m /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p 0 -n 128 -ngl 99

  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama ?B mostly F16            | 309.82 MiB |   162.42 M | CUDA       |  99 | tg 128     |   619.43 ± 20.03 |
# target model PP and TG bench
./bin/llama-bench -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -p 1,2,3,4,5,6,7,8,64,128,256,512 -n 128 -ngl 99

  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 1       |     30.72 ± 3.88 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 2       |     51.22 ± 0.43 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 3       |     75.92 ± 0.64 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 4       |    101.66 ± 0.29 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 5       |    121.74 ± 0.55 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 6       |    146.37 ± 0.35 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 7       |    163.65 ± 0.46 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 8       |    194.58 ± 0.72 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 64      |   899.08 ± 23.64 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 128     |  1708.83 ± 56.78 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 256     |  2515.20 ± 10.46 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | pp 512     |   2866.48 ± 1.45 |
| llama 7B mostly F16            |  12.55 GiB |     6.74 B | CUDA       |  99 | tg 128     |     52.49 ± 0.13 |

For sd_tg I pick the result from the first bench.

For st_tg I pick the tg 128 result from the second bench. For st_pp I pick the respective pp result based on the value of g (i.e. pp 5 for g = 5, pp 8 for g = 8).

My understanding is that s_avg is the average speed in speculative mode, where we take into account both the drafting speed and the speed of evaluating the drafted batch on the target model.
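
As a rough sanity check of the intermediate values for case 0 (same formula as in the script above; the variable names are just for illustration):

# case 0 intermediate values (t/s numbers from the benches above)
sd_tg, st_tg, st_pp = 619.4, 52.5, 121.7
s_avg = (sd_tg + st_pp) / 2               # ~370.6 t/s
c = st_tg / s_avg                         # ~0.142
a, g = 0.80, 5
print((1 - a**(g + 1)) / ((1 - a) * (c * g + 1)))  # ~2.16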

The theoretical results this way are as follows:

python3 speed.py
2.1594861448542773
2.7629832057356403

While the observed are:

87.954 / 52.49 = 1.675633453991236
123.394 / 52.49 = 2.350809678033911

Thank you for the detailed report - very useful information!

~~If you add the -nommq CLI arg, do the numbers improve?~~ (Edit: nvm, -nommq does not make a difference for F16 models)

I’ll try to do the same test today and see if I can find the bottleneck.

Hi @ggerganov

I did some tests with the speculative example, and some quantizations appear to be fine when running 72% of the model in VRAM.

Running this setup in the speculative example with a 70B Q3_K_S target, I get a 1.3x speedup on all chat formats, offloading 57 layers to a 3090, with top-k 1 and all layers of the 1.5T TinyLlama base Q4_K_M draft model offloaded. My 5.4 t/s goes up to 7.1 t/s.

This is the same speedup factor I am getting with pure CPU speculative sampling on this model (1.3x, where 1.5 t/s goes to 2 t/s).

It’s a general speedup and isn’t limited to coding examples; it works for instruct/chat formats.

I also tried exllamav2’s speculative sampling examples and sampling parameters. These give a speedup of mostly 1.5x on the chat responses. When running the Quicksort code example, I get 2-3x. (EDIT: the total may be mixed with prompt processing)

I also played with the 7B fp16 Medusa model via their command-line interface, with default settings. This gave a consistent speedup of mostly 2x on the chat responses, compared to the original transformers implementation.

@ggerganov Got it. I would like to see if I can help. This is an amazing project. If you would be so kind as to point me in the right direction in terms of source code.

With #3749 now merged, the batched decoding performance for F16 models has been significantly improved.

A few speculative decoding tests on A100 from today achieve 2-3x speed-up using Codellama 34B Target + Codellama 7B Q4_0 Draft:

https://twitter.com/ggerganov/status/1716727296269193702

Here are some examples:

LLAMA_CUBLAS=1 make -j && ./speculative -m ./models/codellama-34b/ggml-model-f16.gguf -md ./models/codellama-7b/ggml-model-f16.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -ngl 999 -ngld 999 -t 4 -n 512 -c 4096 -s 21 --draft 16 -np 1 --temp 0.0

 # Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:

# Time complexity: O(V^2)
# Space complexity: O(V)

def dijkstra(graph, start):
    distances = {vertex: float('inf') for vertex in graph}
    previous_vertices = {vertex: None for vertex in graph}
    distances[start] = 0
    vertices = list(graph.keys())

    while len(vertices) > 0:
        current_vertex = min(vertices, key=lambda vertex: distances[vertex])
        vertices.remove(current_vertex)

        if distances[current_vertex] == float('inf'):
            break

        for neighbor in graph[current_vertex]:
            distance = distances[current_vertex] + graph[current_vertex][neighbor]

            if distance < distances[neighbor]:
                distances[neighbor] = distance
                previous_vertices[neighbor] = current_vertex

    return distances, previous_vertices

encoded   25 tokens in    0.129 seconds, speed:  193.979 t/s
decoded  238 tokens in    6.033 seconds, speed:   39.453 t/s

n_draft   = 16
n_predict = 238
n_drafted = 323
n_accept  = 215
accept    = 66.563%

draft:

llama_print_timings:        load time =    2186.21 ms
llama_print_timings:      sample time =      45.08 ms /   326 runs   (    0.14 ms per token,  7231.91 tokens per second)
llama_print_timings: prompt eval time =      19.28 ms /    25 tokens (    0.77 ms per token,  1296.88 tokens per second)
llama_print_timings:        eval time =    4599.92 ms /   346 runs   (   13.29 ms per token,    75.22 tokens per second)
llama_print_timings:       total time =    6161.42 ms

target:

llama_print_timings:        load time =    6532.70 ms
llama_print_timings:      sample time =      28.60 ms /   238 runs   (    0.12 ms per token,  8320.81 tokens per second)
llama_print_timings: prompt eval time =    1301.73 ms /   369 tokens (    3.53 ms per token,   283.47 tokens per second)
llama_print_timings:        eval time =      50.88 ms /     1 runs   (   50.88 ms per token,    19.65 tokens per second)
llama_print_timings:       total time =    8365.12 ms

---

LLAMA_CUBLAS=1 make -j && ./speculative -m ./models/codellama-34b/ggml-model-f16.gguf -md ./models/codellama-7b/ggml-model-q4_0.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -ngl 999 -ngld 999 -t 4 -n 512 -c 4096 -s 21 --draft 16 -np 1 --temp 0.0

 # Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:

# Time complexity: O(V^2)
# Space complexity: O(V)

def dijkstra(graph, start):
    distances = {vertex: float('inf') for vertex in graph}
    previous_vertices = {vertex: None for vertex in graph}
    distances[start] = 0
    vertices = list(graph.keys())

    while len(vertices) > 0:
        current_vertex = min(vertices, key=lambda vertex: distances[vertex])
        vertices.remove(current_vertex)

        if distances[current_vertex] == float('inf'):
            break

        for neighbor in graph[current_vertex]:
            distance = distances[current_vertex] + graph[current_vertex][neighbor]

            if distance < distances[neighbor]:
                distances[neighbor] = distance
                previous_vertices[neighbor] = current_vertex

    return distances, previous_vertices

encoded   25 tokens in    0.199 seconds, speed:  125.459 t/s
decoded  238 tokens in    4.270 seconds, speed:   55.736 t/s

n_draft   = 16
n_predict = 238
n_drafted = 365
n_accept  = 214
accept    = 58.630%

draft:

llama_print_timings:        load time =    1195.22 ms
llama_print_timings:      sample time =      49.97 ms /   366 runs   (    0.14 ms per token,  7323.95 tokens per second)
llama_print_timings: prompt eval time =      89.48 ms /    25 tokens (    3.58 ms per token,   279.41 tokens per second)
llama_print_timings:        eval time =    2766.19 ms /   389 runs   (    7.11 ms per token,   140.63 tokens per second)
llama_print_timings:       total time =    4469.67 ms

target:

llama_print_timings:        load time =    6555.17 ms
llama_print_timings:      sample time =      28.34 ms /   238 runs   (    0.12 ms per token,  8397.14 tokens per second)
llama_print_timings: prompt eval time =    1361.73 ms /   412 tokens (    3.31 ms per token,   302.56 tokens per second)
llama_print_timings:        eval time =      51.10 ms /     1 runs   (   51.10 ms per token,    19.57 tokens per second)
llama_print_timings:       total time =    5682.48 ms

---

LLAMA_CUBLAS=1 make -j && ./speculative -m ./models/codellama-34b/ggml-model-f16.gguf -md ./models/codellama-7b/ggml-model-q4_0.gguf -p "// Below is an implementation in C++ of an algorithm that finds the convex hull of a set of 2D points:\n\n" -e -ngl 999 -ngld 999 -t 4 -n 512 -c 4096 -s 21 --draft 16 -np 1 --temp 0.0

 // Below is an implementation in C++ of an algorithm that finds the convex hull of a set of 2D points:

// https://en.wikipedia.org/wiki/Graham_scan

#include <iostream>
#include <vector>
#include <algorithm>
#include <cmath>
using namespace std;

struct Point {
    double x, y;
};

double cross(const Point &o, const Point &a, const Point &b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

bool cmp(const Point &a, const Point &b) {
    if (a.x != b.x) return a.x < b.x;
    return a.y < b.y;
}

vector<Point> convex_hull(vector<Point> points) {
    int n = points.size(), k = 0;
    vector<Point> hull(2 * n);
    sort(points.begin(), points.end(), cmp);
    for (int i = 0; i < n; hull[k++] = points[i++]) {
        while (k >= 2 && cross(hull[k - 2], hull[k - 1], points[i]) <= 0) k--;
    }
    for (int i = n - 2, t = k + 1; i >= 0; hull[k++] = points[i--]) {
        while (k >= t && cross(hull[k - 2], hull[k - 1], points[i]) <= 0) k--;
    }
    return vector<Point>(hull.begin(), hull.begin() + k - (k > 1));
}

int main() {
    int n; cin >> n;
    vector<Point> points(n);
    for (int i = 0; i < n; ++i) {
        cin >> points[i].x >> points[i].y;
    }
    auto hull = convex_hull(points);
    cout << "The convex hull has " << hull.size() << " points\n";
}

encoded   29 tokens in    0.200 seconds, speed:  144.745 t/s
decoded  510 tokens in    7.867 seconds, speed:   64.824 t/s

n_draft   = 16
n_predict = 510
n_drafted = 642
n_accept  = 467
accept    = 72.741%

draft:

llama_print_timings:        load time =    1163.29 ms
llama_print_timings:      sample time =      87.45 ms /   644 runs   (    0.14 ms per token,  7363.96 tokens per second)
llama_print_timings: prompt eval time =      89.61 ms /    29 tokens (    3.09 ms per token,   323.64 tokens per second)
llama_print_timings:        eval time =    5088.07 ms /   685 runs   (    7.43 ms per token,   134.63 tokens per second)
llama_print_timings:       total time =    8067.89 ms

target:

llama_print_timings:        load time =    6548.10 ms
llama_print_timings:      sample time =      60.25 ms /   510 runs   (    0.12 ms per token,  8464.17 tokens per second)
llama_print_timings: prompt eval time =    2405.81 ms /   711 tokens (    3.38 ms per token,   295.53 tokens per second)
llama_print_timings:        eval time =     103.98 ms /     2 runs   (   51.99 ms per token,    19.23 tokens per second)
llama_print_timings:       total time =    9248.23 ms

---

LLAMA_CUBLAS=1 make -j && ./speculative -m ./models/codellama-34b/ggml-model-f16.gguf -md ./models/codellama-7b/ggml-model-f16.gguf -p "// Below is an implementation in C++ of an algorithm that finds the convex hull of a set of 2D points:\n\n" -e -ngl 999 -ngld 999 -t 4 -n 512 -c 4096 -s 21 --draft 16 -np 1 --temp 0.0

 // Below is an implementation in C++ of an algorithm that finds the convex hull of a set of 2D points:

// https://en.wikipedia.org/wiki/Graham_scan

#include <iostream>
#include <vector>
#include <algorithm>
#include <cmath>
using namespace std;

struct Point {
    double x, y;
};

double cross(const Point &o, const Point &a, const Point &b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

bool cmp(const Point &a, const Point &b) {
    if (a.x != b.x) return a.x < b.x;
    return a.y < b.y;
}

vector<Point> convex_hull(vector<Point> points) {
    int n = points.size(), k = 0;
    vector<Point> hull(2 * n);
    sort(points.begin(), points.end(), cmp);
    for (int i = 0; i < n; hull[k++] = points[i++]) {
        while (k >= 2 && cross(hull[k - 2], hull[k - 1], points[i]) <= 0) k--;
    }
    for (int i = n - 2, t = k + 1; i >= 0; hull[k++] = points[i--]) {
        while (k >= t && cross(hull[k - 2], hull[k - 1], points[i]) <= 0) k--;
    }
    return vector<Point>(hull.begin(), hull.begin() + k - (k > 0));
}

int main() {
    int n; cin >> n;
    vector<Point> points(n);
    for (int i = 0; i < n; ++i) {
        cin >> points[i].x >> points[i].y;
    }
    auto hull = convex_hull(points);
    cout << "The convex hull has " << hull.size() << " points\n";
}

encoded   29 tokens in    0.132 seconds, speed:  219.056 t/s
decoded  510 tokens in   13.127 seconds, speed:   38.850 t/s

n_draft   = 16
n_predict = 510
n_drafted = 703
n_accept  = 465
accept    = 66.145%

draft:

llama_print_timings:        load time =    1860.53 ms
llama_print_timings:      sample time =      95.55 ms /   704 runs   (    0.14 ms per token,  7367.72 tokens per second)
llama_print_timings: prompt eval time =      19.90 ms /    29 tokens (    0.69 ms per token,  1457.43 tokens per second)
llama_print_timings:        eval time =   10197.04 ms /   748 runs   (   13.63 ms per token,    73.35 tokens per second)
llama_print_timings:       total time =   13260.31 ms

target:

llama_print_timings:        load time =    6642.24 ms
llama_print_timings:      sample time =      60.96 ms /   510 runs   (    0.12 ms per token,  8365.87 tokens per second)
llama_print_timings: prompt eval time =    2588.94 ms /   775 tokens (    3.34 ms per token,   299.35 tokens per second)
llama_print_timings:        eval time =      50.84 ms /     1 runs   (   50.84 ms per token,    19.67 tokens per second)
llama_print_timings:       total time =   15140.22 ms

@LiuXiaoxuanPKU Let us know if you attempt more A100 experiments, and make sure to use the latest version of llama.cpp to get the best performance. Hope the numbers match the expectations better now.

  1. Yes, my assumption might be incorrect. At least in llama.cpp, the time TN to verify N tokens in a batch is currently not the same as the time T1 for 1 token. Typically we have T1 < TN < N*T1, except for very small values of N (e.g. ~2, 3). For example, from the V100 bench above: T1 ≈ 1/52.49 ≈ 19 ms (tg 128), while verifying a batch of 5 takes 5/121.74 ≈ 41 ms (pp 5), well below 5*T1 ≈ 95 ms.

  2. Yes. st_pp is the prompt-processing speed at a certain batch size (also known as the prefill phase). We assume that when we verify a draft of size gamma, the speed is the same as the prompt-processing speed with batch size gamma.

  3. If you perform a sequential rejection with a per-token acceptance probability p, your total acceptance rate will not be equal to p. Here is a script to compute the expected acceptance rate for a given draft size N and per-token acceptance probability p:

# p - probability to accept a token
#
# probability to accept M tokens from a draft of N:
#
# M   probability
# 0   (1-p)
# 1   (1-p)*p
# 2   (1-p)*p^2
# ...
# N-1 (1-p)*p^(N-1)
# N   p^N
#
# expectation:
#
# E[X] = 0*(1-p) + 1*p*(1-p) + 2*p^2*(1-p) + 3*p^3*(1-p) + ... + (N-1)*p^(N-1)*(1-p) + N*p^N
#

import numpy as np
import sys

N = int(sys.argv[1])
p = float(sys.argv[2])

print("N = ", N)
print("p = ", p)

E = 0

for i in range(N):
    E += i * p**i * (1-p)

E += N * p**N

print("E = ", round(E, 2), " (", round(100*(E/N), 2), "% )")

So for a draft size of 8, you can use p = 0.95 to get ~80% total acceptance rate:

$ ▶ python3 expect.py 8 0.95
N =  8
p =  0.95
E =  6.4  ( 79.94 % )
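
For what it's worth, the loop above has a closed form, which can serve as a quick cross-check of the script's output:

# closed form of the same expectation: E = sum_{k=1..N} p^k = p * (1 - p^N) / (1 - p)
N, p = 8, 0.95
E = p * (1 - p**N) / (1 - p)
print(round(E, 2), round(100 * E / N, 2))  # 6.4 79.94 - matches the loop above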

Here is the diff on master:

--- a/examples/speculative/speculative.cpp
+++ b/examples/speculative/speculative.cpp
@@ -8,6 +8,10 @@
 #include <string>
 #include <vector>
 
+static float frand() {
+    return (float) rand() / RAND_MAX;
+}
+
 struct seq_draft {
     bool active   = false;
     bool drafting = false;
@@ -37,7 +41,7 @@ int main(int argc, char ** argv) {
     const int n_seq_dft = params.n_parallel;
 
     // TODO: make this configurable
-    const float p_accept = 0.80f;
+    const float p_accept = -1.0f; // always draft n_draft tokens
     const float p_split  = 0.10f;
 
 #ifndef LOG_DISABLE_LOGS
@@ -178,7 +182,7 @@ int main(int argc, char ** argv) {
                         continue;
                     }
 
-                    if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+                    if (i_dft < (int) drafts[s].tokens.size() && frand() < 0.95) { // use the python script to find the value that will give you the desired acceptance rate
                         LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
 
                         s_keep = s;

Alternatively, you can do what I did in my previous comment and simply accept the first 0.8*N tokens unconditionally:

    const int n_seq_dft = params.n_parallel;
 
     // TODO: make this configurable
-    const float p_accept = 0.80f;
+    const float p_accept = -1.0f; // always draft n_draft tokens
     const float p_split  = 0.10f;
 
 #ifndef LOG_DISABLE_LOGS
@@ -178,7 +182,7 @@ int main(int argc, char ** argv) {
                         continue;
                     }
 
-                    if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+                    if (i_dft < 0.8*(int) drafts[s].tokens.size()) {
                         LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
 
                         s_keep = s;

Hi

  1. I benchmarked on V100 and A100. My numbers on V100 are promising. Concretely, the speed (generation phase) in tokens/s of the draft and target models:

| GPU  | Draft (160M) | Target (7B) |
| ---- | -----------: | ----------: |
| A100 |          616 |          74 |
| V100 |          630 |          51 |

When the token acceptance rate is ~60%:

| GPU  | Speculative decoding with greedy sampling (t/s) | Speedup |
| ---- | ----------------------------------------------: | ------: |
| A100 |                                              63 |    0.85 |
| V100 |                                              69 |    1.35 |

I’m still confused about the performance on A100…

  1. For the formula (1 - alpha^(gamma+1)) / [(1 - alpha) * (c*gamma + 1)]: the paper makes the assumption that, for the target model, the time to verify gamma+1 tokens is the same as the time to generate a single token. Therefore, c is the ratio between the time for a single run of the draft model and the time for a single run of the target model. I think your way of calculating it ("My understanding is that s_avg is the average speed in speculative mode where we take into account the speed both for drafting and evaluating the drafted batch on the target model.") will underestimate the speedup a bit, but I guess it's good for now (see the quick comparison after this list).

  2. I want to confirm that my understanding of the variables is correct: st_tg is the speed of the target model's generation phase, and st_pp is the speed of the target model's prompt phase.

  3. Could you demonstrate how you set the token acceptance rate to 80%? I tried generating random values between 0 and 1 and accepting the token when the random value is < 0.8 (still modifying the condition here), but that cannot strictly fix the token acceptance rate to 0.8.
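
As a quick comparison of the two ways of defining c for case 0 (a rough sketch using the V100 numbers quoted above; the variable names are just for illustration):

# case 0: alpha = 0.80, gamma = 5, numbers from the V100 benches above
a, g = 0.80, 5
sd_tg, st_tg, st_pp = 619.4, 52.5, 121.7

c_paper = st_tg / sd_tg                  # paper's definition: draft run time / target run time (~0.085)
c_avg   = st_tg / ((sd_tg + st_pp) / 2)  # s_avg-based definition from above (~0.142)

def speedup(c):
    return (1 - a**(g + 1)) / ((1 - a) * (c * g + 1))

print(round(speedup(c_paper), 2), round(speedup(c_avg), 2))  # ~2.59 vs ~2.16 (observed: ~1.68)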

Thanks for the correction. Yeah, I plugged in the wrong numbers; your calculation is correct.

I will also try to benchmark on V100 today or tomorrow and will let you know the numbers. Thanks for the detailed experiment!