llama.cpp: New kv_cache API insufficient to restore model state
I may be doing something wrong or misunderstanding the purpose of the kv_cache API, but I believe the recent PR #685 by @chrfalch, which added the ability to get / set the kv_cache, is still insufficient to restore the state of the model, even when resetting external model state such as last_n_tokens_data and n_past.

Here is a minimal example:
#include "llama.h"
#include <vector>
#include <iostream>
using namespace std;
int main() {
// init
auto params = llama_context_default_params();
auto ctx = llama_init_from_file("../../models/ggml-model.bin", params);
auto tokens = vector<llama_token>(params.n_ctx);
auto prompt = "The quick brown fox";
auto n_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
// evaluate prompt
llama_eval(ctx, tokens.data(), n_tokens, 0, 12);
auto last_n_tokens_size = 64;
auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_tokens);
auto n_past = n_tokens;
// save state
auto kv_cache_size = llama_get_kv_cache_size(ctx);
auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
auto kv_cache = llama_get_kv_cache(ctx);
auto kv_cache_copy = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);
auto n_past_copy = n_past;
auto last_n_tokens_data_copy = vector<llama_token>(last_n_tokens_data);
// first run
cout << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
last_n_tokens_size,
1,
1.0,
0.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
llama_eval(ctx, &next_token, 1, n_past, 12);
n_past += 1;
}
cout << endl;
//
// restore state
llama_set_kv_cache(ctx, kv_cache_copy.data(), kv_cache_size, kv_cache_token_count);
last_n_tokens_data = last_n_tokens_data_copy;
n_past = n_past_copy;
//
// second run
cout << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
last_n_tokens_size,
1,
1.0,
0.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
llama_eval(ctx, &next_token, 1, n_past, 12);
n_past += 1;
}
cout << endl;
//
return 0;
}
I’d expect the following output:

```
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
```

But instead I get:

```
The quick brown fox jumps over the lazy dog
The quick brown fox.
The quick brown fo
```
Which implies the model is still generating from the end of the first run.
About this issue
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 23 (9 by maintainers)
@edp1096 Here is your example adapted to work with llama_copy_state_data & llama_set_state_data. The main difference is that you need to allocate memory before retrieving state data with llama_copy_state_data. The reason for this is that the random number generator state, logits, embeddings and kv cache are not in one shared memory block, so simply returning a single pointer is not easily possible. Restoring the first generated token for segfault-free sampling after loading is not necessary anymore.
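A minimal sketch of what that adaptation looks like, reusing the variables from the example at the top of this issue (llama_get_state_size is assumed here as the sizing counterpart of llama_copy_state_data):

```cpp
// save: the caller allocates the buffer, llama.cpp fills it
const size_t state_size = llama_get_state_size(ctx);
std::vector<uint8_t> state_mem(state_size);
llama_copy_state_data(ctx, state_mem.data());   // rng state, logits, embeddings and kv cache

// ... first generation run ...

// restore: a context with the same parameters is assumed;
// external bookkeeping is still the caller's responsibility
llama_set_state_data(ctx, state_mem.data());
n_past             = n_past_copy;
last_n_tokens_data = last_n_tokens_data_copy;
```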
Ah sorry - I forgot to mention there is now a new interface for saving / loading the llama state:
https://github.com/ggerganov/llama.cpp/pull/1105
I think you should try to use the new functions (llama_copy_state_data / llama_set_state_data); the old kv_cache interface (llama_get_kv_cache, llama_set_kv_cache, etc.) will likely be removed at some point if the above works.
@s2kjn93h
Yes, that is the expected behaviour. The seed is used when initializing the random number generator with llama_init_from_file. The state of the random number generator is then saved by llama_copy_state_data and will be restored with llama_set_state_data, so that the sampling results remain consistent.
When a different seed is set for a new context and the state is then loaded with llama_set_state_data, the random number generator will be in the state from when llama_copy_state_data was called, i.e. from the previous run.
If you want to generate other numbers after loading the llama state, i.e. to sample different tokens than the saved state would have, you can call llama_sample_top_p_top_k and discard the sampled token, as many times as you wish.
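For example, reusing the variables from the example at the top of this issue (the sampling parameters here are just placeholders):

```cpp
// advance the random number generator by sampling and throwing the tokens away
for (int i = 0; i < 3; i++) {
    llama_token discarded = llama_sample_top_p_top_k(
        ctx,
        last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
        last_n_tokens_size,
        40,     // top_k
        0.95f,  // top_p
        0.80f,  // temp
        1.10f   // repeat_penalty
    );
    (void) discarded;  // intentionally unused
}
```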
It could be beneficial to have API functions for more precise control in the future. In the meantime, we can gain greater control by directly altering the memory allocated for the llama state.
The code block for llama_copy_state_data demonstrates how the random number generator state is written to memory: https://github.com/ggerganov/llama.cpp/blob/9b0a4d421459f4e5e1af735c9784c3247b379025/llama.cpp#L2116-L2129

Here is an example of initializing the random number generator with seed = 42 * 1337:
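A sketch of the idea as a small helper (the helper name is made up; the layout at the front of the buffer, a size field followed by the serialized std::mt19937 stream, is assumed from the linked llama_copy_state_data code):

```cpp
#include "llama.h"
#include <cstring>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Overwrite the serialized RNG inside the saved llama state with a newly seeded one.
static void reseed_rng_via_state(llama_context * ctx, unsigned seed) {
    std::vector<uint8_t> state_mem(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, state_mem.data());

    std::mt19937 rng(seed);
    std::stringstream rng_ss;
    rng_ss << rng;                          // serialize the generator the way llama.cpp does
    const std::string rng_str = rng_ss.str();

    // assumed layout at the start of the buffer: size_t length, then the serialized stream
    uint8_t * p = state_mem.data();
    const size_t rng_size = rng_str.size();
    std::memcpy(p, &rng_size, sizeof(rng_size));
    p += sizeof(rng_size);
    std::memcpy(p, rng_str.data(), rng_str.size());

    llama_set_state_data(ctx, state_mem.data());  // push the modified state back
}

// usage: reseed_rng_via_state(ctx, 42 * 1337);
```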
Hmm, just looking at the code, it seems like everything should be initialized. Will take a deeper look later if this problem remains unsolved.
Yes, kv_self.buf contains more stuff than just the tensors. Reading and writing kv_self.k/v.data as binary will work as long as the context length and the KV floating point type are exactly the same. If this were exposed from llama.h then for most applications it would be sufficient, IMHO.

Sure, you may serialize and deserialize your structure to a byte array. You should keep in mind that pointers have to be serialized correctly. If you are experiencing issues with memory bugs, try address sanitizers (or valgrind --tool=memcheck, but it is slower).

I tried with something like this:
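A sketch in that spirit (the helper name and signature are illustrative rather than the exact code; inside llama.cpp it would be handed kv_self.k and kv_self.v):

```cpp
#include "ggml.h"
#include <cstdint>
#include <cstring>

// Raw-copy the k/v tensor bytes into a caller-provided buffer.
// Returns the number of bytes required; copies only if dst is large enough.
// A byte-for-byte copy like this is only valid for restoring when n_ctx and
// the KV floating point type are exactly the same.
static size_t copy_kv_tensors(const struct ggml_tensor * k, const struct ggml_tensor * v,
                              uint8_t * dst, size_t dst_size) {
    const size_t k_size = ggml_nbytes(k);
    const size_t v_size = ggml_nbytes(v);
    if (dst == NULL || dst_size < k_size + v_size) {
        return k_size + v_size;   // lets the caller query the required size first
    }
    memcpy(dst,          k->data, k_size);
    memcpy(dst + k_size, v->data, v_size);
    return k_size + v_size;
}
```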
This will copy the data to the user’s buffer, which needs to have sufficient space; that is also why there are additional functions to query the size. If the full context is needed, then it would be simpler. Also, since ggml_cpy can change data types, it may be possible to let the user extract only f32 or f16; this code only gives whatever format is currently used by the model.
It is correct that the PR does not implement this - but it describes that the last tokens etc. are needed to save the full state 😃 I just wanted to add the missing API needed to implement a prompt-saving mechanism.
Whoops, sorry, I just realized you obviously still need to eval the prompt tokens again. Here is the working version for future reference.
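A sketch of that working version, continuing the minimal example from the top of this issue; the assumed change is simply re-evaluating the prompt tokens after restoring the cache, before the second sampling loop:

```cpp
// restore state (as before)
llama_set_kv_cache(ctx, kv_cache_copy.data(), kv_cache_size, kv_cache_token_count);
last_n_tokens_data = last_n_tokens_data_copy;
n_past = n_past_copy;

// eval the prompt tokens again so the logits (which are not part of the kv_cache)
// are populated before sampling
llama_eval(ctx, tokens.data(), n_tokens, 0, 12);

// ... then run the second llama_sample_top_p_top_k / llama_eval loop exactly as above ...
```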
It works great for me! Thank you @xaedes !
@ggerganov I believe the issue is that llama_sample_top_p_top_k is expecting the logits but they’re not being saved and restored with this kv_cache approach. Adding a check here and running that first example I gave seems to reveal the issue: https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L1493

As you can see, n_logits is going to just be the length of the vocab and logits will be a size-0 vector, causing an illegal memory access and the resulting segfault.

@ggerganov fantastic, I can confirm that example 2 from this comment does work, however the first example still causes a segfault. I assume that’s because some buffers are being accessed in sample that are only initialised on the first eval.
@abetlen and all
I think this commit should fix the issue: https://github.com/ggerganov/llama.cpp/commit/8687c1f2581d059cd5b6a9502f89bd343566062a
A great test is to give the model a prompt saying its name, and then the test would be to ask it “What’s your name” - if it responds with the correct name everything works.
Here is how I have implemented saving the “state” of the model:
- llama_get_kv_cache_size
- llama_get_kv_cache
- llama_get_kv_cache_token_count
- n_past
- last_n_tokens and its size

This is all you need. After you have eval’ed the prompt you do the above steps and save the results. Then to restore you do the opposite - and test it all with the AI name prompt trick.
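For illustration, the saved pieces could be bundled into a plain struct (the struct itself is just a sketch, not from this comment):

```cpp
#include "llama.h"
#include <cstdint>
#include <vector>

struct llama_saved_state {
    std::vector<uint8_t>     kv_cache;        // llama_get_kv_cache(ctx), llama_get_kv_cache_size(ctx) bytes
    int                      kv_token_count;  // llama_get_kv_cache_token_count(ctx)
    int                      n_past;          // number of tokens already eval'ed
    std::vector<llama_token> last_n_tokens;   // repetition-penalty window and its size
};
```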
@ivanstepanovftw do you have an example? I’ve tried not eval’ing, but in that case, even with n_past saved, the model fails to generate the same output (just random generation).

I don’t think you need to eval the initial prompt, because you would want to avoid this.
@chrfalch sorry to bug you again on this one but I think I’m missing something.
From my understanding based on your response, you should be able to save the internal state to disk, assuming you also save n_past and last_n_tokens; however, I’m still not able to do this correctly / in a way that reduces processing time once the model is reloaded. Here is what I’m doing:
- llama_init_from_file the model.
- Tokenize a prompt (e.g. the quick brown fox jumps).
- llama_eval the prompt, set n_past to the number of prompt tokens, and start to fill last_n_tokens_data with the prompt tokens.
- Save kv_cache, kv_cache_size, kv_cache_token_count, n_past and last_n_tokens.
- Generate tokens in a llama_sample_top_p_top_k / llama_eval loop (e.g. over the lazy dog).
- llama_free the context.
- llama_init_from_file again.
- Call llama_set_kv_cache with the saved values from above.
- Set n_past and last_n_tokens to the saved values.
- Generate tokens in a llama_sample_top_p_top_k / llama_eval loop (e.g. over the lazy dog) starting from the saved values of n_past and last_n_tokens.

I would now expect to get the same output based on the original prompt (e.g. over the lazy dog), but it seems that the model is not taking this into account and instead I get back a random response.

Appreciate any help on this one, cheers.
EDIT: And just to clarify, if I call eval after restoring the kv_cache as I did above it doesn’t seem to reduce processing time.