llama.cpp: New kv_cache API insufficient to restore model state
I may be doing something wrong or misunderstanding the purpose of the kv_cache API, but I believe the recent PR #685 by @chrfalch, which added the ability to get / set the kv_cache, is still insufficient to restore the state of the model, even when resetting external model state such as last_n_tokens_data and n_past.

Here is a minimal example:
#include "llama.h"
#include <vector>
#include <iostream>
using namespace std;
int main() {
// init
auto params = llama_context_default_params();
auto ctx = llama_init_from_file("../../models/ggml-model.bin", params);
auto tokens = vector<llama_token>(params.n_ctx);
auto prompt = "The quick brown fox";
auto n_tokens = llama_tokenize(ctx, prompt, tokens.data(), tokens.size(), true);
// evaluate prompt
llama_eval(ctx, tokens.data(), n_tokens, 0, 12);
auto last_n_tokens_size = 64;
auto last_n_tokens_data = vector<llama_token>(last_n_tokens_size, 0);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_tokens);
auto n_past = n_tokens;
// save state
auto kv_cache_size = llama_get_kv_cache_size(ctx);
auto kv_cache_token_count = llama_get_kv_cache_token_count(ctx);
auto kv_cache = llama_get_kv_cache(ctx);
auto kv_cache_copy = vector<uint8_t>(kv_cache, kv_cache + kv_cache_size);
auto n_past_copy = n_past;
auto last_n_tokens_data_copy = vector<llama_token>(last_n_tokens_data);
// first run
cout << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
last_n_tokens_size,
1,
1.0,
0.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
llama_eval(ctx, &next_token, 1, n_past, 12);
n_past += 1;
}
cout << endl;
//
// restore state
llama_set_kv_cache(ctx, kv_cache_copy.data(), kv_cache_size, kv_cache_token_count);
last_n_tokens_data = last_n_tokens_data_copy;
n_past = n_past_copy;
//
// second run
cout << prompt;
for (auto i = 0; i < 6; i++) {
auto next_token = llama_sample_top_p_top_k(
ctx,
last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
last_n_tokens_size,
1,
1.0,
0.0,
1.1
);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
cout << next_token_str;
llama_eval(ctx, &next_token, 1, n_past, 12);
n_past += 1;
}
cout << endl;
//
return 0;
}
I’d expect the following output:

```
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
```

But instead I get:

```
The quick brown fox jumps over the lazy dog
The quick brown fox.
The quick brown fo
```
Which implies the model is still generating from the end of the first run.
About this issue
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 23 (9 by maintainers)
@edp1096 Here is your example adapted to work with llama_copy_state_data & llama_set_state_data. The main difference is that you need to allocate memory before retrieving state data with llama_copy_state_data. The reason for this is that the random number generator state, logits, embeddings and kv cache are not in one shared memory block, so simply returning a single pointer is not easily possible. Restoring the first generated token for segfault-free sampling after loading is not necessary anymore.
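A minimal sketch of what that adaptation looks like, reusing the variables from the example at the top of this issue (llama_get_state_size is assumed here as the sizing counterpart of llama_copy_state_data):

```cpp
// save: the caller allocates the buffer, llama.cpp fills it
const size_t state_size = llama_get_state_size(ctx);
std::vector<uint8_t> state_mem(state_size);
llama_copy_state_data(ctx, state_mem.data());   // rng state, logits, embeddings and kv cache

// ... first generation run ...

// restore: a context with the same parameters is assumed;
// external bookkeeping is still the caller's responsibility
llama_set_state_data(ctx, state_mem.data());
n_past             = n_past_copy;
last_n_tokens_data = last_n_tokens_data_copy;
```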
Ah sorry - I forgot to mention there is now a new interface for saving / loading the llama state:
https://github.com/ggerganov/llama.cpp/pull/1105
I think you should try to use the new functions (llama_copy_state_data / llama_set_state_data); the old kv_cache interface (llama_get_kv_cache, llama_set_kv_cache, etc.) will likely be removed at some point if the above works.
@s2kjn93h
Yes, that is the expected behaviour. The seed is used when initializing the random number generator with llama_init_from_file. The state of the random number generator is then saved by llama_copy_state_data and will be restored with llama_set_state_data, so that the sampling results remain consistent.
When a different seed is set for a new context and the state is then loaded with llama_set_state_data, the random number generator will be in the state from when llama_copy_state_data was called, i.e. from the previous run.
If you want to generate other numbers after loading the llama state, i.e. to sample different tokens than the saved state would have, you can call llama_sample_top_p_top_k and discard the sampled token, as many times as you wish.
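For example, reusing the variables from the example at the top of this issue (the sampling parameters here are just placeholders):

```cpp
// advance the random number generator by sampling and throwing the tokens away
for (int i = 0; i < 3; i++) {
    llama_token discarded = llama_sample_top_p_top_k(
        ctx,
        last_n_tokens_data.data() + last_n_tokens_data.size() - n_past,
        last_n_tokens_size,
        40,     // top_k
        0.95f,  // top_p
        0.80f,  // temp
        1.10f   // repeat_penalty
    );
    (void) discarded;  // intentionally unused
}
```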
It could be beneficial to have API functions for more precise control in the future. In the meantime, we can gain greater control by directly altering the memory allocated for the llama state.
The code block for llama_copy_state_data demonstrates how the random number generator state is written to memory: https://github.com/ggerganov/llama.cpp/blob/9b0a4d421459f4e5e1af735c9784c3247b379025/llama.cpp#L2116-L2129

Here is an example of initializing the random number generator with seed = 42 * 1337:
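A sketch of the idea as a small helper (the helper name is made up; the layout at the front of the buffer, a size field followed by the serialized std::mt19937 stream, is assumed from the linked llama_copy_state_data code):

```cpp
#include "llama.h"
#include <cstring>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Overwrite the serialized RNG inside the saved llama state with a newly seeded one.
static void reseed_rng_via_state(llama_context * ctx, unsigned seed) {
    std::vector<uint8_t> state_mem(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, state_mem.data());

    std::mt19937 rng(seed);
    std::stringstream rng_ss;
    rng_ss << rng;                          // serialize the generator the way llama.cpp does
    const std::string rng_str = rng_ss.str();

    // assumed layout at the start of the buffer: size_t length, then the serialized stream
    uint8_t * p = state_mem.data();
    const size_t rng_size = rng_str.size();
    std::memcpy(p, &rng_size, sizeof(rng_size));
    p += sizeof(rng_size);
    std::memcpy(p, rng_str.data(), rng_str.size());

    llama_set_state_data(ctx, state_mem.data());  // push the modified state back
}

// usage: reseed_rng_via_state(ctx, 42 * 1337);
```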
Hmm, just looking at the code, it seems like everything should be initialized. Will take a deeper look later if this problem remains unsolved.
Yes, kv_self.buf contains more stuff than just the tensors. Reading and writing kv_self.k/v.data as binary will work as long as the context length and the KV floating point type are exactly the same. If this were exposed from llama.h then for most applications it would be sufficient, IMHO.

Sure, you may serialize and deserialize your structure to a byte array. You should keep in mind that pointers have to be serialized correctly. If you are experiencing issues with memory bugs, try address sanitizers (or valgrind --tool=memcheck, but it is slower).

I tried with something like this:
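A sketch in that spirit (the helper name and signature are illustrative rather than the exact code; inside llama.cpp it would be handed kv_self.k and kv_self.v):

```cpp
#include "ggml.h"
#include <cstdint>
#include <cstring>

// Raw-copy the k/v tensor bytes into a caller-provided buffer.
// Returns the number of bytes required; copies only if dst is large enough.
// A byte-for-byte copy like this is only valid for restoring when n_ctx and
// the KV floating point type are exactly the same.
static size_t copy_kv_tensors(const struct ggml_tensor * k, const struct ggml_tensor * v,
                              uint8_t * dst, size_t dst_size) {
    const size_t k_size = ggml_nbytes(k);
    const size_t v_size = ggml_nbytes(v);
    if (dst == NULL || dst_size < k_size + v_size) {
        return k_size + v_size;   // lets the caller query the required size first
    }
    memcpy(dst,          k->data, k_size);
    memcpy(dst + k_size, v->data, v_size);
    return k_size + v_size;
}
```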
This will copy the data to the user’s buffer, which needs to have sufficient space; that is also why there are additional functions to query the size. If the full context is needed, then it would be simpler. Also, since ggml_cpy can change data types, it may be possible to let the user extract only f32 or f16; this code only gives whatever format is currently used by the model.
It is correct that the PR does not implement this - but it describes that the last tokens etc. are needed to save the full state 😃 I just wanted to add the missing API needed to implement a prompt-saving mechanism.
Whoops, sorry, I just realized you obviously still need to eval the prompt tokens again. Here is the working version for future reference.
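A sketch of that working version, continuing the minimal example from the top of this issue; the assumed change is simply re-evaluating the prompt tokens after restoring the cache, before the second sampling loop:

```cpp
// restore state (as before)
llama_set_kv_cache(ctx, kv_cache_copy.data(), kv_cache_size, kv_cache_token_count);
last_n_tokens_data = last_n_tokens_data_copy;
n_past = n_past_copy;

// eval the prompt tokens again so the logits (which are not part of the kv_cache)
// are populated before sampling
llama_eval(ctx, tokens.data(), n_tokens, 0, 12);

// ... then run the second llama_sample_top_p_top_k / llama_eval loop exactly as above ...
```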
It works great for me! Thank you @xaedes !
@ggerganov I believe the issue is that llama_sample_top_p_top_k is expecting the logits but they’re not being saved and restored with this kv_cache approach. Adding a check here and running that first example I gave seems to reveal the issue: https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L1493

As you can see, n_logits is going to just be the length of the vocab and logits will be a size-0 vector, causing an illegal memory access and the resulting segfault.

@ggerganov fantastic, I can confirm that example 2 from this comment does work, however the first example still causes a segfault. I assume that’s because some buffers are being accessed in sample that are only initialised on the first eval.
@abetlen and all
I think this commit should fix the issue: https://github.com/ggerganov/llama.cpp/commit/8687c1f2581d059cd5b6a9502f89bd343566062a
A great test is to give the model a prompt saying its name, and then the test would be to ask it “What’s your name” - if it responds with the correct name everything works.
Here is how I have implemented saving the “state” of the model:
- llama_get_kv_cache_size
- llama_get_kv_cache
- llama_get_kv_cache_token_count
- n_past
- last_n_tokens and its size

This is all you need. After you have eval’ed the prompt you do the above steps and save the results. Then to restore you do the opposite - and test it all with the AI name prompt trick.
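For illustration, the saved pieces could be bundled into a plain struct (the struct itself is just a sketch, not from this comment):

```cpp
#include "llama.h"
#include <cstdint>
#include <vector>

struct llama_saved_state {
    std::vector<uint8_t>     kv_cache;        // llama_get_kv_cache(ctx), llama_get_kv_cache_size(ctx) bytes
    int                      kv_token_count;  // llama_get_kv_cache_token_count(ctx)
    int                      n_past;          // number of tokens already eval'ed
    std::vector<llama_token> last_n_tokens;   // repetition-penalty window and its size
};
```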
@ivanstepanovftw do you have an example? I’ve tried not eval’ing, but in that case, even with n_past saved, the model fails to generate the same output (just random generation).

I don’t think you need to eval the initial prompt, because you would want to avoid this.
@chrfalch sorry to bug you again on this one but I think I’m missing something.
From my understanding based on your response, you should be able to save the internal state to disk, assuming you also save n_past and last_n_tokens; however, I’m still not able to do this correctly / in a way that reduces processing time once the model is reloaded. Here is what I’m doing:
- llama_init_from_file the model.
- Tokenize a prompt (e.g. the quick brown fox jumps).
- llama_eval the prompt, set n_past to the number of prompt tokens, and start to fill last_n_tokens_data with the prompt tokens.
- Save kv_cache, kv_cache_size, kv_cache_token_count, n_past and last_n_tokens.
- Generate tokens in a llama_sample_top_p_top_k / llama_eval loop (e.g. over the lazy dog).
- llama_free the context.
- llama_init_from_file again.
- Call llama_set_kv_cache with the saved values from above.
- Set n_past and last_n_tokens to the saved values.
- Generate tokens in a llama_sample_top_p_top_k / llama_eval loop (e.g. over the lazy dog) starting from the saved values of n_past and last_n_tokens.

I would now expect to get the same output based on the original prompt (e.g. over the lazy dog), but it seems that the model is not taking this into account and instead I get back a random response.

Appreciate any help on this one, cheers.
EDIT: And just to clarify, if I call eval after restoring the kv_cache as I did above it doesn’t seem to reduce processing time.