llama.cpp: Longer and infinite output

If we use -n 1000000 to request a very long output (for a story, for example), generation stops quite quickly, after around 30 lines, probably because of this line of code.

It would be nice if we could have longer outputs, and also the possibility of infinite output that stops only on Ctrl-C. We could perhaps specify that -n 0 triggers this infinite output mode. This issue is somewhat related to issue #23.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 14
  • Comments: 57 (34 by maintainers)

Most upvoted comments

Infinite generation should now be supported. The current implementation works like this:

  • Keep generating until the context of n_ctx tokens (e.g. 2048) becomes full
  • When full, set n_past = n_keep, where n_keep is a user-provided parameter. By default it is 0, and it can be set to a positive value in order to keep a “static” prompt. See examples/chat.sh for how the Bob instructions are turned into a “static” prompt. You can observe the “static” prompt by adding the --verbose-prompt argument
  • These n_keep tokens are instantly available thanks to the KV cache, so nothing needs to be recomputed for them
  • Next, we take the most recent (n_ctx - n_keep)/2 generated tokens and insert them for re-inference (see the sketch below). This currently happens serially, so there is a noticeable delay when it occurs
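
For illustration only, here is a minimal, self-contained sketch of that swap step, using toy numbers (n_ctx = 8, n_keep = 2) and plain integers standing in for tokens. The real implementation additionally evaluates the re-inserted tokens through the model, which is what causes the pause:

    #include <cstdio>
    #include <vector>

    int main() {
        const int n_ctx  = 8; // toy context size
        const int n_keep = 2; // prompt tokens to always keep ("static" prompt)

        std::vector<int> last_tokens; // everything evaluated so far (prompt + generated)
        int n_past = 0;               // number of tokens currently in the KV cache

        for (int tok = 0; tok < 30; ++tok) { // pretend these are freshly sampled tokens
            if (n_past + 1 > n_ctx) {
                // context is full: keep the first n_keep tokens (already in the KV cache)
                // and re-insert the last (n_ctx - n_keep)/2 tokens for re-inference
                const int n_left = n_past - n_keep;
                n_past = n_keep;

                std::vector<int> reeval(last_tokens.end() - n_left/2, last_tokens.end());
                std::printf("swap: re-evaluating %zu tokens on top of %d kept tokens\n",
                            reeval.size(), n_keep);

                // the real program evaluates these tokens through the model here,
                // serially, which is why the swap causes a visible pause
                n_past += (int) reeval.size();
            }

            last_tokens.push_back(tok);
            n_past += 1;
        }

        return 0;
    }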

For example, this command should generate text forever:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 512 -t 8 -n -1 --ignore-eos 

Hmm, I think yes: we need to shift the KV cache. I haven’t implemented this in any of the ggml examples yet. And when the context is full, we should stop increasing n_past.

@tjohnman

make clean
LLAMA_OPENBLAS=1 make

Yeah, the context swap is not free and causes a pause for me when it happens. It still beats the former behavior (or lack thereof), though. Long starting prompts and large contexts really start to eat into speed. I never thought I’d want to upgrade my CPU and RAM for CPU-based inference. I really look forward to breakthroughs and “OH” moments. I think a useful strategy for now is to condense starting prompts so you cover as much as possible with fewer tokens.

Can confirm that on the latest commit https://github.com/ggerganov/llama.cpp/commit/b391579db92f095666be1d979899b54ae0981573 the infinite generation mode with main -m ./models/llama-13B-ggml/ggml-model-gptq4.bin -n -1 -c 2024 --ignore-eos --n_parts 1 --interactive-first works. Also, for the first time since the tokenizer change, I’m able to run it indefinitely without any crashes, so it seems the segfault problem has also been fixed recently.

However, as my CPU is pretty underwhelming (i5-6600K, 4c/4t), after ~1800 words the generation slows to such a crawl that not a single token gets generated in several minutes. Note that I said words, because there is no log of how many tokens were generated. I’m thinking of adding an option where interjecting with Ctrl+C would print debug info such as the number of tokens generated so far, because with my current hardware I’m not able to attach a debugger or run traces; the performance penalty from debugging tools is too big to do anything useful.

This PR also looks very interesting for debugging purposes. In particular, the link between output length and speed degradation could be investigated to better understand the cause of the bottleneck.

Please correct me if I’m wrong, since I have yet to fully grasp LLMs in general and I’m still very much in the learning phase.

If I understood correctly, the general idea of these models is to infer the next token by answering the question “Given the previous token stream X, what should be the next token?” using probability analysis.

And that includes the end-of-stream token, correct? So when the calculation ends up predicting “stop here”, it stops there.

Currently there is an option to ignore that token for the whole session, added in this commit: https://github.com/ggerganov/llama.cpp/commit/50fae10d0339f2bd639f69dd679c0201d939a265
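
To make that concrete, here is a toy, self-contained sketch of how suppressing the end-of-stream token before sampling can work. None of this is actual llama.cpp code; the vocabulary size, EOS id, and greedy sampler are made up for illustration:

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Stand-in values: the real vocabulary and EOS id come from the model.
    const int N_VOCAB   = 8;
    const int EOS_TOKEN = 2;

    // Toy "sampler": picks the highest-logit token (greedy), just for illustration.
    int sample(const std::vector<float> & logits) {
        return (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
    }

    int main() {
        const bool ignore_eos = true; // corresponds to the --ignore-eos flag

        std::mt19937 rng(42);
        std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

        for (int i = 0; i < 20; ++i) {
            // pretend these logits came out of the model for the next token
            std::vector<float> logits(N_VOCAB);
            for (auto & l : logits) l = dist(rng);
            logits[EOS_TOKEN] = 5.0f; // the model "wants" to stop here

            if (ignore_eos) {
                // suppress the end-of-stream token so it can never be sampled
                logits[EOS_TOKEN] = -1e9f;
            }

            const int tok = sample(logits);
            if (tok == EOS_TOKEN) {
                std::printf("end of stream predicted, stopping\n");
                break;
            }
            std::printf("generated token %d\n", tok);
        }

        return 0;
    }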

What is your opinion on including “magic keywords” in the interactive mode to control the variables? This could be extended to other variables too, but what I’m specifically thinking about here is a “continue” magic keyword. In the normal operation mode (not overridden by --ignore-eos), after the end-of-stream token is reached and control is given back to the user, inputting “continue” would not put that text into the input stream, but instead skip the EOS token and continue from there, once (a rough sketch of the idea follows below). It would be more flexible than having a global variable for the whole session.
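
A hypothetical sketch of how such a one-shot override could fit into an interactive loop; nothing here is existing llama.cpp code, and generate_some_text is just a placeholder for a real generation step:

    #include <iostream>
    #include <string>

    // Placeholder for a generation step; returns true when the model predicts
    // the end-of-stream token. In the real program this would call into the model.
    bool generate_some_text() {
        std::cout << "[model generates some text...]\n";
        return true; // pretend EOS was reached so control goes back to the user
    }

    int main() {
        std::string line;
        while (true) {
            const bool hit_eos = generate_some_text();

            if (hit_eos) {
                // end-of-stream: hand control back to the user
                std::cout << "> ";
                if (!std::getline(std::cin, line)) break;

                if (line == "continue") {
                    // magic keyword: do not feed "continue" to the model,
                    // just discard this EOS and resume generation, once
                    continue;
                }

                // any other input would be appended to the context as usual
                std::cout << "[appending \"" << line << "\" to the context]\n";
            }
        }
        return 0;
    }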

While I obviously can’t be certain, it seems to me that this is how OpenAI’s interactive ChatGPT demo does it.

This would only solve the “reached the end-of-stream token too early” problem, though; the other half is the need for a sliding context window in order to have infinite output.

Again, please feel free to correct me if I understood something wrong.

It seems there are two possible solutions:

Swap idea: while using the model, take half of the input tokens and start building inference over them in the background. When we run out of context, swap to the second instance. The model will lose all memory of what came before the new context window (a toy illustration follows after these two options). Cons: this is computationally intensive. Pros: I am confident that this is technically possible.

Rolling context: the problem here is positional encoding. The original positional encoding for transformers is not able to handle a rolling context, because the transformer is trained on fixed encodings. Something like the ALiBi encoding might work for this, but we can’t choose the encoding because, as mentioned, Facebook chose RoPE during training. Rotary positional encoding sounds promising, but after skimming the paper it does not seem likely to me that RoPE would support something like that. Pros: it would be highly elegant, and as mentioned the model could carry over information from beyond the context window. Cons: it seems unlikely to me that it is possible (I’m not confident in that prediction).
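
To make the trade-off of the swap idea concrete, here is a toy, model-free illustration; this is not llama.cpp code, the two “contexts” are plain token vectors, and feeding every new token to both of them stands in for the duplicated inference cost:

    #include <cstdio>
    #include <vector>

    int main() {
        const int n_ctx = 8;     // toy context size

        std::vector<int> active;  // context instance used for generation
        std::vector<int> standby; // second instance built up in parallel

        for (int tok = 0; tok < 30; ++tok) {
            active.push_back(tok); // pretend tok was just generated

            // once the active context is half full, also feed every new token to the
            // standby instance (in reality this roughly doubles the evaluation work)
            if ((int) active.size() > n_ctx/2) {
                standby.push_back(tok);
            }

            // when the active context is full, swap to the standby instance;
            // everything before its window is forgotten, but generation can
            // continue immediately because the standby instance is already built
            if ((int) active.size() == n_ctx) {
                std::printf("swap: continuing with the last %zu tokens\n", standby.size());
                active.swap(standby);
                standby.clear();
            }
        }

        return 0;
    }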

Depending on how much memory you have, you can increase the context size to get longer outputs. On a 64 GB machine I was able to have a 12k context with the 7B model and a 2k context with the 65B model. You can change it here.