llama.cpp: parallel/server crashes with: ggml.c:16521: i != GGML_HASHTABLE_FULL when defragmentation is enabled

Context

Using the latest llama.cpp server, at commit 17e98d4c96a583d420f12046bc92102381dbd28e.

Server started with a Llama-70B-class F16 model:

server \
    --model model-f16.gguf \
    --ctx-size 32768 \
    --n-predict 4096 \
    --parallel 32 \
    --n-gpu-layers 81 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --metrics \
    --mg 1 \
    --log-format text \
    --defrag-thold 0.1

When sending 32 concurrent requests, the server crashes with:

GGML_ASSERT: /llama.cpp/ggml.c:16521: i != GGML_HASHTABLE_FULL

Backend is CUDA, on 2x A100 GPUs (compute capability 8.0).

EDIT: The issue is related to defragmentation; quick fix: disable defragmentation (i.e. start the server without --defrag-thold).
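
For example, the same server invocation with --defrag-thold simply omitted (with no threshold set, defragmentation should stay disabled by default):

server \
    --model model-f16.gguf \
    --ctx-size 32768 \
    --n-predict 4096 \
    --parallel 32 \
    --n-gpu-layers 81 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --metrics \
    --mg 1 \
    --log-format text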

About this issue

  • State: open
  • Created 3 months ago
  • Comments: 28 (3 by maintainers)

Most upvoted comments

A proper fix will take some time, but this should fix it for now.

--- a/llama.cpp
+++ b/llama.cpp
@@ -10666,7 +10666,7 @@ static void llama_kv_cache_defrag_internal(struct llama_context & lctx) {
     // each move requires 6*n_layer tensors (see build_defrag)
     //   - source view, destination view, copy operation
     //   - x2 for keys and values
-    const uint32_t max_moves = LLAMA_MAX_NODES/(6*n_layer);
+    const uint32_t max_moves = (LLAMA_MAX_NODES - 2*n_layer)/(6*n_layer);

     // determine which KV cells to move where
     //
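
To make the headroom concrete, here is the arithmetic, assuming n_layer = 80 for a 70B model and LLAMA_MAX_NODES = 8192 (illustrative values, not taken from this thread):

# before the patch: 8192 / (6*80) = 17 moves, i.e. up to 17*6*80 = 8160 of the 8192 graph nodes
echo $(( 8192 / (6*80) ))
# after the patch: (8192 - 2*80) / (6*80) = 16 moves, i.e. at most 16*6*80 = 7680 nodes,
# leaving room for the remaining nodes of the defrag graph
echo $(( (8192 - 2*80) / (6*80) ))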

This command triggers the error quite easily:

./parallel -m models/llama-70b-v2/ggml-model-q4_k_s.gguf -np 32 -ngl 99 -b 4096 -ub 256 -c 32768 -ns 1024 -n 300 -cb -dt 0.1

Let me know if it does not work and I can send you a patch that triggers it every time.

Edit: this one also triggers it and uses a 13B model:

./parallel -m models/llama-13b-v2/ggml-model-q4_0.gguf -np 32 -ngl 99 -b 4096 -ub 256 -c 32768 -ns 1024 -n 300 -cb -dt 0.1

I think that the issue is that ggml_backend_sched allocates a hash table based on the actual size of the graph used during measure, but llama_kv_cache_defrag_internal uses up to LLAMA_MAX_NODES. Limiting the size of the hash table is important for performance, as clearing a large table has a significant cost.

32 sequences, each with 4096 tokens, require a KV cache of size 32*4096 = 131072 to handle the worst case, so this setup should theoretically run out of KV cache slots. Not sure about the error that you get, though; it could be related.
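
For reference, a quick check of the budget implied by the flags in the reported command:

echo $(( 32768 / 32 ))   # --ctx-size shared across --parallel 32: 1024 KV cells per sequence
echo $(( 32 * 4096 ))    # worst case with --n-predict 4096: 131072 cells would be needed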

Another thing to look into is the hypothesis that we are evaluating batches partially:

https://github.com/ggerganov/llama.cpp/issues/6617#issuecomment-2051618514

I suspect that this logic does not always work as expected because the batch splitting in llama_decode means that a part of the batch may have been evaluated even if an error is returned.

Do you get the error with equal batch and ubatch?

@slaren The temporary fix is behaving well: 32 users on an F16 70B with 1024 max context each for 30 min. No shifting, no full KV cache, and no crash 😃 I have pushed eedd42e3767efb49cd497cdef3943397b42ee935 so I can retrieve it reliably, but I will delete this temporary branch once you feel ready to submit the final patch. FYI, on 2x A100: average PP = 200 tk/s per sequence, TG = 4.5 tk/s per sequence, KV cache usage ratio = 0.78.

I will also test flash attention, but this is quite out of the scope of this issue.

Maybe @ggerganov has a suggestion for how to trigger a worst-case defragment in a simple way.

I am sorry, but what I meant by “simple command” was something like curl; this is too complicated and it is going to take too much time to reproduce.
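
Something in this spirit, for illustration (a hypothetical sketch, assuming the server's default port 8080 and its /completion endpoint):

# fire 32 concurrent completion requests at a running server
for i in $(seq 1 32); do
  curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me a long story about llamas.", "n_predict": 300}' > /dev/null &
done
wait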

I am not sure which model size will cause the issue, but these steps can be done for any model; I am trying to reproduce with a Llama 70B F16 in an isolated environment.

Assuming you are on a Debian-like OS.

  1. You need Go and xk6-sse
# May not install the latest Go
sudo apt install golang-go
# Or install from the official tarball
wget https://go.dev/dl/go1.22.2.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.22.2.linux-amd64.tar.gz

cd examples/server/bench
go install go.k6.io/xk6/cmd/xk6@latest
export PATH=~/go/bin:$PATH
xk6 build master \
    --with github.com/phymbert/xk6-sse
  2. Download the test dataset
cd examples/server/bench
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
  3. Run the server benchmark (it includes --defrag-thold 0.1)

cd examples/server/bench
pip install -r requirements.txt
mkdir models
LLAMA_SERVER_BIN_PATH=../../../cmake-build-debug/bin/server python bench.py \
    --runner-label local \
    --name local \
    --branch `git rev-parse --abbrev-ref HEAD` \
    --commit `git rev-parse HEAD` \
    --scenario script.js \
    --duration 10m \
    --hf-repo TheBloke/KafkaLM-70B-German-V0.1-GGUF \
    --hf-file kafkalm-70b-german-v0.1.Q2_K.gguf \
    --model-path-prefix models \
    --parallel 32 \
    -ngl 81 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --ctx-size 32768 \
    --n-prompts 1000 \
    --max-prompt-tokens 512 \
    --max-tokens 1024

I will update the steps if I miss something here, but this is the general idea.

EDIT: I am not German, but this is the latest model I found on HF 😄

Is there a simple command that I can run to reproduce this issue? It seems that the issue is the size of the KV defrag graph.

It should give you a full call stack automatically if gdb is installed. Otherwise, just run gdb --args ./server ... and type bt to get the call stack when it crashes. It will be more accurate with a debug build, but even a release build will probably be enough to figure out the issue.
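
For example, a sketch of such a session, reusing the server arguments from the top of this issue:

gdb --args ./server --model model-f16.gguf --ctx-size 32768 --parallel 32 \
    --n-gpu-layers 81 --batch-size 4096 --ubatch-size 256 --defrag-thold 0.1
(gdb) run
# ... send the concurrent requests, wait for the GGML_ASSERT to fire, then:
(gdb) bt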

The size of a hash table is probably being underestimated; however, without the call stack it is hard to figure out which one.