llama-cpp-python: Commit `7c898d5` breaks generation on GPU
From commit `7c898d5` onwards, the output of any type of generation/completion on the GPU is just "#" repeated forever. For instance, using the example from README.md:
```python
from llama_cpp import Llama

llm = Llama(model_path='models/llama2-7b.q4_0.gguf', n_gpu_layers=100)

for s in llm('Building a website can be done in 10 simple steps:\nStep 1:', stream=True):
    print(s)
```
The output is the following, repeated:
```
{'id': 'cmpl-14ed3b80-49af-453d-99a4-c7925f5680f7', 'object': 'text_completion', 'created': 1705351368, 'model': 'models/llama2-7b.q4_0.gguf', 'choices': [{'text': '#', 'index': 0, 'logprobs': None, 'finish_reason': None}]}
```
Generation works fine on the CPU and for previous commits, and it doesn't seem to be related to quantization or model type. Interestingly, generation also works using pure llama.cpp through the `main` interface for both CPU and GPU; I tested this on the current `master` and on the commits around the above change (notably `76484fb` and `1d11838`). I also managed to get it working in llama-cpp-python using the low level API, just using simple batching and `llama_decode` (rough sketch below).
Environment info:
- GPU: RTX A6000
- OS: Linux 6.6.0-0.rc5
- CUDA SDK: 12.2
- CUDA Drivers: 535.113.01

Thanks!
About this issue
- Original URL
- State: closed
- Created 6 months ago
- Reactions: 3
- Comments: 17 (9 by maintainers)
Same here.
I am using the fastapi server. I observed that the server could generate meaningful responses for the first few short inputs. When I asked it to respond to a long input, it repeated `#` forever. Then I retried with the previous short inputs and got only `#`. Downgrading `llama_cpp_python` to `0.2.28` solves the issue.

@iamlemec @iactix should be in 0.2.32, let me know if that works! @iamlemec thanks again for all the help identifying this issue!
`offload_kqv` is now set to `True` by default starting from version 0.2.30.

This is indeed a bug in llama.cpp, but I would strongly recommend enabling `offload_kqv` by default, as it is in llama.cpp. Even in cases with low VRAM, it is usually better to offload fewer layers and keep `offload_kqv` enabled.
I think I found the answer! You need to set `offload_kqv=True` for things to work. The default in the `Llama` class is `False`, but the underlying default from `llama_context_default_params` is `True`, which explains why it was working with the low level API.
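A quick way to see that mismatch from Python, assuming an affected wrapper version is installed (no model needs to be loaded for either call):

```python
import inspect

import llama_cpp
from llama_cpp import Llama

# Default used by llama.cpp itself (and hence by the low level API):
print(llama_cpp.llama_context_default_params().offload_kqv)   # True

# Default the high level Llama wrapper passes instead (False on affected versions):
print(inspect.signature(Llama.__init__).parameters['offload_kqv'].default)

# Until the two defaults agree, passing offload_kqv=True to Llama(...) explicitly,
# as in the sketch above, works around the repeated "#" output.
```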