vllm: ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
I followed the Quickstart tutorial and deployed the Chinese-LLaMA-Alpaca-2 model with vLLM, and got the following error:
***@***:~/Code/experiment/***/ToG$ CUDA_VISIBLE_DEVICES=0 python load_llm.py
INFO 01-11 15:51:02 llm_engine.py:70] Initializing an LLM engine with config: model='/home/***/***/models/alpaca-2', tokenizer='/home/***/***/models/alpaca-2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
INFO 01-11 15:51:18 llm_engine.py:275] # GPU blocks: 229, # CPU blocks: 512
Traceback (most recent call last):
  File "load_llm.py", line 8, in <module>
    llm = LLM(model='/home/***/***/models/alpaca-2')
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 309, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 284, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
My code is:
from vllm import LLM, SamplingParams

# Prompt(s) to generate completions for.
prompts = [
    "Hello, who are you?",
]

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model; this is the line where the ValueError is raised.
llm = LLM(model='/home/b3432/***/models/alpaca-2')

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
What's going on, and what do I need to do to fix the error? I'm running the code on a single RTX 3090 (24 GB). Looking forward to a reply!
Set `max_model_len` below the KV cache capacity (here, at most 3664); it works. Alternatively, try raising `gpu_memory_utilization` to 0.95 or 1.0 for vLLM; then it will run successfully.
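Either fix can be passed straight to the `LLM` constructor. For context, the log above reports 229 GPU blocks, and vLLM's default KV cache block size is 16 tokens per block, so the cache holds 229 × 16 = 3,664 tokens, which is where the 3664 in the error comes from. Below is a minimal sketch of both options, reusing the (elided) model path from the question:

from vllm import LLM, SamplingParams

prompts = ["Hello, who are you?"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Option 1: cap the context length so it fits in the available KV cache
# (229 blocks * 16 tokens/block = 3664 tokens on this GPU).
llm = LLM(model='/home/b3432/***/models/alpaca-2', max_model_len=3664)

# Option 2: give vLLM a larger share of GPU memory for the KV cache
# instead (the default is 0.90):
# llm = LLM(model='/home/b3432/***/models/alpaca-2', gpu_memory_utilization=0.95)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

Of the two, lowering `max_model_len` is the safer choice on a 24 GB card: pushing `gpu_memory_utilization` to 1.0 may cause out-of-memory errors if anything else is using the GPU.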