vllm: CUDA error: out of memory
I successfully installed vLLM in WSL2, but when I run the sample code below I get a CUDA out-of-memory error:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
```
INFO 06-21 21:40:02 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 21:40:12 llm_engine.py:128] # GPU blocks: 37375, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 11, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
Environment: Python 3.10.11, GPU: RTX 3090 24GB, OS: WSL2 (Ubuntu 20.04.6 LTS). Can anyone help with this?
I'm getting the CUDA out-of-memory error with Mistral on Ubuntu as well, and playing with --gpu-memory-utilization doesn't seem to make a difference. I tried to follow the AWQ advice above and that works (Ubuntu 22.04, RTX 2080 Ti, 32GB RAM), but I'm not sure how to get the original, unquantized Mistral running. Anything I'm missing?
After setting --max-model-len 8192 the OOM went away. I also use an AWQ quant with the --quantization awq parameter. Works amazingly!

@nosolosoft, what if you try using --max-model-len 4096?
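For anyone reading along, here is a rough sketch of how those server flags map onto the Python LLM class; the model name and the exact values are placeholders rather than something taken from this thread:

```python
from vllm import LLM, SamplingParams

# --quantization awq and --max-model-len on the server correspond to the
# quantization and max_model_len keyword arguments of the LLM class.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    max_model_len=8192,          # try 4096 if 8192 still runs out of memory
    gpu_memory_utilization=0.9,  # fraction of VRAM vLLM may reserve
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```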
Just pair-debugged with @SunixLiu and we successfully located the issue. As a temporary fix, please comment out pin_memory=True in the vLLM code that allocates the CPU cache: https://github.com/vllm-project/vllm/blob/4026a049d3ad510bea8e177bf71722bd510fbb46/vllm/worker/cache_engine.py#L89-L97. pin_memory has a limit in WSL (see the official doc), and the limit seems to be 2GB. After commenting it out, vLLM should work properly.
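To illustrate the workaround, here is an approximate sketch of the allocation loop in cache_engine.py with pin_memory=True commented out; the helper names and surrounding structure are paraphrased from that era of vLLM, so treat everything except the commented-out argument as an approximation:

```python
def allocate_cpu_cache(self) -> List[KVCache]:
    cpu_cache: List[KVCache] = []
    key_block_shape = self.get_key_block_shape()
    value_block_shape = self.get_value_block_shape()
    for _ in range(self.num_layers):
        key_blocks = torch.empty(
            size=(self.num_cpu_blocks, *key_block_shape),
            dtype=self.dtype,
            # pin_memory=True,  # WSL caps pinned (page-locked) host memory at ~2GB
        )
        value_blocks = torch.empty(
            size=(self.num_cpu_blocks, *value_block_shape),
            dtype=self.dtype,
            # pin_memory=True,  # same workaround for the value cache
        )
        cpu_cache.append((key_blocks, value_blocks))
    return cpu_cache
```

Leaving the CPU cache unpinned makes host-to-device copies somewhat slower, but it avoids the WSL pinned-memory limit.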
@allanwakes @prashantskit have you come across a solution?
I'm seeing the same with Mistral 7B on an RTX 3090 24GB.
Got OOM as well, on a 32GB V100 with a 7B LLaMA model. It shouldn't OOM, so why does it?
@AlpinDale Good question. You can use the tensor_parallel_size argument for multi-GPU inference. First, initialize your Ray cluster, then pass the tensor_parallel_size argument to the LLM class.
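A minimal sketch of that flow, assuming a single machine with two GPUs; the model name and GPU count are placeholders:

```python
# Start Ray before constructing the engine, e.g. on the head node:
#   ray start --head

from vllm import LLM

# tensor_parallel_size shards the model weights across GPUs;
# 2 is a placeholder, set it to the number of GPUs you want to use.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)
outputs = llm.generate(["San Francisco is a"])
print(outputs[0].outputs[0].text)
```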