vllm: CUDA error: out of memory

I successfully installed vLLM in WSL2, but when I try to run the sample code below, I get the following error:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

INFO 06-21 21:40:02 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 21:40:12 llm_engine.py:128] # GPU blocks: 37375, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 11, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Python: 3.10.11
GPU: RTX 3090 24G
Linux: WSL2, Ubuntu 20.04.6 LTS

Can anyone help to answer this?

About this issue

  • State: closed
  • Created a year ago
  • Comments: 26

Most upvoted comments

I'm getting the CUDA out-of-memory error with Mistral on Ubuntu as well, and playing with --gpu-memory-utilization doesn't seem to make a difference.

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model mistralai/Mistral-7B-v0.1 --dtype half --gpu-memory-utilization 0.8  --max-model-len 4096

results in:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 10.73 GiB total capacity; 9.85 GiB already allocated; 46.44 MiB free; 9.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
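
For the Python API, the same knobs can be set directly on the LLM constructor, and the allocator hint from the error message can be applied via the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch, assuming a vLLM version where LLM accepts gpu_memory_utilization and max_model_len (the 128 MiB split size is only an example value, not a verified fix):

import os

# Allocator option suggested by the error message above; it must be set before
# PyTorch initializes the CUDA caching allocator, so set it before any import
# that touches CUDA. The value 128 is only an example.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from vllm import LLM

# Python-API equivalent of the CLI flags above; a lower gpu_memory_utilization
# and a smaller max_model_len both shrink the KV-cache reservation.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    dtype="half",
    gpu_memory_utilization=0.8,
    max_model_len=4096,
)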

I tried to follow the AWQ advice above and this works:

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mistral-7B-OpenOrca-AWQ --dtype half --gpu-memory-utilization 0.7  --max-model-len 4096 --quantization awq

(Ubuntu 22.04, RTX 2080 Ti, 32GB RAM)

But I’m not sure how to get the original Mistral running. Anything I’m missing?
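
For reference, the working AWQ command above maps onto the Python API roughly like this (untested sketch; quantization="awq" mirrors the --quantization awq flag):

from vllm import LLM, SamplingParams

# AWQ-quantized weights plus a reduced memory budget, mirroring the CLI flags above
llm = LLM(
    model="TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.7,
    max_model_len=4096,
)

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)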

After setting --max-model-len 8192 the OOM went away. I also use an AWQ quant with the --quantization awq parameter. Works amazingly well!

@nosolosoft, what if you try using --max-model-len 4096?

Just pair-debugged with @SunixLiu and we successfully located the issue. As a temporary fix, please comment out pin_memory=True in the vLLM code when allocating the CPU cache:

https://github.com/vllm-project/vllm/blob/4026a049d3ad510bea8e177bf71722bd510fbb46/vllm/worker/cache_engine.py#L89-L97

pin_memory has a limit in WSL (see the official doc), and the limit seems to be 2 GB. After commenting this out, vLLM should work properly.
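
To make the failure mode concrete, here is a small standalone sketch (not the vLLM code itself) showing why a large pin_memory=True allocation can fail on WSL2 while a pageable one succeeds; the 4 GiB size is an arbitrary example above the reported ~2 GB cap:

import torch

size_gib = 4                  # example size above the reported WSL2 pinned-memory cap
n = size_gib * 1024**3 // 2   # number of float16 elements (2 bytes each)

try:
    # Pinned (page-locked) host memory: this is what the vLLM CPU cache uses,
    # and what hits the WSL2 limit even though it is CPU memory.
    pinned = torch.empty(n, dtype=torch.float16, pin_memory=True)
except RuntimeError as e:
    print("pinned allocation failed:", e)

# The workaround above amounts to this: pageable host memory (no pin_memory),
# at the cost of slower host-to-device copies.
pageable = torch.empty(n, dtype=torch.float16)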

@allanwakes @prashantskit have you come across a solution?

I’m seeing the same with Mistral 7B on an RTX 3090 24GB.

Got OOM as well, on a 32GB V100 with a 7B LLaMA model. It shouldn't OOM, so why does it?

@AlpinDale Good question. You can use the tensor_parallel_size argument for multi-GPU inference. First, initialize your Ray cluster by executing

$ ray start --head

Then, use the tensor_parallel_size argument in the LLM class:

llm = LLM(model=<your model>, tensor_parallel_size=2)  # Inference with 2 GPUs
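
Putting it together, a minimal end-to-end sketch (the model name is only an example, and it assumes ray start --head has already been run with two GPUs visible):

from vllm import LLM, SamplingParams

# Example model only; any HF model ID or local path works here.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)

outputs = llm.generate(["The future of AI is"],
                       SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)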