vllm: CUDA error: out of memory

I successfully installed vLLM in WSL2, but when I try to run the sample code below, I get the following error:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

INFO 06-21 21:40:02 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 21:40:12 llm_engine.py:128] # GPU blocks: 37375, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 11, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Python: 3.10.11
GPU: RTX 3090 24G
Linux: WSL2, Ubuntu 20.04.6 LTS

Can anyone help to answer this?

About this issue

  • State: closed
  • Created a year ago
  • Comments: 26

Most upvoted comments

I'm getting the CUDA out-of-memory error with Mistral on Ubuntu as well, and playing with --gpu-memory-utilization doesn't seem to make a difference.

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model mistralai/Mistral-7B-v0.1 --dtype half --gpu-memory-utilization 0.8  --max-model-len 4096

results in:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 10.73 GiB total capacity; 9.85 GiB already allocated; 46.44 MiB free; 9.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
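
For the Python API, the same knobs can be set directly on the LLM constructor, and the allocator hint from the error message can be applied via the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch, assuming a vLLM version where LLM accepts gpu_memory_utilization and max_model_len (the 128 MiB split size is only an example value, not a verified fix):

import os

# Allocator option suggested by the error message above; it must be set before
# PyTorch initializes the CUDA caching allocator, so set it before any import
# that touches CUDA. The value 128 is only an example.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from vllm import LLM

# Python-API equivalent of the CLI flags above; a lower gpu_memory_utilization
# and a smaller max_model_len both shrink the KV-cache reservation.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    dtype="half",
    gpu_memory_utilization=0.8,
    max_model_len=4096,
)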

I tried to follow the AWQ advice above and this works:

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mistral-7B-OpenOrca-AWQ --dtype half --gpu-memory-utilization 0.7  --max-model-len 4096 --quantization awq

(Ubuntu 22.04, RTX 2080 Ti, 32GB RAM)

But I’m not sure how to get the original Mistral running. Anything I’m missing?
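
For reference, the working AWQ command above maps onto the Python API roughly like this (untested sketch; quantization="awq" mirrors the --quantization awq flag):

from vllm import LLM, SamplingParams

# AWQ-quantized weights plus a reduced memory budget, mirroring the CLI flags above
llm = LLM(
    model="TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.7,
    max_model_len=4096,
)

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)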

After setting --max-model-len 8192 the OOM went away. I also use an AWQ quant with the --quantization awq parameter. Works amazingly well!

@nosolosoft, what if you try using --max-model-len 4096?

Just pair-debugged with @SunixLiu and we successfully located the issue. As a temporary fix, please comment out pin_memory=True in the vLLM code when allocating the CPU cache:

https://github.com/vllm-project/vllm/blob/4026a049d3ad510bea8e177bf71722bd510fbb46/vllm/worker/cache_engine.py#L89-L97

pin_memory has a limit in WSL (see the official doc), and the limit seems to be 2 GB. After commenting this out, vLLM should work properly.
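
To make the failure mode concrete, here is a small standalone sketch (not the vLLM code itself) showing why a large pin_memory=True allocation can fail on WSL2 while a pageable one succeeds; the 4 GiB size is an arbitrary example above the reported ~2 GB cap:

import torch

size_gib = 4                  # example size above the reported WSL2 pinned-memory cap
n = size_gib * 1024**3 // 2   # number of float16 elements (2 bytes each)

try:
    # Pinned (page-locked) host memory: this is what the vLLM CPU cache uses,
    # and what hits the WSL2 limit even though it is CPU memory.
    pinned = torch.empty(n, dtype=torch.float16, pin_memory=True)
except RuntimeError as e:
    print("pinned allocation failed:", e)

# The workaround above amounts to this: pageable host memory (no pin_memory),
# at the cost of slower host-to-device copies.
pageable = torch.empty(n, dtype=torch.float16)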

@allanwakes @prashantskit have you come across a solution?

I’m seeing the same with Mistral 7B on an RTX 3090 24GB.

Got OOM as well, on a 32GB V100 with a 7B LLaMA model. It shouldn't OOM, so why does it?

@AlpinDale Good question. You can use the tensor_parallel_size argument for multi-GPU inference. First, initialize your Ray cluster by executing

$ ray start --head

Then, use the tensor_parallel_size argument in the LLM class:

llm = LLM(model=<your model>, tensor_parallel_size=2)  # Inference with 2 GPUs
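
Putting it together, a minimal end-to-end sketch (the model name is only an example, and it assumes ray start --head has already been run with two GPUs visible):

from vllm import LLM, SamplingParams

# Example model only; any HF model ID or local path works here.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)

outputs = llm.generate(["The future of AI is"],
                       SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)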