vllm: torch.cuda.OutOfMemoryError: CUDA out of memory

I am running this code example from Hugging Face's TheBloke/zephyr-7B-beta-AWQ model card:

from vllm import LLM, SamplingParams
import os

# Must be set before CUDA is initialized; caps the allocator's split size to reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

prompts = [
    "Tell me about AI",
    # "Write a story about llamas",
    # "What is 291 - 150?",
    # "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]

# Note: a plain string (not an f-string), so that {prompt} is left in place for str.format() below.
prompt_template = '''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/zephyr-7B-beta-AWQ", quantization="awq", dtype="auto", gpu_memory_utilization=0.5)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Unfortunately I get this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 GiB. GPU 0 has a total capacty of 14.58 GiB of which 9.93 GiB is free. Including non-PyTorch memory, this process has 4.64 GiB memory in use. Of the allocated memory 4.38 GiB is allocated by PyTorch, and 755.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am using an Nvidia T4 (16 GB memory). This is the output from nvidia-smi before running the script (see attached screenshot),

and this is the output from the nvitop command (see attached screenshot).

I understand that this process has 4.64 GiB in use. Why do I get an out-of-memory error?
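One thing I have not tried yet: if vLLM is pre-allocating the KV cache for the model's full 32768-token default context, capping the context length and letting vLLM use more of the card might make it fit. An untested sketch; max_model_len and gpu_memory_utilization are standard LLM() arguments, but the values 4096 and 0.9 are just guesses for a 16 GiB T4:

llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization="awq",
    dtype="auto",
    max_model_len=4096,          # guess: cap the context so the KV cache fits
    gpu_memory_utilization=0.9,  # guess: give vLLM most of the 16 GiB card
)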

Thanks in advance for any help!

About this issue

  • State: open
  • Created 6 months ago
  • Comments: 16 (4 by maintainers)

Most upvoted comments

Running the same code with vLLM version 0.2.5, I get a similar error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.50 GiB. GPU 0 has a total capacty of 23.69 GiB of which 3.09 GiB is free. Including non-PyTorch memory, this process has 20.57 GiB memory in use. Of the allocated memory 20.13 GiB is allocated by PyTorch, and 755.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Running nvidia-smi just before:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
| 30%   29C    P8              24W / 350W |     28MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:04:00.0 Off |                  N/A |
| 30%   26C    P8              20W / 350W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off | 00000000:05:00.0 Off |                  N/A |
| 30%   24C    P8              17W / 350W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       737      G   /usr/lib/xorg/Xorg                            9MiB |
|    0   N/A  N/A       885      G   /usr/bin/gnome-shell                          8MiB |
|    1   N/A  N/A       737      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A       737      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

It seems like something is taking up 20+ GiB of memory on the GPU (RTX 3090), but I don't know how to check that in real time.
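For watching GPU memory in real time, a small polling loop with pynvml (the nvidia-ml-py package) should do it; a minimal sketch, with an arbitrary 1-second interval:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU 0 memory used: {info.used / 1024**3:.2f} / {info.total / 1024**3:.2f} GiB")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()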
When I set tensor_parallel_size=2, it works, but I get this warning:

WARNING 12-17 10:21:30 config.py:308] Possibly too large swap space. 8.00 GiB out of the 15.51 GiB total CPU memory is allocated for the swap space.
INFO 12-17 10:23:05 llm_engine.py:222] # GPU blocks: 2937, # CPU blocks: 4096
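The swap-space warning presumably comes from vLLM's default of 4 GiB of CPU swap per GPU (so 8 GiB with tensor_parallel_size=2). If that matters, the swap_space argument should shrink it; a sketch, where the value 2 is arbitrary:

llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization="awq",
    dtype="auto",
    tensor_parallel_size=2,
    swap_space=2,  # GiB of CPU swap per GPU; the default is 4
)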

When trying to run vLLM as a server with python -m vllm.entrypoints.api_server --model="TheBloke/zephyr-7B-beta-AWQ" --tensor-parallel-size 2, I get:

2023-12-17 10:29:37,045 INFO worker.py:1673 -- Started a local Ray instance.
INFO 12-17 10:29:37 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/zephyr-7B-beta-AWQ', tokenizer='TheBloke/zephyr-7B-beta-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, seed=0)
WARNING 12-17 10:29:37 config.py:308] Possibly too large swap space. 8.00 GiB out of the 15.51 GiB total CPU memory is allocated for the swap space.
INFO 12-17 10:31:05 llm_engine.py:222] # GPU blocks: 0, # CPU blocks: 4096
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/lasse/test/.venv/lib/python3.11/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lasse/test/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 495, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lasse/test/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lasse/test/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lasse/test/.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/home/lasse/test/.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 226, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
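The error suggests increasing gpu_memory_utilization; the same knobs as in the Python API are exposed as CLI flags, so something along these lines might get past the cache-block error (the 4096 and 0.9 values are guesses):

python -m vllm.entrypoints.api_server \
    --model TheBloke/zephyr-7B-beta-AWQ \
    --quantization awq \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9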

Sorry if this drifted away from the original question; I hope this information gives some clues.