vllm: torch.cuda.OutOfMemoryError: CUDA out of memory
I am running this code example from Hugging Face's TheBloke/zephyr-7B-beta-AWQ:
import os

# Set the allocator config before vLLM/PyTorch initialize CUDA so it takes effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    # "Write a story about llamas",
    # "What is 291 - 150?",
    # "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]

# Zephyr chat template; the {prompt} placeholder is filled in per prompt below
# (note: not an f-string, so .format() can do the substitution).
prompt_template = '''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''
prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/zephyr-7B-beta-AWQ", quantization="awq", dtype="auto", gpu_memory_utilization=0.5)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Unfortunately, I get this error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 GiB. GPU 0 has a total capacty of 14.58 GiB of which 9.93 GiB is free. Including non-PyTorch memory, this process has 4.64 GiB memory in use. Of the allocated memory 4.38 GiB is allocated by PyTorch, and 755.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
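Pulling the figures out of that traceback (just restating what it already reports, as a sanity check):

# Figures taken verbatim from the traceback above, in GiB.
total = 14.58      # total capacity reported for GPU 0
free = 9.93        # free memory at the time of the failed allocation
in_use = 4.64      # memory this process already holds (PyTorch + non-PyTorch)
requested = 14.00  # size of the single allocation that failed

print(requested > free)          # True: the request alone exceeds the reported free memory
print(round(total - in_use, 2))  # 9.94, which matches the reported free memory up to rounding

So the single 14 GiB request is, by itself, larger than the 9.93 GiB reported as free.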
I am using an NVIDIA T4 with 16 GB of memory. This is the output from nvidia-smi before running the script,
and this is the output from the nvitop command.
I understand that the process has 4.64 GiB allocated. Why do I receive an out-of-memory error?
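In case it is useful, here is a minimal sketch of the memory-related arguments I could find on the LLM constructor; I am assuming max_model_len is accepted by this vLLM version, and the values are only illustrative guesses on my part, not a verified fix:

from vllm import LLM

# Sketch only, with illustrative values: max_model_len caps the context length
# (and therefore the KV cache that vLLM pre-allocates), and gpu_memory_utilization
# caps the fraction of GPU memory vLLM tries to claim.
llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization="awq",
    dtype="auto",
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)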
Thanks in advance for any help!
Running the same code with vLLM version 0.2.5, I get a similar error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.50 GiB. GPU 0 has a total capacty of 23.69 GiB of which 3.09 GiB is free. Including non-PyTorch memory, this process has 20.57 GiB memory in use. Of the allocated memory 20.13 GiB is allocated by PyTorch, and 755.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Running nvidia-smi just before:
It seems like something is taking up 20+ GiB of memory on the GPU (an RTX 3090), but I don't know how to check that in real time.
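To watch it in real time, I suppose a small polling loop like this sketch would do (assuming PyTorch can see the GPU; a shell loop around nvidia-smi, or nvitop itself, shows the same numbers):

import time
import torch

# Poll the CUDA driver once per second and print used/total memory for GPU 0.
while True:
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    used_gib = (total_bytes - free_bytes) / 1024**3
    total_gib = total_bytes / 1024**3
    print(f"GPU 0: {used_gib:.2f} GiB used / {total_gib:.2f} GiB total")
    time.sleep(1)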
When I set tensor_parallel_size=2 it works, but I get a warning. When trying to run vLLM as a server with
python -m vllm.entrypoints.api_server --model="TheBloke/zephyr-7B-beta-AWQ" --tensor-parallel-size 2
I also get an error. Sorry if this has drifted away from the original question; I hope this information gives some clues.