vllm: Running out of memory with TheBloke/CodeLlama-7B-AWQ
Test on llm-vscode-inference-server
I use the project llm-vscode-inference-server, which builds on vLLM, to load the model weights from CodeLlama-7B-AWQ with this command:
python api_server.py --trust-remote-code --model ../CodeLlama-7B-AWQ --quantization awq --dtype half --max-model-len 512
The output:
WARNING 10-26 12:34:54 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:34:54 llm_engine.py:72] Initializing an LLM engine with config: model='../CodeLlama-7B-AWQ', tokenizer='../CodeLlama-7B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:34:54 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Then, after about 5 minutes, it outputs:
INFO 10-26 12:39:51 llm_engine.py:207] # GPU blocks: 793, # CPU blocks: 512
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 12.00 GiB total capacity; 8.49 GiB already allocated; 1.53 GiB free; 8.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I had already set PYTORCH_CUDA_ALLOC_CONF with the following command before running the server, but still got the same error:
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:100
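(For reference, the set syntax above is the Windows cmd form; in a Linux/macOS shell the equivalent would be to export the variable before launching the server:)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:100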
Test on vLLM
Simply change the command to:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/CodeLlama-7B-Python-AWQ --quantization awq --dtype half
Without
--dtype half
it raises an error like:
ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]
And the output:
WARNING 10-26 12:44:31 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:44:31 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-7B-AWQ/', tokenizer='./CodeLlama-7B-AWQ/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:44:31 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Then the error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.38 GiB (GPU 0; 12.00 GiB total capacity; 4.17 GiB already allocated; 5.94 GiB free; 4.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
System Resource Usage
Before I execute the command, my RTX 3060's VRAM usage is 1.5/12 GB. After executing it, usage rises to 6.0/12 GB, and then after about 5 minutes it throws the OutOfMemoryError.
Question
I am just confused about why the AWQ model is only <= 4 GB in size but cannot run on an NVIDIA RTX 3060 with 12 GB of VRAM…
About this issue
- State: closed
- Created 8 months ago
- Comments: 20 (2 by maintainers)
When testing on vLLM, did you try --max-model-len 512? It looks from your output that it went to 16384.
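For reference, a sketch of that invocation based on the commands above (substitute your local model path if you have the weights downloaded; --max-model-len is what caps the max_seq_len shown in the engine log):
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/CodeLlama-7B-AWQ --quantization awq --dtype half --max-model-len 512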
For those with very limited VRAM, try setting the batched-token limit to about 4-8k and combine it with the GPU memory limit parameter set to about 0.8.
Also, try the non-quantised version with this first. It seems vLLM uses extra memory to do some kind of operation on the model when loading it quantised.
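Assuming the two parameters referred to above are vLLM's --max-num-batched-tokens and --gpu-memory-utilization flags, a sketch of the combined invocation would look like:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/CodeLlama-7B-AWQ --quantization awq --dtype half --max-model-len 512 --max-num-batched-tokens 8192 --gpu-memory-utilization 0.8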
I used the command as you provided, but it still costs 21 GB of VRAM when loading a 7B-AWQ model 😦
@SupreethRao99 Sorry, I had to find it. This is how I start my vLLM OpenAI server:
When I reduced max-num-batched-tokens down to 32768 from the higher number I had previously, I no longer experience CUDA memory errors. Try setting yours low as well and see if it helps.
I think I found a potential issue and solution. This is specifically because of how vLLM works.
Setting the ‘max_batch_tokens’ (I think that is the name) too high causes the KV cache to be too big. It directly influences the GPU memory occupied for some reason. Try setting your max_batch_tokens to around 32k while keeping everything else the same.
This fixed it for me.
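To give a rough sense of why this token limit matters, a back-of-the-envelope KV-cache estimate (assuming standard Llama-7B dimensions: 32 layers, hidden size 4096, float16 cache, no grouped-query attention):
# Approximate KV-cache footprint per token for a Llama-7B-shaped model in float16.
layers, hidden, bytes_per_elem = 32, 4096, 2
per_token = 2 * layers * hidden * bytes_per_elem  # keys + values -> 512 KiB per token
for tokens in (512, 8192, 16384):
    print(f"{tokens:>6} tokens -> {per_token * tokens / 2**30:.2f} GiB of KV cache")
# Holding a single full 16384-token sequence already needs about 8 GiB of KV cache
# on top of the ~4 GiB of AWQ weights, which is why capping the context length or
# the batched-token limit helps so much on a 12 GiB card.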
Isn’t it --max_model_len, or am I mistaken? Btw, the 7B model should definitely fit into a 512-token context.