vllm: Running out of memory loading 7B AWQ-quantized models with 12 GB VRAM
Hi,
I am trying to make use of AWQ quantization to load 7B LLaMA-based models onto my RTX 3060 with 12 GB. This fails with OOM for models like https://huggingface.co/TheBloke/leo-hessianai-7B-AWQ . I was able to load https://huggingface.co/TheBloke/tulu-7B-AWQ with its 2k sequence length, taking up 11.2 GB of my VRAM.
My expectation was that these 7B models with AWQ quantization (GEMM) would only need around ~3.5 GB to load for inference.
I tried to load the models from within my app using vLLM as a library, and also by following TheBloke's instructions with
python -m vllm.entrypoints.api_server --model TheBloke/tulu-7B-AWQ --quantization awq
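For reference, loading the same model through vLLM as a library looks roughly like the sketch below (the model name is taken from the thread; the dtype and sampling values are illustrative, and the exact constructor arguments depend on the vLLM version):

```python
# Minimal sketch of serving an AWQ checkpoint via the vLLM Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/tulu-7B-AWQ",
    quantization="awq",   # weights are AWQ-quantized
    dtype="half",         # AWQ kernels run in fp16
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Tell me about AWQ quantization."], params)
print(outputs[0].outputs[0].text)
```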
Am I missing something here?
Thx, Manuel
Possibly shedding some light: I am able to solve the error within AutoAWQ by setting the fuse_layers parameter to False.
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, safetensors=True, fuse_layers=False)
I tested it for both TheBloke/CodeLlama-7B-AWQ and TheBloke/tulu-7B-AWQ. Below is the full example:
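(The full example did not survive in the thread; a minimal sketch of what it might look like, following the AutoAWQ quickstart. quant_path, quant_file and the prompt are placeholders, not the original code:)

```python
# Sketch: loading an AWQ checkpoint with AutoAWQ and fuse_layers=False.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "TheBloke/tulu-7B-AWQ"   # placeholder model repo
quant_file = "model.safetensors"      # placeholder weights filename

model = AutoAWQForCausalLM.from_quantized(
    quant_path, quant_file, safetensors=True, fuse_layers=False
)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

tokens = tokenizer("The capital of France is", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```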
vLLM unfortunately does not use AutoAWQForCausalLM under the hood, so this cannot be an immediate fix. The issue seemingly lies within the awq_gemm function located in the vllm/csrc/quantization/awq/gemm_kernels.cu file.
I’m not quite sure how vLLM allocates memory. In AutoAWQ, we only allocate the cache you ask for, and it will definitely not take up 11 GB of VRAM for 512 tokens.
Hello guys,
I was able to load my fine-tuned version of mistral-7b-v0.1-awq (quantized with autoawq) on my 24 GB TITAN RTX, and it's using almost 21 GB of the 24 GB. This is huge, because using transformers with autoawq only uses 7 GB of my GPU. Does anyone know how to reduce it? Is the “solution” just to increase --max-model-len?
Notes, setting:
- --max-model-len 512 uses 22 GB
- --max-model-len 4096 uses 21 GB
- --max-model-len 8192 uses 18 GB
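As far as I understand it, these numbers are largely explained by vLLM pre-allocating GPU memory for the KV cache up front: by default it tries to claim roughly 90% of the available VRAM (the gpu_memory_utilization engine argument), independent of how small the quantized weights are. Lowering that fraction, together with --max-model-len, is the more direct knob. A minimal sketch with illustrative values:

```python
# Sketch: capping vLLM's up-front GPU memory reservation (values are illustrative).
from vllm import LLM

llm = LLM(
    model="casperhansen/mistral-7b-instruct-v0.1-awq",
    quantization="awq",
    max_model_len=4096,          # smaller context window -> smaller KV cache
    gpu_memory_utilization=0.5,  # reserve ~50% of VRAM instead of the 0.9 default
)
```

With the API server, the equivalent flags are --gpu-memory-utilization 0.5 and --max-model-len 4096.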
Yes, I can confirm that with version v0.2.1, trying to load https://huggingface.co/casperhansen/mistral-7b-instruct-v0.1-awq , I am still running OOM.
I have 24 GB of VRAM on an L4 GPU and am still getting the same OOM error.
This is a bit strange… I know vLLM (or Ray) reserves GPU memory on warm-up, but why is a larger --max-model-len more memory-efficient? Which parameter matters most for vLLM's memory usage?
I can repro this on my 14 GB Tesla T4.