vllm: Running out of memory loading 7B AWQ quantized models with 12 GB VRAM

Hi,

I am trying to make use of AWQ quantization to load 7B Llama-based models onto my RTX 3060 with 12 GB. This fails with OOM for models like https://huggingface.co/TheBloke/leo-hessianai-7B-AWQ . I was able to load https://huggingface.co/TheBloke/tulu-7B-AWQ with its 2k sequence length, which took up 11.2 GB of my VRAM.

My expectation was that these 7B models quantized with AWQ (GEMM) would only need around ~3.5 GB to load for inference.

I tried to load the models from within my app using vLLM as a library (a sketch of that is below), and also by following TheBloke's instructions with

python -m vllm.entrypoints.api_server --model TheBloke/tulu-7B-AWQ --quantization awq
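
For reference, loading the model from inside the app looks roughly like this for me. This is only a minimal sketch of the vLLM library API (the prompt and sampling values are placeholders, not what my app actually uses):

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint through vLLM's offline inference API.
llm = LLM(
    model="TheBloke/tulu-7B-AWQ",
    quantization="awq",
    dtype="half",
)

# Illustrative generation call; prompt and max_tokens are placeholders.
outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)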

Am I missing something here?

Thx, Manuel

About this issue

  • State: open
  • Created 9 months ago
  • Reactions: 5
  • Comments: 22 (5 by maintainers)

Most upvoted comments

Possibly shedding some light: I am able to solve the error within AutoAWQ by setting the fuse_layers parameter to False.

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, safetensors=True, fuse_layers=False)

I tested it for both TheBloke/CodeLlama-7B-AWQ and TheBloke/tulu-7B-AWQ. Below is the full example:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "TheBloke/tulu-7B-AWQ"
quant_file = "model.safetensors"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, safetensors=True, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)

vLLM unfortunately does not use AutoAWQForCausalLM under the hood, so this cannot be an immediate fix. The issue seemingly lies within the awq_gemm function located in the vllm/csrc/quantization/awq/gemm_kernels.cu file.

I’m not quite sure how vLLM allocates memory. In AutoAWQ, we only allocate the cache you ask for and it will definitely not take up 11GB VRAM for 512 tokens.
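
To put a rough number on that (my back-of-the-envelope math, assuming a Llama-7B-style model with 32 layers, 32 KV heads, head_dim 128 and an fp16 cache):

# Back-of-the-envelope KV-cache size for 512 tokens on a Llama-7B-style model.
# Assumed config: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
layers, kv_heads, head_dim, fp16_bytes = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K and V
print(per_token * 512 / 2**20, "MiB")  # ~256 MiB -- nowhere near 11 GB

So the 11 GB presumably comes from vLLM preallocating KV-cache blocks up to its gpu_memory_utilization target (0.9 by default), not from the 512-token request itself.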

Hello guys,

I was able to load my fine-tuned version of mistral-7b-v0.1-awq (quantized with AutoAWQ) on my 24 GB TITAN RTX, and it's using almost 21 GB of the 24 GB. This is huge, because using transformers with AutoAWQ only takes 7 GB of my GPU. Does anyone know how to reduce it? Is the "solution" just to increase --max-model-len?

Notes on setting:

  • --max-model-len 512 uses 22 GB
  • --max-model-len 4096 uses 21 GB
  • --max-model-len 8192 uses 18 GB
$ CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server --model "./models/mistral-7b-v0.1-awq" --quantization awq --dtype half --max-model-len 4096
INFO 11-16 06:50:58 api_server.py:615] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, model='./models/mistral-7b-v0.1-awq', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 11-16 06:50:58 llm_engine.py:72] Initializing an LLM engine with config: model='./models/mistral-7b-v0.1-awq', tokenizer='./models/mistral-7b-v0.1-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 11-16 06:51:09 llm_engine.py:207] # GPU blocks: 7724, # CPU blocks: 2048
INFO:     Started server process [1027898]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


$ nvidia-smi
Thu Nov 16 06:53:03 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN V                  Off| 00000000:09:00.0 Off |                  N/A |
| 36%   52C    P8               28W / 250W|      0MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX                Off| 00000000:42:00.0 Off |                  N/A |
| 40%   47C    P8               21W / 280W|  20899MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1   N/A  N/A   1027898      C   python3                                   20888MiB |
+---------------------------------------------------------------------------------------+
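
For what it's worth, the "# GPU blocks: 7724" line in the log above seems to account for most of those 21 GB. A rough calculation, assuming Mistral-7B's config (32 layers, 8 KV heads, head_dim 128), vLLM's default block_size=16 and an fp16 cache:

# Rough size of the KV cache vLLM preallocated, based on the log above.
# Assumed Mistral-7B config: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
gpu_blocks, block_size = 7724, 16
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K and V
cache_gib = gpu_blocks * block_size * per_token / 2**30
print(round(cache_gib, 1), "GiB")  # ~15.1 GiB of KV cache alone

Add roughly 4 GB of AWQ weights plus some overhead and you land close to the 20.9 GB that nvidia-smi reports. The amount preallocated is governed by --gpu-memory-utilization (0.9 by default), so lowering that, rather than raising --max-model-len, is probably the more direct way to shrink the footprint.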

Yes, I can confirm that with version v0.2.1, trying to load https://huggingface.co/casperhansen/mistral-7b-instruct-v0.1-awq, I am still running into OOM with

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 GiB (GPU 0; 11.76 GiB total capacity; 4.87 GiB already allocated; 5.42 GiB free; 5.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
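
If that 14 GiB allocation happens while vLLM profiles at the model's full context length (Mistral v0.1 advertises 32k positions), capping the context might help on a 12 GB card. A hedged sketch, not verified on this exact setup:

python -m vllm.entrypoints.api_server \
    --model casperhansen/mistral-7b-instruct-v0.1-awq \
    --quantization awq --dtype half \
    --max-model-len 4096 --gpu-memory-utilization 0.85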

I have 24 GB of VRAM on an L4 GPU and am still getting the same OOM error.

This is a bit strange… I know vLLM (or Ray) reserves GPU memory on warm-up, but why is a larger --max-model-len more memory-efficient? Which parameter has the biggest effect for vLLM?

I can reproduce this on my 14 GB Tesla T4:

  • max-model-len 2k uses 14 GB
  • max-model-len 4k uses 12 GB
  • max-model-len 8k uses 10 GB
  • max-model-len 16k OOMs
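
My understanding of the counterintuitive trend (hedged; this is my reading of vLLM's behaviour, not something I verified in the code): vLLM first runs a profiling forward pass sized by the maximum batch/sequence length, measures the peak, and then fills GPU memory up to gpu_memory_utilization with KV-cache blocks. A larger --max-model-len means a larger profiling peak, hence fewer cache blocks, and once the profiling activations are freed the resident usage ends up lower. A toy illustration of that accounting:

# Toy accounting for why a larger --max-model-len can show *lower* resident memory.
# All numbers are illustrative assumptions (16 GiB card, 0.9 utilization, ~3.9 GiB AWQ weights).
total_gib, util, weights_gib = 16.0, 0.9, 3.9
for profile_peak_gib in (1.0, 2.5, 4.5):  # activation peak grows with --max-model-len
    kv_cache_gib = util * total_gib - weights_gib - profile_peak_gib
    resident_gib = weights_gib + kv_cache_gib  # profiling activations are freed afterwards
    print(f"profile peak {profile_peak_gib} GiB -> ~{resident_gib:.1f} GiB resident")

which loosely tracks the 14/12/10 GB pattern above; and at 16k the profiling pass itself would no longer fit, which could explain the OOM.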