vllm: Running out of memory with TheBloke/CodeLlama-7B-AWQ

Test on llm-vscode-inference-server

I use the llm-vscode-inference-server project, which builds on vLLM, to load the CodeLlama-7B-AWQ model weights with the following command:

python api_server.py --trust-remote-code --model ../CodeLlama-7B-AWQ --quantization awq --dtype half --max-model-len 512

And the output is:

WARNING 10-26 12:34:54 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:34:54 llm_engine.py:72] Initializing an LLM engine with config: model='../CodeLlama-7B-AWQ', tokenizer='../CodeLlama-7B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:34:54 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.

Then, after about 5 minutes, it outputs:

INFO 10-26 12:39:51 llm_engine.py:207] # GPU blocks: 793, # CPU blocks: 512
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 12.00 GiB total capacity; 8.49 GiB already allocated; 1.53 GiB free; 8.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I set PYTORCH_CUDA_ALLOC_CONF before executing the run command above, but I still got the error:

 set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:100
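
For reference, the set form above is Windows cmd syntax; on a Linux/macOS shell the equivalent, assuming bash, would be:

 export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:100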

Test on vllm

I simply change the command to:

python3 -m vllm.entrypoints.openai.api_server --model TheBloke/CodeLlama-7B-Python-AWQ --quantization awq --dtype half

Without --dtype half, it raises an error like:

ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

With --dtype half, the output is:

WARNING 10-26 12:44:31 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:44:31 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-7B-AWQ/', tokenizer='./CodeLlama-7B-AWQ/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:44:31 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.

and then the error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.38 GiB (GPU 0; 12.00 GiB total capacity; 4.17 GiB already allocated; 5.94 GiB free; 4.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

System Resource Usage

Before I execute the command, my RTX 3060 VRAM usage is 1.5/12 GB. After executing it, usage rises to 6.0/12 GB, and then it throws the OutOfMemoryError after about 5 minutes.

Question

I'm just confused why the AWQ model is only <=4 GB in size but cannot run on an NVIDIA RTX 3060 with 12 GB of VRAM…

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 20 (2 by maintainers)

Most upvoted comments

When testing on vLLM, did you try --max-model-len 512? From your output, it looks like it went to 16384.

For those with very limited VRAM, try setting the maximum number of batched tokens to about 4-8k and combine it with the GPU memory limit parameter set to about 0.8.

Also, try the non-quantised version with this first. It seems vLLM uses extra memory to do some kind of operation on the model when loading it quantised.
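
Putting those suggestions together, a rough sketch of the command (flag names as in vLLM 0.2.x; the "memory limit parameter" should be --gpu-memory-utilization if I'm not mistaken, and the values here are only illustrative):

python -m vllm.entrypoints.openai.api_server \
        --model TheBloke/CodeLlama-7B-AWQ \
        --quantization awq \
        --dtype half \
        --max-model-len 512 \
        --max-num-batched-tokens 8192 \
        --gpu-memory-utilization 0.8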

@SupreethRao99 Sorry, I had to find it. This is how I start my vLLM OpenAI server:

python -m vllm.entrypoints.openai.api_server \
        --served-model $MODEL_ID \
        --model $MODEL_ID \
        --tensor-parallel-size 4 \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768

When I reduced max-num-batched-tokens to 32768 from the much higher number I had previously, I no longer experienced CUDA memory errors. Try setting yours low as well and see if it helps.

I used the command you provided, but it still costs 21 GB of VRAM when loading a 7B AWQ model 😦

python -m vllm.entrypoints.openai.api_server \
        --model $MODEL_ID \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768 \
        -q awq --dtype half --trust-remote-code
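
For what it's worth, vLLM preallocates KV-cache blocks up to its GPU memory utilization target (0.9 by default, if I recall correctly), so high VRAM usage right after loading is expected even for a small model. A sketch that caps both the context length and the memory target (same flags as above, values illustrative) might look like:

python -m vllm.entrypoints.openai.api_server \
        --model $MODEL_ID \
        --host 0.0.0.0 \
        --port 8080 \
        --max-model-len 512 \
        --max-num-batched-tokens 8192 \
        --gpu-memory-utilization 0.8 \
        -q awq --dtype half --trust-remote-code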

I think I found a potential issue and solution. This is specifically because of how vLLM works.

Setting max-num-batched-tokens (I think that's the name) too high causes the KV cache to be too big. It directly influences the GPU memory occupied for some reason. Try setting max-num-batched-tokens to around 32k while keeping everything else the same.

This fixed it for me.
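
As a rough back-of-the-envelope check of why the token budget matters so much (assuming CodeLlama-7B has the standard Llama-2 7B shape: 32 layers, hidden size 4096, fp16 KV cache):

 KV cache per token ≈ 2 (K and V) × 32 layers × 4096 × 2 bytes ≈ 0.5 MiB
 at 16384 tokens → 16384 × 0.5 MiB ≈ 8 GiB of KV cache
 at 512 tokens   → 512 × 0.5 MiB ≈ 0.25 GiB

So the ~4 GB of AWQ weights fit a 12 GB card easily, but a 16k context or a very large batched-token budget does not.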

Yes, the --max-model-len flag is passed from the CLI but converted into max_seq_len in the vLLM API. You can verify and try it on your vLLM. I've tried 512 many times, but it won't work ...
---- Replied Message ----

Isn't it --max_model_len, or am I mistaken? Btw, the 7B model should definitely fit into a 512-token context.
