AutoGPTQ: [BUG] Recent changes increase VRAM consumption

Describe the bug Somewhere between d4011d29c623e739e91b842a87fce62a38c6e538 and b4eda619d0674e9ef009702cbd538836c0861a56 the VRAM usage increased dramatically. See below.

Tomorrow I will try to pin down the exact commit that caused this, but for now I’m just giving a heads-up.

Hardware details GPU: RTX 3060 12GB

Software version

  • OS: Kubuntu 23.04
  • Python: 3.10.11
  • CUDA: 11.7
  • PyTorch: 2.0.0+cu117
  • transformers: 4.28.1
  • accelerate: 0.19.0

To Reproduce The code to measure VRAM:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch
from transformers import AutoTokenizer

model_path = "/opt/models/vicuna-13B-1.1-GPTQ-4bit-128g"

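# Load the pre-quantized model with Triton kernels; fused-layer options are left at their defaults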
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest",
    device="cuda:0",
    use_safetensors=True,
    use_triton=True,
    quantize_config=BaseQuantizeConfig(
        bits=4,
        group_size=128,
    )
)

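# Build a long prompt, tokenize it, and generate a few new tokens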
input_text = "auto_gptq is " * 210
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
tok_result = tokenizer(input_text, return_tensors="pt")
gen_result = model.generate(
    input_ids=tok_result["input_ids"].to("cuda:0"),
    max_new_tokens=10
)

# Peak reserved VRAM for the whole run (note: reported in MB, not GB)
mem_mb = round(torch.cuda.max_memory_reserved(0) / 1000 / 1000)
print(f"VRAM: {mem_mb}MB")

With the older commit (d4011d29c623e739e91b842a87fce62a38c6e538) it gives:

VRAM: 9299MB

With the newer commit (b4eda619d0674e9ef009702cbd538836c0861a56) it gives:

VRAM: 10108MB

That is roughly an 800MB increase.

I have an RTX 3060 12GB. On the older commit I could use 16B-4bit models with 2000 context tokens (fully on GPU). With the newest changes I can only use 1300 context tokens.

Expected behavior VRAM consumption should not increase by such a large margin.

About this issue

  • State: open
  • Created a year ago
  • Comments: 21 (16 by maintainers)

Most upvoted comments

I have now investigated the commits in between and found that several of them contributed to the VRAM consumption increase.

Here’s the whole journey from 9299MB to 10108MB (in chronological order; only commits that changed VRAM are listed, and a sketch of how to reproduce these per-commit measurements follows the list):

  • c6395936 - 9299MB
  • 6476ee42 - 9974MB
  • 191da814 - 11035MB
  • e2e7809a - 11169MB
  • 10347fdd - 10108MB
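For reference, here is a minimal sketch of how per-commit numbers like these can be reproduced: check out each commit, reinstall AutoGPTQ, and run the measurement script from the top of this issue in a fresh process. The repository path and the measure_vram.py filename are assumptions for illustration, not taken from the original report.

import subprocess

# Assumed locations (not from the report): the AutoGPTQ checkout and the
# measurement script from the top of this issue saved as measure_vram.py.
AUTOGPTQ_REPO = "/opt/src/AutoGPTQ"
COMMITS = ["c6395936", "6476ee42", "191da814", "e2e7809a", "10347fdd"]

def measure(commit: str) -> str:
    # Check out the commit and reinstall so the compiled kernels match it.
    subprocess.run(["git", "checkout", commit], cwd=AUTOGPTQ_REPO, check=True)
    subprocess.run(["pip", "install", "."], cwd=AUTOGPTQ_REPO, check=True)
    # Run the measurement in a fresh interpreter so CUDA state starts clean.
    result = subprocess.run(["python", "measure_vram.py"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()  # e.g. "VRAM: 9299MB"

for commit in COMMITS:
    print(commit, measure(commit))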

I’ve had a fork of textgen forever that already has these parameters, as well as the FP16 disabling. I’m not sure what happens on RunPod, but I have a Xeon v4 and do not get the extra 4 it/s from fused MLP. I also don’t get anything extra on my Zen 1 1700X and P6000.

Not sure why you don’t believe me; I tested it the same way you did, on the hardware I have.

I believe you! 😃 It’s just that when you say “does not make it faster on 3090”, it sounds like you are claiming this feature makes no difference on any 3090, when I know from my own testing that it does.

Also, it’s a bit confusing when you say text-gen-ui does something while you’re actually using an old fork of it that behaves differently from the current implementation 😃

inject_fused_attention is quant_attn… and that does give a speedup, which seems tied to memory bandwidth or the processor, as you noticed on RunPod. But quant_attn doesn’t make the model use more memory; in fact it uses LESS. I can squeeze in a handful more tokens with it on, and I get a slight speed increase, as I mentioned before.

Are you using LaZaa’s old text-gen-ui fork? Because if so, he mapped quant_attn to inject_fused_attention in AutoGPTQ. So yes, using that feature with his fork would be testing exactly what I showed a moment ago, with inject_fused_attention=True/False.

Regarding VRAM:

In the past I have recorded extra VRAM usage with inject_fused_attention=True on both CUDA and Triton. I have a spreadsheet showing it, e.g.: [spreadsheet screenshot]

I’ve not tested it for a week or two and those figures are for 2000 context, because I was testing max VRAM usage at the time. So it’s possible something has changed, and/or it might be that it only has an effect on longer contexts, not shorter. Not sure on that. But I have definitely seen more VRAM usage from inject_fused_attention.

And in my ‘comprehensive benchmarking’ issue a few weeks ago I had this table:

[benchmark table screenshot]

So at 512 tokens I saw a slight VRAM bump with CUDA + FA, but none with Triton + FA. But there was a significant VRAM bump from Triton + fused_mlp, as you say.

The culprit is inject_fused_mlp, which eats up VRAM when I use it with Triton. I also thought both could be enabled with Triton and used together; the OP is using Triton. In the alpaca_lora_4bit/autograd implementation they are also separate functions.

Yeah with Triton we have four valid permutations of inject_fused_attention and inject_fused_mlp:

  • no FA / no MLP
  • FA on / MLP off
  • MLP on / FA off
  • both on

With CUDA it’s just inject_fused_attention true/false.
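For concreteness, here is a minimal sketch of those permutations expressed as from_quantized() keyword arguments, reusing the model from the report. Treat it as a sketch under those assumptions, not a definitive recipe.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

def load(fused_attention: bool, fused_mlp: bool):
    # The two fused-layer switches discussed in this thread; everything else
    # matches the loading code from the original report.
    return AutoGPTQForCausalLM.from_quantized(
        "/opt/models/vicuna-13B-1.1-GPTQ-4bit-128g",
        model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest",
        device="cuda:0",
        use_safetensors=True,
        use_triton=True,
        inject_fused_attention=fused_attention,  # FA on/off
        inject_fused_mlp=fused_mlp,              # fused MLP on/off (Triton only)
        quantize_config=BaseQuantizeConfig(bits=4, group_size=128),
    )

# The four valid Triton permutations:
# load(False, False), load(True, False), load(False, True), load(True, True)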

BTW, in future I think we should always benchmark using PanQiWei’s new benchmarking tool, so we know we’re on a level playing field and comparing apples to apples.

When I have some time I’ll repeat some of my benchmarks using that tool. I did try it briefly the other day and confirmed I got the same ballpark as I was reporting before, e.g. I got 100 tokens/s on 7B on a 4090 with an i9-13900K, and around 30 tokens/s with the same GPU on an AMD EPYC CPU (CPU bottlenecked).

Currently it doesn’t support disabling fused attention or fused mlp so I did a quick PR to add that: https://github.com/PanQiWei/AutoGPTQ/pull/134

Ah yeah I need to speak to oobabooga about this.

AutoGPTQ now defaults to inject_fused_attention=True. That’s great for performance, but as discussed above it will increase VRAM usage, which means 30B models on 24GB cards are going to OOM with AutoGPTQ when they were fine with GPTQ-for-LLaMa.

The issue is that text-gen-ui doesn’t provide any options to configure this.

So we need a new option in text-gen-ui to turn off “Fused Attention” so that it passes inject_fused_attention=False to AutoGPTQ. Then VRAM usage should hopefully be the same as with GPTQ-for-LLaMa.
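A minimal sketch of what that wiring could look like on the loader side, assuming a hypothetical --no_fused_attention switch; only the inject_fused_attention keyword itself comes from this thread.

import argparse

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Hypothetical UI/CLI switch; its value is simply forwarded to AutoGPTQ.
parser = argparse.ArgumentParser()
parser.add_argument("--no_fused_attention", action="store_true",
                    help="pass inject_fused_attention=False to AutoGPTQ")
args = parser.parse_args()

model = AutoGPTQForCausalLM.from_quantized(
    "/opt/models/vicuna-13B-1.1-GPTQ-4bit-128g",  # model path from the report
    model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest",
    device="cuda:0",
    use_safetensors=True,
    use_triton=True,
    inject_fused_attention=not args.no_fused_attention,
    quantize_config=BaseQuantizeConfig(bits=4, group_size=128),
)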

Fused attention brings a significant performance boost on the modern GPUs I’ve tested. E.g. on a 4090 with a CPU fast enough not to bottleneck it, it increases performance on 7B from 76 tokens/s to 98 tokens/s:

[benchmark screenshot]

On a bottlenecked CPU it increased performance from 22.66 t/s to 27.22 t/s: [benchmark screenshot]

I have no idea what these params mean (they don’t seem to be documented).

For now, inject_fused_attention only affects llama and gptj model types, and inject_fused_mlp only affects llama models when using Triton. Both can speed up inference after the first generation, at the cost of more VRAM. Sorry for the currently weak documentation; it will be improved soon.
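Putting that together with the measurement script from the top of this issue, here is a minimal sketch for comparing the VRAM cost of the fused paths. The model path and prompt come from the report; the comparison loop itself is only an illustration, and running each configuration in a fresh process would give cleaner numbers.

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "/opt/models/vicuna-13B-1.1-GPTQ-4bit-128g"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
input_ids = tokenizer("auto_gptq is " * 210, return_tensors="pt")["input_ids"].to("cuda:0")

for fused_attention, fused_mlp in [(False, False), (True, False), (False, True), (True, True)]:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(0)
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest",
        device="cuda:0",
        use_safetensors=True,
        use_triton=True,
        inject_fused_attention=fused_attention,
        inject_fused_mlp=fused_mlp,
        quantize_config=BaseQuantizeConfig(bits=4, group_size=128),
    )
    model.generate(input_ids=input_ids, max_new_tokens=10)
    peak_mb = round(torch.cuda.max_memory_reserved(0) / 1000 / 1000)
    print(f"FA={fused_attention} MLP={fused_mlp}: {peak_mb}MB")
    del model  # free the model before loading the next configuration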