AutoGPTQ: [BUG] Recent changes increase VRAM consumption
Describe the bug Somewhere between d4011d29c623e739e91b842a87fce62a38c6e538 and b4eda619d0674e9ef009702cbd538836c0861a56 the VRAM usage increased dramatically. See below.
Tomorrow I will try to pin down the exact commit that caused this; for now I'm just giving a heads-up.
Hardware details GPU: RTX 3060 12GB
Software version
- OS: Kubuntu 23.04
- Python: 3.10.11
- CUDA: 11.7
- PyTorch: 2.0.0+cu117
- transformers: 4.28.1
- accelerate: 0.19.0
To Reproduce The code to measure VRAM:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch
from transformers import AutoTokenizer
model_path = "/opt/models/vicuna-13B-1.1-GPTQ-4bit-128g"
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest",
    device="cuda:0",
    use_safetensors=True,
    use_triton=True,
    quantize_config=BaseQuantizeConfig(
        bits=4,
        group_size=128,
    ),
)
input_text = "auto_gptq is " * 210
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
tok_result = tokenizer(input_text, return_tensors="pt")
gen_result = model.generate(
    input_ids=tok_result["input_ids"].to("cuda:0"),
    max_new_tokens=10,
)
mem_mb = round(torch.cuda.max_memory_reserved(0) / 1000 / 1000)
print(f"VRAM: {mem_mb}MB")
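For cleaner commit-to-commit comparisons one could also reset the peak stats right before generate() and print both allocated and reserved memory; a minimal sketch on top of the script above (report_peak_vram is just an illustrative helper name, not part of AutoGPTQ):
import torch

def report_peak_vram(device: int = 0) -> None:
    # max_memory_allocated counts only live tensors; max_memory_reserved also
    # includes memory held back by the CUDA caching allocator.
    alloc_mb = torch.cuda.max_memory_allocated(device) / 1000 / 1000
    reserved_mb = torch.cuda.max_memory_reserved(device) / 1000 / 1000
    print(f"peak allocated: {alloc_mb:.0f}MB, peak reserved: {reserved_mb:.0f}MB")

# Usage: call torch.cuda.reset_peak_memory_stats(0) just before model.generate(),
# then report_peak_vram(0) afterwards.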
With the d4011d29c623e739e91b842a87fce62a38c6e538 (older commit) it gives:
VRAM: 9299MB
With the b4eda619d0674e9ef009702cbd538836c0861a56 (newer commit) it gives:
VRAM: 10108MB
800MB increase.
I have an RTX 3060 12GB. On the older commit I could use 16B 4-bit models with 2000 context tokens (fully on GPU); with the newest changes I can only use 1300 context tokens.
Expected behavior VRAM consumption should not increase by such a large margin.
I have now investigated the commits in between and found that several of them contributed to the VRAM consumption increase.
Here’s the whole journey from 9299MB to 10108MB (in chronological order, only commits that changed VRAM are listed):
I believe you! 😃 It's just that when you say "does not make it faster on 3090" it sounds like you are claiming this feature makes no difference on any 3090, when I know from my own testing that it does.
Also it's a bit confusing to say text-gen-ui does something when you're actually using an old fork of it, which behaves differently from the current implementation 😃
Are you using LaZaa's old text-gen-ui fork? If so, he mapped quant_attn to inject_fused_attention in AutoGPTQ, so yes, using that feature with his fork would be testing exactly what I showed a moment ago: inject_fused_attention=True/False.
Regarding VRAM:
In the past I have recorded extra VRAM usage with inject_fused_attention=True on both CUDA and Triton. I have a spreadsheet showing, e.g.:
I’ve not tested it for a week or two and those figures are for 2000 context, because I was testing max VRAM usage at the time. So it’s possible something has changed, and/or it might be that it only has an effect on longer contexts, not shorter. Not sure on that. But I have definitely seen more VRAM usage from inject_fused_attention.
And in my ‘comprehensive benchmarking’ issue a few weeks ago I had this table:
So at 512 tokens I saw a slight VRAM bump with CUDA + FA, but none with Triton + FA. There was, however, a significant VRAM bump from Triton + fused_mlp, as you say.
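One way to check whether the fused-attention overhead only appears at longer contexts would be to sweep the prompt length and record peak VRAM per run; a rough sketch reusing the model and tokenizer from the repro script above (the repeat counts are arbitrary):
import torch

# model and tokenizer loaded as in the reproduction script above
for n_repeats in (64, 210, 420):  # short, medium and longer prompts
    prompt = "auto_gptq is " * n_repeats
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda:0")
    torch.cuda.reset_peak_memory_stats(0)  # isolate each context length
    model.generate(input_ids=input_ids, max_new_tokens=10)
    peak_mb = torch.cuda.max_memory_reserved(0) / 1000 / 1000
    print(f"{input_ids.shape[1]} context tokens -> peak VRAM {peak_mb:.0f}MB")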
Yeah, with Triton we have four valid permutations of inject_fused_attention and inject_fused_mlp: True/True, True/False, False/True, and False/False.
With CUDA it’s just inject_fused_attention true/false.
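For reference, a minimal sketch of how those four permutations could be measured with from_quantized, assuming it accepts the inject_fused_attention / inject_fused_mlp keyword arguments discussed in this thread (paths and basename reuse the repro script above):
import itertools
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "/opt/models/vicuna-13B-1.1-GPTQ-4bit-128g"
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

# The four Triton permutations; with CUDA only inject_fused_attention applies.
for fused_attn, fused_mlp in itertools.product((True, False), repeat=2):
    torch.cuda.reset_peak_memory_stats(0)
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest",
        device="cuda:0",
        use_safetensors=True,
        use_triton=True,
        inject_fused_attention=fused_attn,
        inject_fused_mlp=fused_mlp,
        quantize_config=quantize_config,
    )
    # ... run the same generate() call as in the repro script here ...
    peak_mb = torch.cuda.max_memory_reserved(0) / 1000 / 1000
    print(f"fused_attn={fused_attn} fused_mlp={fused_mlp}: {peak_mb:.0f}MB")
    del model
    torch.cuda.empty_cache()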
BTW, going forward I think we should always benchmark using PanQiWei's new benchmarking tool, so we know we're on a level playing field and comparing apples to apples.
When I have some time I'll repeat some of my benchmarks using that tool. I did try it briefly the other day and confirmed I got the same ballpark figures as I was reporting before, e.g. 100 tokens/s on 7B on a 4090 with an i9-13900K, and around 30 tokens/s with the same GPU on an AMD EPYC system (CPU bottlenecked).
Currently it doesn't support disabling fused attention or fused MLP, so I did a quick PR to add that: https://github.com/PanQiWei/AutoGPTQ/pull/134
Ah yeah I need to speak to oobabooga about this.
AutoGPTQ now defaults to inject_fused_attention=True. That's great for performance, but as discussed above it will increase VRAM usage, which means 30B models on 24GB cards are going to OOM with AutoGPTQ when they were fine with GPTQ-for-LLaMa.
The issue is that text-gen-ui doesn’t provide any options to configure this.
So we need a new option in text-gen-ui to turn off "Fused Attention" so that it passes inject_fused_attention=False to AutoGPTQ. Then VRAM usage should (hopefully) be the same as with GPTQ-for-LLaMa.
Fused attention brings a significant performance boost on the modern GPUs I've tested. E.g. on a 4090 with a CPU fast enough not to bottleneck, it increases performance on 7B from 76 tokens/s to 98 tokens/s.
On a bottlenecked CPU it increased performance from 22.66 t/s to 27.22 t/s:
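Going back to the text-gen-ui option above: a hypothetical sketch of how such a flag could be passed through (load_autogptq_model and disable_fused_attention are my own illustrative names, not existing text-gen-ui code):
from auto_gptq import AutoGPTQForCausalLM

def load_autogptq_model(model_path: str, disable_fused_attention: bool = False):
    # Hypothetical loader: a "disable fused attention" checkbox in the UI would
    # set this flag, trading some speed for lower VRAM usage.
    return AutoGPTQForCausalLM.from_quantized(
        model_path,
        device="cuda:0",
        use_safetensors=True,
        inject_fused_attention=not disable_fused_attention,
    )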
For now, inject_fused_attention only affects llama and gptj model types, and inject_fused_mlp only affects llama models when using Triton. Both of them can speed up inference after the first generation, at the cost of more VRAM. Sorry for the currently weak documentation; it will be improved soon.