GPTQ-for-LLaMa: 4-bit is 10x slower compared to fp16 LLaMa
On my setup the stock 16-bit 7B LLaMa model runs at 0.6 s per iteration with a 1x2048 input, while the 4-bit quantized model runs at 8.3 s per iteration. That makes the 4-bit version more than 10x slower (about 14x) than the unquantized model. Is that normal?
Setup:
- GPTQ-for-LLaMa at commit 19c0535d792d7e388e7fe799f8cfa350ce74fa9a
- RTX 3090, driver version 515.86.01, CUDA version 11.7
- 7B model quantized on c4 with `--wbits 4 --true-sequential --act-order`
- Environment and code listed below
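For reference, the per-iteration numbers above come from timing plain forward passes. The original benchmark loop isn't included in this issue, so the following is only a minimal sketch of how such a measurement might look, assuming `model` and a 1x2048 `input_ids` tensor are already on the GPU:

```python
import time
import torch

def time_forward(model, input_ids, iters: int = 10) -> float:
    """Average seconds per forward pass; synchronizes so GPU work is counted."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            model(input_ids)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```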
Environment
name: llama
channels:
- pytorch
- nvidia
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- asttokens=2.2.1=pyhd8ed1ab_0
- backcall=0.2.0=pyh9f0ad1d_0
- backports=1.0=pyhd8ed1ab_3
- backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
- blas=1.0=mkl
- brotlipy=0.7.0=py310h7f8727e_1002
- bzip2=1.0.8=h7b6447c_0
- ca-certificates=2022.12.7=ha878542_0
- certifi=2022.12.7=pyhd8ed1ab_0
- cffi=1.15.1=py310h5eee18b_3
- charset-normalizer=2.0.4=pyhd3eb1b0_0
- cryptography=39.0.1=py310h9ce1e76_0
- cuda-cudart=11.7.99=0
- cuda-cupti=11.7.101=0
- cuda-libraries=11.7.1=0
- cuda-nvrtc=11.7.99=0
- cuda-nvtx=11.7.91=0
- cuda-runtime=11.7.1=0
- cudatoolkit-dev=11.7.0=h1de0b5d_6
- debugpy=1.5.1=py310h295c915_0
- decorator=5.1.1=pyhd8ed1ab_0
- entrypoints=0.4=pyhd8ed1ab_0
- executing=1.2.0=pyhd8ed1ab_0
- ffmpeg=4.3=hf484d3e_0
- filelock=3.9.0=py310h06a4308_0
- flit-core=3.8.0=py310h06a4308_0
- freetype=2.12.1=h4a9f257_0
- giflib=5.2.1=h5eee18b_3
- gmp=6.2.1=h295c915_3
- gmpy2=2.1.2=py310heeb90bb_0
- gnutls=3.6.15=he1e5248_0
- idna=3.4=py310h06a4308_0
- intel-openmp=2021.4.0=h06a4308_3561
- ipykernel=6.15.0=pyh210e3f2_0
- ipython=8.11.0=pyh41d4057_0
- jedi=0.18.2=pyhd8ed1ab_0
- jinja2=3.1.2=py310h06a4308_0
- jpeg=9e=h5eee18b_1
- jupyter_client=7.3.4=pyhd8ed1ab_0
- jupyter_core=4.12.0=py310hff52083_0
- lame=3.100=h7b6447c_0
- lcms2=2.12=h3be6417_0
- ld_impl_linux-64=2.38=h1181459_1
- lerc=3.0=h295c915_0
- libcublas=11.10.3.66=0
- libcufft=10.7.2.124=h4fbf590_0
- libcufile=1.6.0.25=0
- libcurand=10.3.2.56=0
- libcusolver=11.4.0.1=0
- libcusparse=11.7.4.91=0
- libdeflate=1.17=h5eee18b_0
- libffi=3.4.2=h6a678d5_6
- libgcc-ng=11.2.0=h1234567_1
- libgomp=11.2.0=h1234567_1
- libiconv=1.16=h7f8727e_2
- libidn2=2.3.2=h7f8727e_0
- libnpp=11.7.4.75=0
- libnvjpeg=11.8.0.2=0
- libpng=1.6.39=h5eee18b_0
- libsodium=1.0.18=h36c2ea0_1
- libstdcxx-ng=11.2.0=h1234567_1
- libtasn1=4.16.0=h27cfd23_0
- libtiff=4.5.0=h6a678d5_2
- libunistring=0.9.10=h27cfd23_0
- libuuid=1.41.5=h5eee18b_0
- libwebp=1.2.4=h11a3e52_1
- libwebp-base=1.2.4=h5eee18b_1
- lz4-c=1.9.4=h6a678d5_0
- markupsafe=2.1.1=py310h7f8727e_0
- matplotlib-inline=0.1.6=pyhd8ed1ab_0
- mkl=2021.4.0=h06a4308_640
- mkl-service=2.4.0=py310h7f8727e_0
- mkl_fft=1.3.1=py310hd6ae3a3_0
- mkl_random=1.2.2=py310h00e6091_0
- mpc=1.1.0=h10f8cd9_1
- mpfr=4.0.2=hb69a4c5_1
- ncurses=6.4=h6a678d5_0
- nest-asyncio=1.5.6=pyhd8ed1ab_0
- nettle=3.7.3=hbbd107a_1
- networkx=2.8.4=py310h06a4308_1
- numpy=1.23.5=py310hd5efca6_0
- numpy-base=1.23.5=py310h8e6c178_0
- openh264=2.1.1=h4ff587b_0
- openssl=1.1.1t=h7f8727e_0
- packaging=23.0=pyhd8ed1ab_0
- parso=0.8.3=pyhd8ed1ab_0
- pexpect=4.8.0=pyh1a96a4e_2
- pickleshare=0.7.5=py_1003
- pillow=9.4.0=py310h6a678d5_0
- pip=23.0.1=py310h06a4308_0
- prompt-toolkit=3.0.38=pyha770c72_0
- prompt_toolkit=3.0.38=hd8ed1ab_0
- ptyprocess=0.7.0=pyhd3deb0d_0
- pure_eval=0.2.2=pyhd8ed1ab_0
- pycparser=2.21=pyhd3eb1b0_0
- pygments=2.14.0=pyhd8ed1ab_0
- pyopenssl=23.0.0=py310h06a4308_0
- pysocks=1.7.1=py310h06a4308_0
- python=3.10.10=h7a1cb2a_2
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python_abi=3.10=2_cp310
- pytorch=2.0.0=py3.10_cuda11.7_cudnn8.5.0_0
- pytorch-cuda=11.7=h778d358_3
- pytorch-mutex=1.0=cuda
- pyzmq=23.2.0=py310h6a678d5_0
- readline=8.2=h5eee18b_0
- requests=2.28.1=py310h06a4308_1
- setuptools=65.6.3=py310h06a4308_0
- six=1.16.0=pyhd3eb1b0_1
- sqlite=3.41.1=h5eee18b_0
- stack_data=0.6.2=pyhd8ed1ab_0
- sympy=1.11.1=py310h06a4308_0
- tk=8.6.12=h1ccaba5_0
- torchaudio=2.0.0=py310_cu117
- torchtriton=2.0.0=py310
- torchvision=0.15.0=py310_cu117
- tornado=6.1=py310h5764c6d_3
- traitlets=5.9.0=pyhd8ed1ab_0
- typing_extensions=4.4.0=py310h06a4308_0
- tzdata=2022g=h04d1e81_0
- urllib3=1.26.14=py310h06a4308_0
- wcwidth=0.2.6=pyhd8ed1ab_0
- wheel=0.38.4=py310h06a4308_0
- xz=5.2.10=h5eee18b_1
- zeromq=4.3.4=h9c3ff4c_1
- zlib=1.2.13=h5eee18b_0
- zstd=1.5.2=ha4553b6_0
- pip:
- accelerate==0.17.1
- aiohttp==3.8.4
- aiosignal==1.3.1
- async-timeout==4.0.2
- attrs==22.2.0
- datasets==2.10.1
- dill==0.3.6
- frozenlist==1.3.3
- fsspec==2023.3.0
- huggingface-hub==0.13.3
- mpmath==1.2.1
- multidict==6.0.4
- multiprocess==0.70.14
- pandas==1.5.3
- psutil==5.9.4
- pyarrow==11.0.0
- pytz==2022.7.1
- pyyaml==6.0
- quant-cuda==0.0.0
- regex==2023.3.22
- responses==0.18.0
- safetensors==0.3.0
- sentencepiece==0.1.97
- tokenizers==0.13.2
- tqdm==4.65.0
- transformers==4.28.0.dev0
- xxhash==3.2.0
- yarl==1.8.2
Code
#!/usr/bin/env python3
import argparse
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from modelutils import find_layers
from quant import make_quant
from tqdm import tqdm
from transformers import AutoTokenizer, LlamaConfig, LlamaForCausalLM
from llama import get_llama
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, help='Path to HuggingFace model')
parser.add_argument('--quant', type=str, help='Path to quantized model')
parser.add_argument('--stride', type=int, default=512, help='Stride for calculating perplexity')
parser.add_argument('--wbits', type=int, default=4, help='Number of bits for weights')
parser.add_argument('--groupsize', type=int, default=-1, help='Groupsize used during quantization')
def main():
    global args  # Hack for load_quant
    args = parser.parse_args()

    if not args.quant:
        model = get_llama(args.model)
        model.eval()
        model.to('cuda')
    else:
        model = load_quant(args.model, args.quant, args.wbits, args.groupsize)
        model.eval()
        model.to('cuda')

    tokenizer = AutoTokenizer.from_pretrained(args.model)

    for dataset in ['wikitext-2', 'ptb', 'c4']:
        ppl = calculate_perplexity(model, tokenizer, dataset, max_length=model.seqlen, stride=args.stride)
        print(f"{dataset} perplexity: {ppl}")
# NOTE: Have to modify this to work around the usage of `args` in the original...
def load_quant(model, checkpoint, wbits, groupsize):
    config = LlamaConfig.from_pretrained(model)

    def noop(*args, **kwargs):
        pass

    # Skip the (slow) random weight initialization; real weights are loaded from the checkpoint below.
    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop

    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    model = LlamaForCausalLM(config)
    torch.set_default_dtype(torch.float)
    model = model.eval()

    # Replace every Linear layer (except lm_head) with its quantized counterpart.
    layers = find_layers(model)
    for name in ['lm_head']:
        if name in layers:
            del layers[name]
    make_quant(model, layers, wbits, groupsize, faster=False)
    del layers

    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint))
    else:
        model.load_state_dict(torch.load(checkpoint))
    model.seqlen = 2048
    print('Done.')
    return model
def get_dataset(dataset_name: str, tokenizer) -> torch.Tensor:
    if dataset_name == "wikitext-2":
        test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
        encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
    elif dataset_name == 'ptb':
        test = load_dataset("ptb_text_only", 'penn_treebank', split="validation")
        encodings = tokenizer("\n\n".join(test["sentence"]), return_tensors="pt").input_ids
    elif dataset_name == 'c4':
        # WARNING: Many of the files in the allenai/c4 repo are marked as "Unsafe" by HuggingFace, possibly containing a virus. This particular file is not, and I doubt it's an issue, but worth noting.
        test = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
        encodings = [tokenizer(x, return_tensors="pt").input_ids for x in test['text'][:1000]]
        encodings = torch.cat(encodings, dim=1)
    else:
        raise ValueError(f"Unknown dataset {dataset_name}")

    return encodings
def calculate_perplexity(model, tokenizer, dataset: str, max_length: int, stride: int = 512) -> float:
    encodings = get_dataset(dataset, tokenizer)
    seq_len = encodings.size(1)
    print(f"Sequence length: {seq_len}")
    print(f"Max length: {max_length}")
    print(f"Stride: {stride}")

    nlls = []
    prev_end_loc = 0
    for begin_loc in tqdm(range(0, seq_len - 1, stride)):
        end_loc = min(seq_len - 1, begin_loc + max_length)
        trg_len = end_loc - prev_end_loc  # How many tokens we want to predict
        input_ids = encodings[:, begin_loc:end_loc+1].to('cuda')  # +1 for the labels

        with torch.no_grad():
            # Ask the model for logits
            outputs = model(input_ids[:, :-1])
            # We only want the last trg_len logits
            logits = outputs.logits[..., -trg_len:, :].contiguous()
            # The last trg_len tokens are the labels
            labels = input_ids[:, -trg_len:].contiguous()
            # Compute the NLL for this batch using flattened logits and labels
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))

        nlls.append(loss)
        prev_end_loc = end_loc
        if end_loc == (seq_len - 1):
            break

    ppl = torch.exp(torch.stack(nlls).mean())
    return ppl
if __name__ == '__main__':
    main()
About this issue
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 27 (12 by maintainers)
I’m working on a kernel implementation in Triton. My hope is to lean on Triton’s ability to optimize to the hardware on the fly, as well as implement the matmul kernel in a more cache optimal way versus the current CUDA kernel.
So far I have a working kernel, though I haven't fully verified accuracy. Its performance curve is a lot better: not as good as FP16 PyTorch yet, but at least in the ballpark now, and it scales correctly with context length. I've included the code below. WIP: I still need to more thoroughly evaluate correctness. As of right now I'm seeing an absolute error of 0.0039 on 256x4096 test vectors relative to fp16 simulation.

The one major snag I've hit is that Triton doesn't seem to have a way of expanding a tensor, i.e. something similar to PyTorch's `repeat_interleave`. That means I can't fetch the quantized weights of size `[K//8, N]` and then unpack them in SRAM into `[K, N]`. The hack for now is to configure the fetch like a `repeat_interleave`, but that loses a lot of performance since it does 8x the loads compared to optimal. I think the kernel would run faster than PyTorch if I could fix this; bandwidth tends to be the major performance bottleneck.

I have the same problem on my RTX 3080, driver 530.41.03, CUDA 11.7. The performance of the 4-bit quantized models is very slow with large contexts. In fact, it seems that it is (much!) faster to unpack the layer weights on the fly and use a standard PyTorch matmul when sufficiently large matrices are involved. Here's a hackish implementation of QuantLinear that falls back to PyTorch when the context size becomes larger:
MasterTaffer/GPTQ-for-LLaMa@b46c976e28af3823ef5a6f780462beaf5ad906ac
On my test setup (4-bit 13B LLaMa, generating 20 tokens with 2000 context tokens), inference speed is improved by ~5x or so.
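For readers who don't want to dig through the linked commit, here is a minimal, hypothetical sketch of that fallback idea. This is not the code in the commit; it assumes a simplified layout of eight 4-bit values packed per int32 along the input (K) dimension, per-output-channel scales/zeros, groupsize -1, and a placeholder `quant_cuda_vecmat` standing in for the custom kernel:

```python
import torch
import torch.nn as nn

class FallbackQuantLinear(nn.Module):
    """Sketch: quantized linear layer that falls back to a plain fp16 matmul
    (after dequantizing on the fly) once the context is large enough."""

    def __init__(self, infeatures, outfeatures, fallback_tokens=128):
        super().__init__()
        self.infeatures = infeatures
        self.outfeatures = outfeatures
        self.fallback_tokens = fallback_tokens
        # 4-bit weights packed 8 per int32 along the K dimension.
        self.register_buffer('qweight', torch.zeros((infeatures // 8, outfeatures), dtype=torch.int32))
        self.register_buffer('scales', torch.zeros(outfeatures, dtype=torch.float16))
        self.register_buffer('zeros', torch.zeros(outfeatures, dtype=torch.float16))
        self.register_buffer('bias', torch.zeros(outfeatures, dtype=torch.float16))

    def dequantize(self) -> torch.Tensor:
        # Unpack each int32 into eight 4-bit values -> [K, N] fp16 weight matrix.
        shifts = torch.arange(0, 32, 4, device=self.qweight.device, dtype=torch.int32)
        w = (self.qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF        # [K//8, 8, N]
        w = w.reshape(self.infeatures, self.outfeatures).to(self.scales.dtype)
        return self.scales * (w - self.zeros)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.numel() // x.shape[-1]
        if tokens >= self.fallback_tokens:
            # Large context: one dequantization plus a standard fp16 matmul.
            return torch.matmul(x, self.dequantize()) + self.bias
        # Small context: call the custom quantized kernel here
        # (placeholder name, not a real API).
        return quant_cuda_vecmat(x, self.qweight, self.scales, self.zeros) + self.bias
```

The design point is simply that for large `M` the one-off dequantization cost is dwarfed by what cuBLAS saves over the per-element quantized kernel.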
Okay, I’ve published a more polished version of my Triton kernel: https://github.com/fpgaminer/GPTQ-triton
The README on that repo has more detailed metrics, but the Triton kernel indeed performs 10x faster than the CUDA kernel with context length 2048. In fact, it’s always faster than the CUDA kernel for all context lengths. It’s almost on par with FP16. And memory usage is the same as CUDA.
As for accuracy, it’s exactly as accurate as the CUDA kernel across wikitext2, PTB, and C4 validation sets.
Currently the kernel only supports 4-bits and groupsize -1, and I’ve only tested the 7B LLaMa weights.
I rewrote the current GPTQ kernel in Triton using your code and saw a very large speedup as well. In addition, it has been extended to support 2-bit and 8-bit quantization and groupsize. A 3-bit kernel seems difficult to implement, since Triton does not support indexing.
I’m not terribly well versed in PTX so I can’t say for certain.
Each instance of the CUDA kernel only calculates a `1 x 1 x BLOCK_SIZE_K` result, and launches a grid of (K//BLOCK_SIZE_K, N, M). BLOCK_SIZE_K is of a fixed size, regardless of M, N, or K. The Triton kernel calculates a full `BLOCK_SIZE_M x BLOCK_SIZE_N x K` block in each instance.

To the best of my understanding: the CUDA kernel does less work per thread and launches more threads; the Triton kernel does more work per thread and launches fewer threads. The end result is that the CUDA kernel has to re-fetch data more often than the Triton kernel. This is fine when the data involved fits in the L2 cache, but when it doesn't, the Triton kernel dominates. This occurs in all cases where M>1. The Triton kernel also auto-adapts the block size based on M, N, and K. In all cases the Triton kernel has performance competitive with PyTorch FP16.
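To make the re-fetching argument concrete, here is a back-of-the-envelope comparison of global-memory reads for the two launch strategies. The numbers are purely illustrative assumptions (a 2048-token context against a 4096x4096 layer, 64x64 tiles) and ignore caching and the 4-bit packing:

```python
M, N, K = 2048, 4096, 4096   # assumed problem size: tokens x out_features, in_features
BM = BN = 64                 # assumed Triton tile sizes

# One (1 x 1 x BLOCK_SIZE_K) result per block: every output element re-reads
# its entire row of the activations and column of the weights.
naive_reads = M * N * 2 * K

# One BM x BN tile per block: each tile reads a BM x K activation slice and a
# K x BN weight slice exactly once.
tiled_reads = (M // BM) * (N // BN) * (BM * K + K * BN)

print(f"naive / tiled read ratio: {naive_reads / tiled_reads:.0f}x")   # -> 64x
```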
The one downside right now is that Triton doesn’t have support for unpacking quantized data, so I have to do some hacks to get it to work. It works fine, but it isn’t getting any of the bandwidth benefits it should. In theory a set of re-written CUDA kernels would handily beat it.
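The "fetch like a repeat_interleave" workaround can be illustrated in plain PyTorch. This is just the index arithmetic, not the Triton kernel itself, and it assumes the usual GPTQ-style layout of eight 4-bit values per int32 along the K dimension: every logical row k loads packed row k // 8, so each int32 ends up being read eight times.

```python
import torch

K, N = 4096, 4096
# Packed 4-bit weights: 8 values per int32 along the K dimension.
qweight = torch.randint(-2**31, 2**31 - 1, (K // 8, N), dtype=torch.int32)

k = torch.arange(K)
rows = k // 8            # which packed int32 holds logical row k (8x redundant loads)
shifts = (k % 8) * 4     # bit offset of row k's 4-bit value inside that int32

unpacked = (qweight[rows] >> shifts.view(-1, 1)) & 0xF   # [K, N] unsigned 4-bit values
```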
I’ve done some preliminary debugging and found that `test_kernel.py` is able to demonstrate the issue. I adjust `B` in the code for each of the runs below, with the values `16`, `32`, and `64`. It looks like the quantized kernels perform faster than FP16 at `B=16` and below, but at `>=32` they slow down considerably.

The bottleneck seems to just be that there's a ho-jillion CUDA kernels being executed one after another:
Each one of those tiny lines is a `launchCudaKernel` call. The average duration of each kernel launch is ~20us, and there are literally hundreds of thousands of such events in my trace. Note that I didn't see any appreciable time in `memCopyAsync` after the initial model load, as I expected; just very brief (<<1ms) copies, presumably to send the predicted tokens back to the CPU side. It really seems like the overhead is the mammothly inefficient way that transformers and PyTorch work by default, requiring thousands and thousands of round-trips that are nearly instantaneous to execute on the GPU's side.

Tomorrow I will see about using PyTorch 2.0's `torch.compile` or ejecting a `TorchScript` or `ONNX` model or something. What I've done in the past is capture and replay entire CUDA graphs, but I couldn't really do that without carving into the `transformers` library in a pretty invasive way.

Btw, I'm using nsight-compute 2023.1 for my data capture and analysis, with CUDA 12.1 and nsight-systems 2023.2.1. I built my own PyTorch from source to support this, since I'm on WSL and profiling support there is bleeding edge.

@USBhost Are you getting degraded quality of output under the triton branch? I am getting both a performance regression and a massive quality drop-off under the triton branch using re-quantized 30B models. Eval scores are normal, but the output is wildly diverging from the cuda branch with the same temp/top-p/top_k/etc config. Still trying to isolate the issue.
I was able to run my pytorch-branch-converted model on triton under ooba fine, though I had to remove the options that are no longer used for triton.
Almost all inference code is single-threaded, so it doesn't matter if you have 16 cores; it will only use 1 per GPU.

Just monitor your CPU usage vs GPU usage. If your CPU (the core that is running Python inference) is at 100% and the GPU is at 25%, the bottleneck is the CPU: the GPU is waiting for more work while the CPU is maxed out. For reference, a 13900K has 2x the single-core performance of a 1950X; after overclocking, likely 2.2x.
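As a concrete way to do that check, a small watcher like the following can be run in a second terminal while generating. This is only a rough sketch; it assumes `psutil` and `pynvml` (nvidia-ml-py) are installed:

```python
import time

import psutil
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

while True:
    gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    # Per-core CPU usage: one core pinned near 100% while the GPU sits at
    # ~25% suggests Python-side launch overhead is the bottleneck.
    busiest_core = max(psutil.cpu_percent(percpu=True))
    print(f"GPU {gpu_util:3d}%   busiest CPU core {busiest_core:5.1f}%")
    time.sleep(1.0)
```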
@sterlind Regarding `--text "I think the meaning of life is"`: the tokenizer averages about 2/3 of a word per token (depending on the words), and in MasterTaffer's latest code (at least, the PR that was merged here), the new code doesn't kick in until 128 context tokens. Retry with a ~1300-word input and the difference should be immediately obvious even on a smaller 7B model, probably something on the order of 8-10 seconds of "latency" before you start getting output vs nearly instantaneous afterwards. Go up to a 30B model and the difference is enormous.

@sterlind What is your CPU? A 4090 requires at the very least 12th-gen Intel or Zen 4 for the CPU to keep up with feeding the CUDA cores. 25% GPU utilization is way too low; it should be 80% or higher when generating tokens. To make the results deterministic, also try setting a fixed seed for the inference in all tests.
@sterlind What is meant by context size is the length of the input text, which has been a huge bottleneck up until now. The new optimization, if I understand correctly, only kicks in once there are >128 tokens of context. It should be really obvious once you test with a few paragraphs of input.
@MasterTaffer Any chance of making a pull request? Hackish or not, it's also ridiculously faster for me. Testing on a 3090 with the various LLaMA 30B pre-quantized models linked here: https://github.com/oobabooga/text-generation-webui/pull/530 and running through text-generation-webui, with the `--wbits 3 --group-size 128`, `--wbits 3 --group-size 32`, `--wbits 4 --act-order`, and `--wbits 4 --group-size 128` quantizations, I see the same ~5x inference improvement. I'm seeing a VRAM usage increase of around 5%*, so I guess it'll probably need a toggle.

@diegomontoya That's a follow-up commit to MasterTaffer's own https://github.com/MasterTaffer/GPTQ-for-LLaMa/commit/b46c976e28af3823ef5a6f780462beaf5ad906ac; one of the four models I tested (the `--wbits 4 --group-size 128` one) crashed without that change.

*edit: Don't quote me on that though; with some models it looks slightly higher, but for some reason I'm currently able to run a LLaMA 30B `--wbits 4 --true-sequential --act-order` model at full context on my 3090 without OOMing, while the same setup without these changes would always OOM yesterday. I have no idea why.