gpt-fast: GPTQ quantization not working
Running `quantize.py` with `--mode int4-gptq` does not seem to work:
- the code tries to import `lm-evaluation-harness`, which is not included/documented/used
- an import in `eval.py` is incorrect; it should probably be `from model import Transformer as LLaMA` instead of `from model import LLaMA` (see the sketch after this list)
- after fixing the two issues above, the next one is a circular import
- after fixing that, `import lm_eval` should be replaced with `import lm_eval.base`
- there is one other circular import
- there are a few other missing imports from lm_eval
- and a few other errors
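For illustration, here is a rough sketch of what the first few fixes amount to. The `calibrate` function and the deferred-import target are hypothetical names used only for the example, not the actual gpt-fast code; the real changes are in the commit linked below.

```python
# eval.py (sketch): the corrected imports described above.
# Assumes lm-evaluation-harness is installed (e.g. `pip install lm_eval`)
# and that model.py defines `Transformer` (there is no `LLaMA` class).

import lm_eval.base  # the bare `import lm_eval` does not pull in the needed `base` submodule
from model import Transformer as LLaMA  # instead of `from model import LLaMA`


def calibrate(model: LLaMA, tasks: list[str]) -> None:
    # Deferring an import into the function that needs it is one common way to
    # break a circular import (module A importing module B at the top level
    # while module B also imports module A).
    from quantize import gptq_quantize  # hypothetical helper name, for illustration only
    ...
```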
Overall here are the fixes I had to apply to make it run: https://github.com/lopuhin/gpt-fast/commit/86d990bfbce46d10169c8e21e3bfec5cbd203b96
Based on this, could you please check if the right version of the code was included for GPTQ quantization?
About this issue
- State: open
- Created 7 months ago
- Comments: 16 (1 by maintainers)
That looked promising, but unfortunately I ran into another issue that you probably wouldn't have. I am on AMD, so that might be the cause; I can't find anything online related to this issue. I noticed that non-GPTQ int4 quantization does not work for me either, with the same error. int8 quantization works fine, and I have run GPTQ int4-quantized models with the auto-gptq library for ROCm before, so I am not sure what the issue is.
According to the code here, probably both CUDA 12.x and compute capability 8.0+ are required.
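If it helps to narrow this down, a quick environment check along these lines (plain PyTorch calls, not gpt-fast code) shows which CUDA/ROCm build and compute capability the machine reports:

```python
# Quick environment check: prints the CUDA/HIP build PyTorch was compiled
# against and the device's compute capability, which the int4 path
# reportedly needs (CUDA 12.x and capability >= 8.0 per the comment above).
import torch

print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)                   # None on CPU-only or ROCm builds
print("HIP build:", getattr(torch.version, "hip", None))   # set on ROCm builds
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("compute capability:", f"{major}.{minor}")
```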