gpt-fast: GPTQ quantization not working

Running quantize.py with --mode int4-gptq does not seem to work:

  • the code tries to import lm-evaluation-harness, which is not included, documented, or otherwise used in the repo
  • the import in eval.py is incorrect; it should probably be from model import Transformer as LLaMA instead of from model import LLaMA (see the sketch after this list)
  • after fixing the two issues above, the next one is a circular import
  • after fixing that, import lm_eval should be replaced with import lm_eval.base
  • there is one other circular import
  • there are a few other missing imports from lm_eval
  • and a few other errors
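
For reference, the import-side changes boil down to something like the following rough sketch (not the full set of fixes; it assumes model.py defines the model class as Transformer and that the installed lm-evaluation-harness version still provides lm_eval.base):

  # eval.py -- sketch of the import fixes described in the list above
  from model import Transformer as LLaMA   # was: from model import LLaMA
  import lm_eval.base                      # was: import lm_eval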

Overall here are the fixes I had to apply to make it run: https://github.com/lopuhin/gpt-fast/commit/86d990bfbce46d10169c8e21e3bfec5cbd203b96

Based on this, could you please check if the right version of the code was included for GPTQ quantization?

About this issue

  • State: open
  • Created 7 months ago
  • Comments: 16 (1 by maintainers)

Most upvoted comments

That looked promising, but unfortunately I ran into another issue that you probably wouldn't have seen. I am on AMD, so that might be the cause; I can't find anything online related to this error. I noticed that non-GPTQ int4 quantization does not work for me either and fails with the same error. int8 quantization works fine, and I have previously run GPTQ int4-quantized models with the auto-gptq library for ROCm, so I'm not sure what the issue is.

Traceback (most recent call last):
  File "/home/telnyxuser/gpt-fast/quantize.py", line 614, in <module>
    quantize(args.checkpoint_path, args.model_name, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "/home/telnyxuser/gpt-fast/quantize.py", line 560, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/home/telnyxuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/telnyxuser/gpt-fast/quantize.py", line 423, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "/home/telnyxuser/gpt-fast/quantize.py", line 358, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/home/telnyxuser/.local/lib/python3.10/site-packages/torch/_ops.py", line 753, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

According to the code here, it looks like both CUDA 12.x and compute capability 8.0+ are required.
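
For anyone else hitting this, a rough way to check the relevant parts of your environment (sketch only; on ROCm builds torch.version.cuda is None and torch.version.hip is set instead):

  import torch

  # rough check of the requirements mentioned above
  print("CUDA toolkit:", torch.version.cuda)                   # None on ROCm builds
  print("ROCm/HIP:", torch.version.hip)                        # None on CUDA builds
  print("capability:", torch.cuda.get_device_capability())     # int4pack path appears to need >= (8, 0)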