AutoAWQ: Error quantizing Mixtral: IndexError: index 0 is out of bounds for dimension 1 with size 0

I’m using the example provided here (a sketch of the script I’m running is included below, after the environment details).

Hardware:

  • 3x A6000 GPU

Model:

  • mistralai/Mixtral-8x7B-Instruct-v0.1

Environment:

Name: autoawq
Version: 0.1.8
Summary: AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
Home-page: https://github.com/casper-hansen/AutoAWQ
Author: Casper Hansen
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, attributedict, lm-eval, protobuf, sentencepiece, tabulate, texttable, tokenizers, toml, torch, torchvision, transformers
Required-by: 

- `transformers` version: 4.36.2
- Platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.20.2
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0
- Accelerate config: 	not found
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
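
For context, here is a minimal sketch of the quantization script I am running, adapted from the AutoAWQ example. The `quant_config` values are the example defaults and are assumptions on my part; the model path and the failing `quantize` call match the traceback below.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quant_path = "mixtral-8x7b-instruct-awq"  # output directory, name chosen for illustration

# AWQ settings from the standard example (assumed; adjust to match your run)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the unquantized model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# This is the call that raises the IndexError shown below
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```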

Error:

/usr/local/lib/python3.10/dist-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.
  warnings.warn("Repo card metadata block was not found. Setting CardData to empty.")
AWQ:   0%|          | 0/32 [00:27<?, ?it/s]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 2
      1 # Quantize
----> 2 model.quantize(tokenizer, quant_config=quant_config)

File /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/awq/models/base.py:93, in BaseAWQForCausalLM.quantize(self, tokenizer, quant_config, calib_data, split, text_column, duo_scaling, modules_to_not_convert)
     87 self.quant_config: AwqConfig = AwqConfig.from_dict(quant_config)
     89 quantizer = AwqQuantizer(
     90     self, self.model, tokenizer, self.quant_config.w_bit, self.quant_config.q_group_size,
     91     self.quant_config.version, calib_data, split, text_column, duo_scaling, modules_to_not_convert=modules_to_not_convert
     92 )
---> 93 quantizer.quantize()
     95 self.is_quantized = True

File /usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py:112, in AwqQuantizer.quantize(self)
    109 clip_list = append_str_prefix(clip_list, get_op_name(self.model, self.modules[i]) + ".")
    111 # [STEP 4]: Quantize weights
--> 112 self._apply_quant(self.modules[i], named_linears)
    113 clear_memory()

File /usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py:133, in AwqQuantizer._apply_quant(self, module, named_linears)
    130 elif self.version  == 'GEMV':
    131     q_linear_module = WQLinear_GEMV
--> 133 q_linear = q_linear_module.from_linear(
    134     linear=linear_layer,
    135     w_bit=self.w_bit,
    136     group_size=self.group_size,
    137     init_only=False,
    138     scales=scales,
    139     zeros=zeros
    140 )
    142 linear_layer.cpu()
    143 q_linear.to(next(module.parameters()).device)

File /usr/local/lib/python3.10/dist-packages/awq/modules/linear.py:79, in WQLinear_GEMM.from_linear(cls, linear, w_bit, group_size, init_only, scales, zeros)
     77     for i in range(pack_num):
     78         qweight_col = intweight[:, col * pack_num + order_map[i]]
---> 79         qweight[:, col] |= qweight_col << (i * awq_linear.w_bit)
     80 awq_linear.qweight = qweight
     82 zeros = zeros.to(dtype=torch.int32)

IndexError: index 0 is out of bounds for dimension 1 with size 0

About this issue

  • Original URL
  • State: closed
  • Created 6 months ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

You can use multiple GPUs. I would recommend installing from the main branch and just following the Mixtral example. In v0.2.0, I will remove the `modules_to_not_convert` argument and make it an internal parameter only.

@lonestriker you need to use the `mixtral_quant.py` script in the examples.
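
A minimal sketch of what that looks like, assuming the same setup as the script above and that the MoE router linears are named "gate" as in the example:

```python
# Same model/tokenizer/quant_config setup as in the script above, but keep the
# small MoE router ("gate") linears unquantized, as the Mixtral example does.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    modules_to_not_convert=["gate"],
)
```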