AutoGPTQ: [BUG] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Describe the bug: I tried the quick start demo code with a fine-tuned Qwen model, but it fails with RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Hardware details: GPU: RTX A40 48GB; CPU: 15 vCPU; Memory: 80GB

Software versions: OS: Ubuntu 22.04.1 LTS; CUDA: 11.8; Python: 3.10.8; auto-gptq: 0.4.2; transformers: 4.32.0; torch: 2.1.0; accelerate: 0.23.0

To Reproduce

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/root/autodl-tmp/newmodel_14b"
quantized_model_dir = "/root/autodl-tmp/newmodel_14b_int4"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.",
        return_tensors="pt",
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # 128 is the generally recommended value for this parameter
    desc_act=False,  # setting this to False can significantly speed up inference, at the cost of slightly worse perplexity
)

# Load the un-quantized model; by default, the model is always loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config,
                                            trust_remote_code=True, device_map="cuda:0")

# Quantize the model; the examples should be a List[Dict] whose dict keys are exactly input_ids and attention_mask
model.quantize(examples)

# Save the quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

Expected behavior: quantization completes and produces the result shown in the demo.

Screenshots: (screenshot attached in the original issue)

Additional context (log output):

Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023. (Please use the latest model and code; in particular, if you started using Qwen-7B before September 25, take care not to use the wrong code or model.)
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention

Loading checkpoint shards: 100% 15/15 [00:03<00:00, 4.34it/s]

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 25
     21 model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config,
     22                                             trust_remote_code=True, device_map="cuda:0")
     24 # Quantize the model; the examples should be a List[Dict] whose dict keys are exactly input_ids and attention_mask
---> 25 model.quantize(examples)
     27 # Save the quantized model using safetensors
     28 model.save_quantized(quantized_model_dir, use_safetensors=True)

File ~/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/miniconda3/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:359, in BaseGPTQForCausalLM.quantize(self, examples, batch_size, use_triton, use_cuda_fp16, autotune_warmup_after_quantized, cache_examples_on_gpu)
    357         else:
    358             additional_layer_inputs[k] = v
--> 359     layer(layer_input, **additional_layer_inputs)
    360 for h in handles:
    361     h.remove()

File ~/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/.cache/huggingface/modules/transformers_modules/newmodel_14b/modeling_qwen.py:653, in QWenBlock.forward(self, hidden_states, rotary_pos_emb_list, registered_causal_mask, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions)
    638 def forward(
    639     self,
    640     hidden_states: Optional[Tuple[torch.FloatTensor]],
   (...)
    649     output_attentions: Optional[bool] = False,
    650 ):
    651     layernorm_output = self.ln_1(hidden_states)
--> 653     attn_outputs = self.attn(
    654         layernorm_output,
    655         rotary_pos_emb_list,
    656         registered_causal_mask=registered_causal_mask,
    657         layer_past=layer_past,
    658         attention_mask=attention_mask,
    659         head_mask=head_mask,
    660         use_cache=use_cache,
    661         output_attentions=output_attentions,
    662     )
    663     attn_output = attn_outputs[0]
    665     outputs = attn_outputs[1:]

File ~/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/.cache/huggingface/modules/transformers_modules/newmodel_14b/modeling_qwen.py:482, in QWenAttention.forward(self, hidden_states, rotary_pos_emb_list, registered_causal_mask, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions, use_cache)
    480     q_pos_emb, k_pos_emb = rotary_pos_emb
    481     # Slice the pos emb for current inference
--> 482     query = apply_rotary_pos_emb(query, q_pos_emb)
    483     key = apply_rotary_pos_emb(key, k_pos_emb)
    484 else:

File ~/.cache/huggingface/modules/transformers_modules/newmodel_14b/modeling_qwen.py:1410, in apply_rotary_pos_emb(t, freqs)
   1408 t_ = t_.float()
   1409 t_pass_ = t_pass_.float()
-> 1410 t_ = (t_ * cos) + (_rotate_half(t_) * sin)
   1411 return torch.cat((t_, t_pass_), dim=-1).type_as(t)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 15

Most upvoted comments

Inside the model.quantize function there is the following code:

if isinstance(v, torch.Tensor):
    one_kwargs[k] = move_to_device(v, self.data_device)
else:
    one_kwargs[k] = v

However, for some models such as Qwen, one of the arguments that QWenBlock receives in forward is rotary_pos_emb_list, which is a List[torch.Tensor]. The code above does not move a List[torch.Tensor] onto CUDA, while the computation runs on CUDA, so the cuda-vs-cpu error is raised.

The simplest workaround is to define a nested_move_to_device helper:

def nested_move_to_device(v, device):
    if isinstance(v, torch.Tensor):
        return move_to_device(v, device)
    elif isinstance(v, (list, tuple)):
        return type(v)([nested_move_to_device(e, device) for e in v])
    else:
        return v

Then replace the original code with it and the problem is solved.

You can refer to this repo: https://github.com/wangitu/unpadded-AutoGPTQ
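
For reference, a minimal sketch of how the replacement could look at the call site shown in the traceback (auto_gptq/modeling/_base.py around line 359). Only additional_layer_inputs, layer_input, and layer appear in the traceback; the other names here are assumptions and may differ between auto-gptq versions:

# stored_layer_kwargs: assumed name for the dict of captured forward kwargs
# layer_device: assumed name for the device the current layer lives on
for k, v in stored_layer_kwargs.items():
    # nested_move_to_device also handles List[torch.Tensor] values such as Qwen's
    # rotary_pos_emb_list, which the original isinstance(v, torch.Tensor) check skipped
    additional_layer_inputs[k] = nested_move_to_device(v, layer_device)
layer(layer_input, **additional_layer_inputs)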

I originally fine-tuned from Qwen-7B-Chat; after quantization, I switched the demo's pipeline to the chat interface for inference, and inference then behaved normally.
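
A rough sketch of that inference path, assuming the quantized checkpoint is loaded back with AutoGPTQForCausalLM.from_quantized and that the AutoGPTQ wrapper forwards chat() to the underlying Qwen model (the forwarding behaviour and the exact chat() signature are assumptions based on Qwen's remote code, not verified here):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

pretrained_model_dir = "/root/autodl-tmp/newmodel_14b"
quantized_model_dir = "/root/autodl-tmp/newmodel_14b_int4"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)

# load the quantized checkpoint saved by save_quantized(..., use_safetensors=True)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_safetensors=True,
    trust_remote_code=True,
)

# Qwen chat checkpoints expose a chat() helper; calling it through the wrapper is
# assumed to reach the underlying Qwen model
response, history = model.chat(tokenizer, "Hello, who are you?", history=None)
print(response)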
