AutoGPTQ: [BUG] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Describe the bug
I tried the quick-start demo code with a fine-tuned Qwen model, but got: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Hardware details
- GPU: RTX A40 48GB
- CPU: 15 vCPU
- Memory: 80GB
Software versions
- OS: Ubuntu 22.04.1 LTS
- CUDA: 11.8
- Python: 3.10.8
- auto-gptq: 0.4.2
- transformers: 4.32.0
- torch: 2.1.0
- accelerate: 0.23.0
To Reproduce
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/root/autodl-tmp/newmodel_14b"
quantized_model_dir = "/root/autodl-tmp/newmodel_14b_int4"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.",
        return_tensors="pt",
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize the model to 4-bit
    group_size=128,  # 128 is the generally recommended value
    desc_act=False,  # False speeds up inference noticeably, at the cost of slightly worse perplexity
)

# Load the unquantized model; by default the model is always loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config,
                                            trust_remote_code=True, device_map="cuda:0")

# Quantize the model; examples must be List[Dict] whose keys are exactly input_ids and attention_mask
model.quantize(examples)

# Save the quantized model with safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
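The quick-start demo normally continues by reloading the quantized model for inference; a minimal sketch of that half, using AutoGPTQ's documented from_quantized entry point (the prompt string here is just a placeholder):

# Reload the quantized model onto the GPU and run a quick generation test
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])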
Expected behavior
Quantization completes and produces output as shown in the demo.
Additional context
Log:
Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023. Take care not to use the wrong code or model.
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100% 15/15 [00:03<00:00, 4.34it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 25
21 model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config,
22 trust_remote_code=True, device_map="cuda:0")
24 # Quantize the model; examples must be List[Dict] whose keys are exactly input_ids and attention_mask
---> 25 model.quantize(examples)
27 # Save the quantized model with safetensors
28 model.save_quantized(quantized_model_dir, use_safetensors=True)
File ~/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/miniconda3/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:359, in BaseGPTQForCausalLM.quantize(self, examples, batch_size, use_triton, use_cuda_fp16, autotune_warmup_after_quantized, cache_examples_on_gpu)
357 else:
358 additional_layer_inputs[k] = v
--> 359 layer(layer_input, **additional_layer_inputs)
360 for h in handles:
361 h.remove()
File ~/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/.cache/huggingface/modules/transformers_modules/newmodel_14b/modeling_qwen.py:653, in QWenBlock.forward(self, hidden_states, rotary_pos_emb_list, registered_causal_mask, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions)
638 def forward(
639 self,
640 hidden_states: Optional[Tuple[torch.FloatTensor]],
(...)
649 output_attentions: Optional[bool] = False,
650 ):
651 layernorm_output = self.ln_1(hidden_states)
--> 653 attn_outputs = self.attn(
654 layernorm_output,
655 rotary_pos_emb_list,
656 registered_causal_mask=registered_causal_mask,
657 layer_past=layer_past,
658 attention_mask=attention_mask,
659 head_mask=head_mask,
660 use_cache=use_cache,
661 output_attentions=output_attentions,
662 )
663 attn_output = attn_outputs[0]
665 outputs = attn_outputs[1:]
File ~/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/.cache/huggingface/modules/transformers_modules/newmodel_14b/modeling_qwen.py:482, in QWenAttention.forward(self, hidden_states, rotary_pos_emb_list, registered_causal_mask, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions, use_cache)
480 q_pos_emb, k_pos_emb = rotary_pos_emb
481 # Slice the pos emb for current inference
--> 482 query = apply_rotary_pos_emb(query, q_pos_emb)
483 key = apply_rotary_pos_emb(key, k_pos_emb)
484 else:
File ~/.cache/huggingface/modules/transformers_modules/newmodel_14b/modeling_qwen.py:1410, in apply_rotary_pos_emb(t, freqs)
1408 t_ = t_.float()
1409 t_pass_ = t_pass_.float()
-> 1410 t_ = (t_ * cos) + (_rotate_half(t_) * sin)
1411 return torch.cat((t_, t_pass_), dim=-1).type_as(t)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 15
The model.quantize function contains the following code:
if isinstance(v, torch.Tensor):
    one_kwargs[k] = move_to_device(v, self.data_device)
else:
    one_kwargs[k] = v
However, for some models such as Qwen, one of the arguments QWenBlock receives in forward is rotary_pos_emb_list, which is a List[torch.Tensor]. The code above does not move a List[torch.Tensor] onto CUDA, yet the computation runs on CUDA, so the cuda-vs-cpu error is raised.
The simplest workaround is to define a nested_move_to_device:
def nested_move_to_device(v, device):
    if isinstance(v, torch.Tensor):
        return move_to_device(v, device)
    elif isinstance(v, (list, tuple)):
        return type(v)([nested_move_to_device(e, device) for e in v])
    else:
        return v
Then replace the original code with it, and the problem is solved.
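Concretely, with that helper in place, the branch quoted above collapses to a single call (a sketch against the snippet shown earlier; nested_move_to_device already returns non-tensor values unchanged, so the isinstance check becomes unnecessary):

# nested_move_to_device handles bare tensors as well as lists/tuples of
# tensors such as Qwen's rotary_pos_emb_list; other values pass through
one_kwargs[k] = nested_move_to_device(v, self.data_device)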
See this repo for reference: https://github.com/wangitu/unpadded-AutoGPTQ
It turned out the model was fine-tuned from Qwen-7B-Chat. For inference after quantization, the demo's pipeline should be switched to chat; inference then behaves normally.
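For reference, a minimal sketch of that chat-style inference, assuming the quantized checkpoint is loaded through transformers the way the official Qwen int4 checkpoints are (model.chat comes from Qwen's remote code; the query is a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir, device_map="cuda:0", trust_remote_code=True
).eval()
# Qwen chat models expose chat(); it applies the chat prompt template internally
response, history = model.chat(tokenizer, "Hello, who are you?", history=None)
print(response)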