bitsandbytes: RuntimeError: expected scalar type Half but found Float
I ran into the following error when I tried to train bloom-7b1-mt with peft LoRA in 8bit + fp16 (torch amp) mode:
Traceback (most recent call last):
File "finetune.py", line 141, in <module>
train(args)
File "finetune.py", line 133, in train
trainer.train()
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 1638, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 1903, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 2660, in training_step
tmp.backward()
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
self, gradient, retain_graph, create_graph, inputs=inputs
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py", line 456, in backward
grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float
This error does not appear when training llama-7b with exactly the same settings. It also does not appear if I set fp16=False.
import torch
import transformers
from transformers import AutoModel
from peft import get_peft_model, prepare_model_for_int8_training

model = AutoModel.from_pretrained(
    BLOOM_MODEL_PATH,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, lora_config)
...
trainer = transformers.Trainer(
    args=transformers.TrainingArguments(
        ...
        fp16=False,
        ...
    ),
    ...
)
Hi!
I’ve solved it by adding:
solved
This error occurs when the two matrices you are multiplying do not have the same dtype.
Half means dtype = torch.float16, while Float means dtype = torch.float32.
To resolve the error, simply cast your model weights to float16.
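For illustration, a minimal sketch of that cast (assuming model is the PEFT-wrapped model from the snippet above) might look like this:

import torch

# Sketch only: cast any parameters still held in float32 down to float16,
# so the matmul in the bitsandbytes backward pass sees a single dtype.
for name, param in model.named_parameters():
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.float16)

Note that prepare_model_for_int8_training deliberately keeps some layers (e.g. layer norms) in float32 for training stability, which may be why the next comment sees the loss collapse to 0 after such a cast.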
I am using TRL’s SFTTrainer to train OPT (loaded in int8 + LoRA) and I am seeing the same issue. While that fix solves the precision mismatch, it causes the loss to become 0.
I was using peft with the Whisper model and got this issue.
solved the issue.
I guess AMP uses lower precision (like float16, also known as half precision) for certain operations that are less sensitive to precision, while keeping others (like the loss calculation) in higher precision (like float32).
The torch.autocast context manager will enable AMP for the operations inside its block. Changing the param’s datatype manually helped me. Training has started 😃
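As an illustration only (the exact placement is not shown in the comment), wrapping the training call in the autocast context manager would look roughly like this:

import torch

# Hypothetical sketch: run the training loop under AMP so operations inside
# the block are autocast, keeping dtypes consistent in the backward pass.
# trainer is assumed to be the transformers.Trainer built earlier.
with torch.autocast("cuda"):
    trainer.train()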