LLaMA-Factory: ValueError: Attempting to unscale FP16 gradients.

I ran this command:

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path openlm-research/open_llama_7b \
    --do_train \
    --dataset train \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 2000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
[INFO|training_args.py:1345] 2023-12-07 06:09:02,164 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-07 06:09:02,164 >> PyTorch: setting up devices
[INFO|trainer.py:1760] 2023-12-07 06:09:03,760 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-12-07 06:09:03,761 >>   Num examples = 78,303
[INFO|trainer.py:1762] 2023-12-07 06:09:03,761 >>   Num Epochs = 3
[INFO|trainer.py:1763] 2023-12-07 06:09:03,761 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1766] 2023-12-07 06:09:03,761 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1767] 2023-12-07 06:09:03,761 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1768] 2023-12-07 06:09:03,761 >>   Total optimization steps = 14,682
[INFO|trainer.py:1769] 2023-12-07 06:09:03,762 >>   Number of trainable parameters = 4,194,304
  0%|                                                                                                                                                                                               | 0/14682 [00:00<?, ?it/s][WARNING|logging.py:290] 2023-12-07 06:09:03,766 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 68, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1950, in _inner_training_loop
    self.accelerator.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

It worked fine yesterday, but after I changed the dataset size today this error appeared. What is going on?

About this issue

  • State: closed
  • Created 7 months ago
  • Comments: 19 (2 by maintainers)

Most upvoted comments

Thanks for the reply — switching to peft==0.6.0 got it running.
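
For anyone else hitting this, that fix is just pinning the dependency (assuming a pip-managed environment):

    pip install peft==0.6.0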

We recommend using peft==0.6.0.
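
If downgrading peft is not an option, a common alternative workaround is to keep the trainable (LoRA) parameters in fp32: as the traceback shows, torch's GradScaler raises this ValueError when the gradients it is asked to unscale are themselves stored in fp16, which happens when the trainable parameters end up in fp16. The sketch below is illustrative only — the helper name upcast_trainable_params is made up, and this is not LLaMA-Factory's or peft's code. It would have to run after the adapters are attached and before the optimizer is created:

    import torch

    def upcast_trainable_params(model: torch.nn.Module) -> None:
        # Frozen base weights can stay in fp16; only parameters that will
        # receive gradients need to be fp32 for GradScaler.unscale_ to succeed.
        for param in model.parameters():
            if param.requires_grad and param.dtype == torch.float16:
                param.data = param.data.float()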