ColossalAI: [BUG]: Is it normal to get NaN loss after Stage 1 - Supervised Finetuning?

πŸ› Describe the bug

wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Waiting for W&B process to finish... (success).
wandb: - 0.010 MB of 0.010 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: batch_id β–β–β–‚β–‚β–ƒβ–ƒβ–„β–„β–…β–…β–†β–†β–‡β–‡β–ˆβ–ˆ
wandb:    epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:     loss ▁               
wandb:       lr β–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–…β–„β–ƒβ–ƒβ–‚β–‚β–β–β–
wandb: 
wandb: Run summary:
wandb: batch_id 127
wandb:    epoch 0
wandb:     loss nan
wandb:       lr 0.0
wandb: 
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)

Is it normal to get a NaN loss here?

Environment

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 22 (4 by maintainers)

Most upvoted comments

My experience:

  • model.half() + Adam (eps=1e-8): loss NaN
  • model.half() + SGD: loss normal, but does not converge
  • model.half() + Adam (eps=1e-4): loss normal, but does not converge
  • model.half() + fp16 strategy: loss normal, but does not converge
  • model without .half() + Adam (eps=1e-8): loss normal, converges

Removing .half() makes it work. I hope this information is useful.
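A plausible explanation for the Adam/fp16 combinations above (an assumption based on IEEE fp16 limits, not something confirmed in the ColossalAI code): an epsilon of 1e-8 is below the smallest positive fp16 subnormal (about 6e-8), so if the optimizer math runs in half precision the epsilon underflows to zero and the update can divide by a vanishing second moment, producing inf/NaN. A minimal sketch of the underflow:

```python
import torch

# 1e-8 underflows to 0.0 in fp16 (smallest positive subnormal is ~5.96e-8),
# so Adam's denominator sqrt(v) + eps can collapse to zero when v is tiny.
eps_fp32 = torch.tensor(1e-8, dtype=torch.float32)
eps_fp16 = torch.tensor(1e-8, dtype=torch.float16)
print(eps_fp32.item())   # 1e-08
print(eps_fp16.item())   # 0.0

# With a tiny second-moment estimate v, the fp16 update blows up:
v = torch.tensor(1e-12, dtype=torch.float16)       # flushes to 0.0 in fp16
grad = torch.tensor(1e-3, dtype=torch.float16)
step = grad / (v.sqrt() + eps_fp16)                # 1e-3 / 0 -> inf
print(step.item())                                 # inf, which later surfaces as NaN

# A larger eps such as 1e-4 is representable in fp16 and keeps the update finite.
safe_step = grad / (v.sqrt() + torch.tensor(1e-4, dtype=torch.float16))
print(safe_step.item())                            # ~10.0
```

This would also be consistent with why eps=1e-4 keeps the loss finite (but may hurt convergence) and why keeping master weights/optimizer state in fp32 avoids the problem entirely.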

@JThh Looking forward to a fix for this! I suspect something goes wrong during the LoRA parameter optimization.

This seems to have nothing to do with gradient accumulation. In the first few steps the loss is normal, and then it becomes NaN.

@Yunnglin It looks like the optimizer fails as long as the LoRA parameters are fp16. I hacked into the code and kept the model parameters in fp16 but cast the LoRA parameters to fp32; after that it started to work.
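For reference, a minimal sketch of that kind of workaround, assuming the LoRA parameters are the trainable ones and can be identified by "lora" in their names (the helper name and the selection rule are illustrative, not the actual ColossalAI code):

```python
import torch

def promote_lora_params_to_fp32(model: torch.nn.Module) -> torch.nn.Module:
    """Keep frozen base weights in fp16 but store trainable LoRA weights in fp32.

    Assumes LoRA parameters are the only trainable ones and contain 'lora'
    in their parameter names, which is common but may differ in your setup.
    """
    for name, param in model.named_parameters():
        if param.requires_grad and "lora" in name.lower():
            param.data = param.data.float()
    return model

# Hypothetical usage:
# model = build_model().half()                  # base model stays in fp16
# model = promote_lora_params_to_fp32(model)    # LoRA weights back to fp32
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5, eps=1e-8
# )
```

With the trainable parameters in fp32, the Adam state and epsilon no longer suffer from fp16 underflow, which matches the observation above.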

Is this warning shown in your log?

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/opt/conda/lib/python3.9/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

This might be the cause of the issue, though it is still unresolved.
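If that warning appears with a Hugging Face transformers model, a commonly suggested workaround (a general transformers-level fix, not a confirmed fix for this particular issue) is to disable the KV cache and make the checkpointed inputs require gradients before training, so that gradient checkpointing still produces gradients for the LoRA adapters even though the base weights are frozen:

```python
# Assuming `model` is a transformers PreTrainedModel with frozen base weights
# and trainable LoRA adapters.
model.config.use_cache = False            # KV cache is incompatible with checkpointing
model.gradient_checkpointing_enable()     # trade compute for memory
model.enable_input_require_grads()        # let checkpointed activations carry grads
                                          # even though the embedding weights are frozen
```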