ColossalAI: [BUG]: Is it normal to get NaN loss after Stage 1 - Supervised Finetuning?

πŸ› Describe the bug

wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Waiting for W&B process to finish... (success).
wandb: - 0.010 MB of 0.010 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: batch_id β–β–β–‚β–‚β–ƒβ–ƒβ–„β–„β–…β–…β–†β–†β–‡β–‡β–ˆβ–ˆ
wandb:    epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:     loss ▁               
wandb:       lr β–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–…β–„β–ƒβ–ƒβ–‚β–‚β–β–β–
wandb: 
wandb: Run summary:
wandb: batch_id 127
wandb:    epoch 0
wandb:     loss nan
wandb:       lr 0.0
wandb: 
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)

Is it normal to get a NaN loss here?

Environment

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 22 (4 by maintainers)

Most upvoted comments

My experience:

  • model.half() + Adam (eps=1e-8): loss NaN
  • model.half() + SGD: loss normal, but does not converge
  • model.half() + Adam (eps=1e-4): loss normal, but does not converge
  • model.half() + fp16 strategy: loss normal, but does not converge
  • model without .half() + Adam (eps=1e-8): loss normal, converges

Removing .half() makes it work. I hope this information is useful.
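A plausible explanation for the Adam/fp16 combinations above (an assumption based on IEEE fp16 limits, not something confirmed in the ColossalAI code): an epsilon of 1e-8 is below the smallest positive fp16 subnormal (about 6e-8), so if the optimizer math runs in half precision the epsilon underflows to zero and the update can divide by a vanishing second moment, producing inf/NaN. A minimal sketch of the underflow:

```python
import torch

# 1e-8 underflows to 0.0 in fp16 (smallest positive subnormal is ~5.96e-8),
# so Adam's denominator sqrt(v) + eps can collapse to zero when v is tiny.
eps_fp32 = torch.tensor(1e-8, dtype=torch.float32)
eps_fp16 = torch.tensor(1e-8, dtype=torch.float16)
print(eps_fp32.item())   # 1e-08
print(eps_fp16.item())   # 0.0

# With a tiny second-moment estimate v, the fp16 update blows up:
v = torch.tensor(1e-12, dtype=torch.float16)       # flushes to 0.0 in fp16
grad = torch.tensor(1e-3, dtype=torch.float16)
step = grad / (v.sqrt() + eps_fp16)                # 1e-3 / 0 -> inf
print(step.item())                                 # inf, which later surfaces as NaN

# A larger eps such as 1e-4 is representable in fp16 and keeps the update finite.
safe_step = grad / (v.sqrt() + torch.tensor(1e-4, dtype=torch.float16))
print(safe_step.item())                            # ~10.0
```

This would also be consistent with why eps=1e-4 keeps the loss finite (but may hurt convergence) and why keeping master weights/optimizer state in fp32 avoids the problem entirely.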

@JThh Looking forward to a fix for this! I suspect something goes wrong during the LoRA parameter optimization.

This seems to have nothing to do with gradient accumulation. In the first few steps the loss is normal, and then it becomes NaN.

@Yunnglin It looks like the optimizer fails as long as the LoRA parameters are fp16. I hacked into the code and kept the model parameters in fp16 but cast the LoRA parameters to fp32; after that it started to work.
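For reference, a minimal sketch of that kind of workaround, assuming the LoRA parameters are the trainable ones and can be identified by "lora" in their names (the helper name and the selection rule are illustrative, not the actual ColossalAI code):

```python
import torch

def promote_lora_params_to_fp32(model: torch.nn.Module) -> torch.nn.Module:
    """Keep frozen base weights in fp16 but store trainable LoRA weights in fp32.

    Assumes LoRA parameters are the only trainable ones and contain 'lora'
    in their parameter names, which is common but may differ in your setup.
    """
    for name, param in model.named_parameters():
        if param.requires_grad and "lora" in name.lower():
            param.data = param.data.float()
    return model

# Hypothetical usage:
# model = build_model().half()                  # base model stays in fp16
# model = promote_lora_params_to_fp32(model)    # LoRA weights back to fp32
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5, eps=1e-8
# )
```

With the trainable parameters in fp32, the Adam state and epsilon no longer suffer from fp16 underflow, which matches the observation above.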

Is this warning shown in your log?

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/opt/conda/lib/python3.9/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

This might be the cause of the issue, though it is still unresolved.
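If that warning appears with a Hugging Face transformers model, a commonly suggested workaround (a general transformers-level fix, not a confirmed fix for this particular issue) is to disable the KV cache and make the checkpointed inputs require gradients before training, so that gradient checkpointing still produces gradients for the LoRA adapters even though the base weights are frozen:

```python
# Assuming `model` is a transformers PreTrainedModel with frozen base weights
# and trainable LoRA adapters.
model.config.use_cache = False            # KV cache is incompatible with checkpointing
model.gradient_checkpointing_enable()     # trade compute for memory
model.enable_input_require_grads()        # let checkpointed activations carry grads
                                          # even though the embedding weights are frozen
```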