transformers: FP16 overflow with GPT-Neo when using sequence lengths of 2048.
Environment info
- `transformers` version: 4.5.0.dev0
- Platform: Linux-5.4.0-54-generic-x86_64-with-glibc2.29
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.0+cu111
- Tensorflow version (GPU?): N/A
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Models:
- GPT-Neo 1.3b
Library:
- deepspeed: @stas00
Information
Model I am using (Bert, XLNet …): GPT-Neo 1.3B
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Use GPT-Neo 1.3B with The Pile dataset and the built-in Trainer. Artificial data also suffices; it does not matter what the data is, as long as the attention mask spans all 2048 tokens.
- Enable FP16 and set max_length to 2048.
- Observe that all losses reported are NaN.

This is also reproducible using AMP or DeepSpeed. There appears to be code in the GPT-Neo implementation intended to circumvent this, where q, k, v are cast to fp32 in the attention block. When max_length is shorter (512), this overflow does not occur.
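For reference, here is a minimal sketch of a reproduction along these lines, assuming synthetic data and the built-in Trainer; the model id, batch size, and step count are illustrative, not our exact setup:

```python
import torch
from transformers import GPTNeoForCausalLM, Trainer, TrainingArguments

# Synthetic data: the token ids don't matter, as long as the attention mask
# spans all 2048 positions.
class RandomTokens(torch.utils.data.Dataset):
    def __init__(self, seq_len=2048, size=64, vocab_size=50257):
        self.examples = torch.randint(0, vocab_size, (size, seq_len))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        ids = self.examples[i]
        return {"input_ids": ids, "attention_mask": torch.ones_like(ids), "labels": ids}

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,        # with fp16 and full 2048-token sequences the logged loss is NaN
    logging_steps=1,
    max_steps=10,
)

Trainer(model=model, args=args, train_dataset=RandomTokens()).train()
```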
Expected behavior
I expected no overflows.
Aside
I’m reaching out on behalf of EleutherAI; Lysandre told us to create an issue about this.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 5
- Comments: 62 (39 by maintainers)
In general, if you want users to be able to use fp16 mixed precision for fine-tuning and inference, you need to pre-train the model in that mode. For some models we find workarounds that localize the switch to fp32 to the specific submodules that underflow/overflow under fp16, but users often still get NaNs during long training.
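As an illustration of that kind of localized fp32 workaround, here is a minimal sketch using a toy attention-score module under AMP autocast; this is a generic pattern, not the actual GPT-Neo attention code:

```python
import torch

class AttentionScores(torch.nn.Module):
    """Toy stand-in for the score computation inside an attention block."""

    def forward(self, q, k):
        # Disable autocast locally and upcast q/k so the matmul and softmax
        # run in fp32, then cast the result back to the ambient dtype.
        with torch.cuda.amp.autocast(enabled=False):
            scores = torch.matmul(q.float(), k.float().transpose(-1, -2))
            scores = scores / (q.size(-1) ** 0.5)
            probs = torch.softmax(scores, dim=-1)
        return probs.to(q.dtype)
```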
Bottom line: if you pre-train in bf16, be prepared to tell users to use fp32 or bf16 in their fine-tuning/inference processes. As new hardware supporting the bf16/tf32 formats emerges (RTX-3090 and A100), this will become the simple go-to solution in the future.
Now that DeepSpeed will have a full-fp32 mode, this is great.
So to summarize: at this moment, with Samyam’s branch, if you use `fp16.enabled=false` in the DeepSpeed config, then `zero.Init(dtype=torch.float)` is needed in `modeling_utils.py` (instead of just `zero.Init()`). I need to think about how to make that configurable. I’m asking the DeepSpeed devs if they have some ideas on how to overcome this; I will keep you posted if we find a good interim solution.
But at the very least we now know why the model fails under fp16.
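For concreteness, a sketch of the DeepSpeed config fragment with fp16 disabled; the surrounding keys are placeholders for whatever your existing config already contains:

```python
# Minimal DeepSpeed config fragment (normally stored as JSON and passed to the
# Trainer via --deepspeed ds_config.json) with fp16 turned off.
ds_config = {
    "fp16": {
        "enabled": False,
    },
    # ... the rest of your existing ZeRO / optimizer / scheduler settings ...
}
```

Together with the `zero.Init(dtype=torch.float)` change mentioned above, this keeps the ZeRO-3-partitioned weights in fp32 as well.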
I wonder if pre-training processes targeted for mixed precision use should have a loss penalty component that forces the model to remain within fp16 dynamic range, both upper and lower.
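Just to sketch that idea (untested; the penalty weight and thresholds below are picked arbitrarily), such a term might look like this:

```python
import torch

FP16_MAX = 65504.0   # largest finite fp16 value
FP16_TINY = 6e-5     # roughly the smallest normal fp16 magnitude

def fp16_range_penalty(activations: torch.Tensor, weight: float = 1e-4) -> torch.Tensor:
    """Auxiliary loss term nudging activations back inside the fp16 dynamic range.

    Penalizes magnitudes above the fp16 max (overflow risk) and, more softly,
    non-zero magnitudes below the smallest normal fp16 value (underflow risk).
    """
    mag = activations.abs()
    over = torch.relu(mag - FP16_MAX)                        # distance above the range
    under = torch.relu(FP16_TINY - mag) * (mag > 0).float()  # distance below, ignoring exact zeros
    return weight * (over.mean() + under.mean())

# usage: loss = task_loss + fp16_range_penalty(hidden_states)
```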
Yeah the minimal example removes all evaluation. In FP32 it does work though. I tested a checkpoint the other day.
Yes, but large logits are a potential symptom of what’s going on in the network.
I’ve just created a new debug tool that helps diagnose the activation-overflow issue. It’s waiting for review to complete, but if you want to try it sooner, please grab this branch: https://github.com/huggingface/transformers/pull/11274
and add
--debug activation_overflow
to the training command line. It will abort and dump a trace of the last 40 inputs/outputs of the forward calls preceding the point where the inf/nan is encountered, which should hopefully give an indication of where the problem is.
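If you can’t wait for that branch, a rough approximation of the same diagnosis is possible with plain forward hooks; this is only a sketch, not the implementation in the PR above:

```python
import torch

def attach_overflow_hooks(model: torch.nn.Module):
    """Register forward hooks that flag the first module producing inf/nan outputs."""

    def make_hook(name):
        def hook(module, inputs, outputs):
            tensors = outputs if isinstance(outputs, (tuple, list)) else (outputs,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"inf/nan detected in the output of {name}")
        return hook

    # Keep the handles so the hooks can be removed later with handle.remove().
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
```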
We’re using ZeRO3. The issue, though, is that it’s causing tons of locking: batches randomly take 5 minutes. It also keeps crashing every few hours, so I’m not entirely sure what’s going on. All of the crashes are known ZeRO3 issues, so I might go poke them.
Thank you!
It looks like the version of DeepSpeed we are running (0.3.11) prevents us from running that example on our hardware. We are in the process of updating DeepSpeed to a newer version (>0.3.12) so that it is not caught by line 287 of `integrations.py`.

I’m running this on a 24GB RTX-3090, and while it’s not converging, it’s not getting NaNs:
Hi! As we’re doing a few changes to the implementation to make it cleaner over in https://github.com/huggingface/transformers/pull/10985, we ran a quick training to ensure that the model could still train.
We leveraged @Xirider’s script detailed in https://github.com/Xirider/finetune-gpt2xl in order to fine-tune the 1.3B checkpoint, and we did see a decrease in the loss over this small sample:
We didn’t investigate further, but this allows fine-tuning the 1.3B variant on a single V100 GPU.
cc @patil-suraj
We are working on producing a minimal example for you currently. After checking our internal documents, we realized that 1.3B is bf16 whereas 2.7B is fp32.
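As a side note, one way to check what precision a published checkpoint is stored in (which can differ from the precision it was trained in) is to inspect the state dict directly; the filename below assumes the weights are hosted as pytorch_model.bin:

```python
import torch
from huggingface_hub import hf_hub_download

# Download only the weight file and look at the stored parameter dtypes.
path = hf_hub_download("EleutherAI/gpt-neo-1.3B", "pytorch_model.bin")
state_dict = torch.load(path, map_location="cpu")
print({param.dtype for param in state_dict.values()})
```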