transformers: FP16 overflow with GPT-Neo when using sequence lengths of 2048.

Environment info

  • transformers version: 4.5.0.dev0
  • Platform: Linux-5.4.0-54-generic-x86_64-with-glibc2.29
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.8.0+cu111
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@stas00

Models:

  • GPT-Neo 1.3b

Information

Model I am using (Bert, XLNet …): GPT-Neo 1.3B

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Use GPT-Neo 1.3B with The Pile dataset and the built-in Trainer. Artificial data also suffices; it does not matter what the data is, as long as the attention mask spans all 2048 tokens (see the sketch below).
  2. Enable FP16 and set max_length to 2048.
  3. Observe that all losses reported are NaN.
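
A minimal artificial-data repro sketch of the steps above (illustrative rather than the exact script we ran; the key ingredients are fp16=True and 2048-token inputs with a full attention mask):

from datasets import Dataset
from transformers import GPTNeoForCausalLM, Trainer, TrainingArguments

seq_len = 2048
# Constant artificial data: what matters is that the attention mask covers all 2048 positions.
data = Dataset.from_dict({
    "input_ids": [[10] * seq_len] * 8,
    "attention_mask": [[1] * seq_len] * 8,
    "labels": [[10] * seq_len] * 8,
})

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
args = TrainingArguments(output_dir="out", fp16=True, per_device_train_batch_size=1,
                         max_steps=5, logging_steps=1)
Trainer(model=model, args=args, train_dataset=data).train()  # logged losses come out as NaN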

Also reproducible using AMP or DeepSpeed. It seems there is code intended to circumvent this in the GPT-Neo implementation, where q, k, v are cast to fp32 in the attention block.
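
For illustration, this is the general pattern that cast refers to (a sketch, not the exact GPT-Neo code in transformers): the attention scores and softmax are computed in fp32, where fp16 would otherwise overflow, and the result is cast back to the working dtype.

import torch

def attn_in_fp32(query, key, value, attention_mask=None):
    # Upcast to fp32 for the score matmul and softmax, the usual overflow points under fp16.
    dtype = query.dtype
    scores = torch.matmul(query.float(), key.float().transpose(-1, -2))
    if attention_mask is not None:
        scores = scores + attention_mask
    probs = torch.softmax(scores, dim=-1).to(dtype)
    return torch.matmul(probs, value)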

When max_length is shorter (e.g. 512), this overflow does not occur.

Expected behavior

I expected no overflows.

Aside

I’m reaching out on behalf of EleutherAI; Lysandre told us to create an issue about this.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 62 (39 by maintainers)

Most upvoted comments

In general, if you want users to be able to use fp16 mixed precision for fine-tuning and inference, you need to pre-train the model in that mode. For some models we find workarounds that localize the switch to fp32 to the specific submodules that underflow/overflow under fp16, but users often still get NaNs during long training.

Bottom line: if you pre-train in bf16, be prepared to tell users to use fp32 or bf16 in their fine-tuning/inference processes. As new hardware supporting the bf16/tf32 formats emerges (rtx-3090 + a100), this will become the simple go-to solution in the future.

Now that DeepSpeed will have a full-fp32 mode, this is great.

So to summarize, at this moment with Samyam’s branch, if you use:

  • zero2: you just need to set fp16.enabled=false in the ds config
  • zero3: same as above, plus zero.Init(dtype=torch.float) is needed in modeling_utils.py (instead of just zero.Init()) - I need to think about how to make that configurable. A minimal sketch of both follows below.
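
Here is a hedged sketch of those two workarounds in isolation; the exact place this plugs into modeling_utils.py / the Trainer integration may differ, and build_model() is just a stand-in for the usual from_pretrained call.

import torch
import deepspeed

# zero2: disable fp16 in the DeepSpeed config so everything stays in fp32.
# (This dict is the content you would put in the json passed via --deepspeed.)
ds_config = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": False},
    "train_micro_batch_size_per_gpu": 1,
}

# zero3: additionally the parameters have to be partitioned in fp32, i.e. the
# zero.Init context needs an explicit dtype instead of the default half precision.
with deepspeed.zero.Init(dtype=torch.float):
    model = build_model()  # hypothetical constructor, stands in for from_pretrained(...)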

I’m asking DeepSpeed devs if they have some ideas on how to overcome this; I will keep you posted if we find a good intermediary solution.

But at the very least we now know why the model fails under fp16.

I wonder if pre-training processes targeted for mixed-precision use should have a loss penalty component that forces the model to remain within the fp16 dynamic range, at both the upper and lower end.
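
To illustrate the idea (purely a sketch of the suggestion above, not an established recipe), such a penalty could look roughly like this, with margin and weight as made-up hyperparameters:

import torch

FP16_MAX = 65504.0  # largest finite fp16 value

def fp16_range_penalty(activations, margin=0.5, weight=1e-4):
    # Penalize only the portion of |a| that exceeds margin * FP16_MAX,
    # nudging activations to stay inside the fp16 dynamic range.
    excess = torch.relu(activations.abs() - margin * FP16_MAX)
    return weight * excess.sum()

# total_loss = lm_loss + fp16_range_penalty(hidden_states)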

Yeah the minimal example removes all evaluation. In FP32 it does work though. I tested a checkpoint the other day.

Yes, but large logits are a potential symptom of what’s going on in the network.

I’ve just created a new debug tool that helps diagnose the activation overflow issue; it’s just waiting for review to complete, but if you want to try it sooner please grab this branch: https://github.com/huggingface/transformers/pull/11274

and add --debug activation_overflow to the training command line. It will abort and dump a trace of the last 40 forward-call inputs/outputs preceding the inf/nan, which should hopefully give an indication of where the problem is.
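
The same detector can also be attached programmatically; the sketch below assumes the helper name from the merged version of that tooling (DebugUnderflowOverflow), which may differ from the flag/class name on the PR branch above.

from transformers import GPTNeoForCausalLM
from transformers.debug_utils import DebugUnderflowOverflow

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").half().cuda()
debug = DebugUnderflowOverflow(model)  # registers forward hooks on every submodule
# run the usual training/forward steps; on the first inf/nan it aborts and prints
# the saved frames of recent module inputs/outputs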

ZeRO3

Using ZeRO3

The issue though is that it’s causing tons of locking: batches randomly take 5 min. It also keeps crashing every few hours, so I’m not entirely sure what’s going on. All of the crashes are known ZeRO3 issues, so I might go poke them.

Thank you!

It looks like the version of DeepSpeed we are running (0.3.11) prevents us from running that example on our hardware. We are in the process of updating DeepSpeed to a newer version (>0.3.12) so that it is not caught by line 287 of integrations.py.

I’m running this on a 24GB rtx-3090, and while it’s not converging, it’s not getting NaNs:

git clone https://github.com/huggingface/transformers
cd transformers
git clone https://github.com/Xirider/finetune-gpt2xl
rm -rf output_dir; PYTHONPATH=src USE_TF=0 deepspeed --num_gpus=1 examples/language-modeling/run_clm.py \
--deepspeed finetune-gpt2xl/ds_config_gptneo.json \
--model_name_or_path EleutherAI/gpt-neo-1.3B \
--train_file finetune-gpt2xl/train.csv \
--validation_file finetune-gpt2xl/validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir output_dir \
--num_train_epochs 1 \
--eval_steps 15 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 1 \
--use_fast_tokenizer False \
--learning_rate 5e-06 \
--warmup_steps 10 --logging_steps 5 --block_size 2048

Hi! As we’re making a few changes to the implementation to make it cleaner over in https://github.com/huggingface/transformers/pull/10985, we ran a quick training to ensure that the model could still train.

We leveraged @Xirider’s script detailed in https://github.com/Xirider/finetune-gpt2xl in order to fine-tune the 1.3B checkpoint, and we did see a decrease in the loss over this small sample (loss-curve screenshot in the original comment).

We didn’t investigate further, but this allows fine-tuning the 1.3B variant on a single V100 GPU.

cc @patil-suraj

We are working on producing a minimal example for you currently. After checking our internal documents, we realized that 1.3B is bf16 whereas 2.7B is fp32.