transformers: Sharded DDP training fails with seq2seq models

Information

Model I am using (Bert, XLNet …): T5/BART/mBART/Marian

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: seq2seq
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Run

python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/finetune_trainer.py \
--model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir \
~/Downloads/wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler \
--src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 \
--n_train 500 --sharded_ddp

will fail with

Traceback (most recent call last):
File "examples/seq2seq/finetune_trainer.py", line 379, in <module>
main()
File "examples/seq2seq/finetune_trainer.py", line 316, in main
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/home/sgugger/git/transformers/src/transformers/trainer.py", line 821, in train
self.optimizer.step()
File "/home/sgugger/.pyenv/versions/base/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
return wrapped(*args, **kwargs)
File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 210, in step
self._broadcast_params()
File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 522, in _broadcast_params
if self.should_bucket_param[param]:
KeyError: Parameter containing:
tensor([[-0.0296,  0.0038],
[ 0.0000,  0.0000],
[ 0.0298,  0.0385],
...,
[-0.0161, -0.0024],
[ 0.0022, -0.0576],
[ 0.0053,  0.0256]], device='cuda:1')
0%|   

Using FP16 also fails.

Expected behavior

The script should run to completion.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 46 (21 by maintainers)

Most upvoted comments

Awesome! Thank you, @blefaudeux!

Could you please let me know when it’s on pypi - then I will retest and we will merge the doc PR

cc @msbaines , I don’t have rights on https://pypi.org/project/fairscale/ actually, and I don’t think that it’s automatically tied to our github releases

I’m trying to find a better solution to the other issue you were seeing with the bucketing (now fixed on fairscale master for a single node), and it just occurred to me that the two could be related: could it be that the models change devices after the sharded optimizer is built?

edit: just checked, that’s not the case - rank/device match at construction time and during the first step

thanks for the backtrace! So, for one, the fact that the grads are not all the same is expected with this method: the grads are sharded across the ranks (i.e. partitioned), depending on which parameters each rank will optimize. The ShardedGradScaler should be aware of that and sync between the ranks so that they all get the same knowledge - it looks like this somehow fails. Having a quick look right now.
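
(For reference, a minimal self-contained sketch of the pattern described above, assuming fairscale’s OSS and ShardedGradScaler APIs - this is not the trainer’s actual code, just an illustration of a shard-aware fp16 step:)

import torch
import torch.distributed as dist
from fairscale.optim import OSS
from fairscale.optim.grad_scaler import ShardedGradScaler

def sharded_fp16_step(model, optimizer, scaler, batch, targets):
    # Forward/backward under autocast; the scaler multiplies the loss so that
    # small fp16 gradients do not underflow.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(batch), targets)
    scaler.scale(loss).backward()
    # ShardedGradScaler syncs the inf/nan checks across ranks before stepping,
    # since each rank only holds the gradients for its own shard of the params.
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

if __name__ == "__main__":
    # Launch with: python -m torch.distributed.launch --nproc_per_node=2 <this file>
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    model = torch.nn.Linear(8, 8).cuda()
    optimizer = OSS(model.parameters(), optim=torch.optim.AdamW, lr=1e-3)
    scaler = ShardedGradScaler()
    x, y = torch.randn(4, 8, device="cuda"), torch.randn(4, 8, device="cuda")
    sharded_fp16_step(model, optimizer, scaler, x, y)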

(Well done on the zombie process destruction! Oddly enough, if a zombie process is still around, the next run “works”.)

I think the questions you’re asking about are all in this training_step code:

https://github.com/huggingface/transformers/blob/dc9f24544291b25b44c9e87239a0ef4355396a4c/src/transformers/trainer.py#L1126-L1146

I didn’t write it, but from a quick read it appears that it’s a yes to all of your suggestions.

self.use_amp means native amp and use_apex means apex - so we are talking about native amp here, i.e. the branches where use_amp == True.

I’ll step through with a debugger to verify that this is actually the case.
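
(To spell out the branches being discussed - roughly what the linked training_step does, paraphrased from a quick read rather than copied verbatim, so see the linked lines for the exact code; self.use_amp, self.scaler, compute_loss etc. are the trainer’s own attributes:)

from torch.cuda.amp import autocast

def training_step(self, model, inputs):
    model.train()
    inputs = self._prepare_inputs(inputs)

    if self.use_amp:          # native amp: forward under autocast
        with autocast():
            loss = self.compute_loss(model, inputs)
    else:
        loss = self.compute_loss(model, inputs)

    if self.args.gradient_accumulation_steps > 1:
        loss = loss / self.args.gradient_accumulation_steps

    if self.use_amp:          # native amp: backward on the scaled loss
        self.scaler.scale(loss).backward()
    elif self.use_apex:       # apex branch
        from apex import amp
        with amp.scale_loss(loss, self.optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()

    return loss.detach()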

The first problem (fp16) is easily fixed; it means the doc is not good enough. Torch’s grad scaler is not shard-aware (the ranks do not have all the gradients with this technique), but you can use fairscale’s ShardedGradScaler instead and that should work.

We are using it already: https://github.com/huggingface/transformers/blob/dc9f24544291b25b44c9e87239a0ef4355396a4c/src/transformers/trainer.py#L315
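
(Paraphrasing the linked line rather than quoting it: the shard-aware scaler is chosen when --sharded_ddp is active, the stock torch one otherwise - something along these lines:)

from torch.cuda.amp import GradScaler
from fairscale.optim.grad_scaler import ShardedGradScaler

def make_grad_scaler(sharded_ddp: bool):
    # torch's GradScaler assumes each rank sees all gradients; with sharded DDP
    # only the shard-aware variant can run the inf checks correctly.
    return ShardedGradScaler() if sharded_ddp else GradScaler()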

if I print the object just before it fails in self.scaler.step(self.optimizer), I get: <fairscale.optim.grad_scaler.ShardedGradScaler object at 0x7ff27034bac0>

https://github.com/huggingface/transformers/blob/dc9f24544291b25b44c9e87239a0ef4355396a4c/src/transformers/trainer.py#L818

FWIW, I experience the exact same issue with DeepSpeed if I leave the trainer’s --fp16 code in place - if I remove it and let DeepSpeed handle fp16 itself, the failure goes away. So the common denominator is our code.

Thank you so much @blefaudeux and @msbaines for your follow up.

To reproduce:

# setup 
git clone  https://github.com/huggingface/transformers
cd transformers
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz 
tar -xzvf wmt_en_ro.tar.gz

To reproduce the second failure (the one w/o --fp16):

export BS=4; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 \
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 \
--data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler \
--src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 \
--warmup_steps 500 --n_train 500 --sharded_ddp

and to reproduce the first one, just add --fp16 to the same command.

This is a tiny model that is good enough for testing the mechanics, so no good results are to be expected. It’s also very quick to download and load. To see real results, swap sshleifer/tiny-mbart for sshleifer/distill-mbart-en-ro-12-4.

We initialize the should_bucket_param dictionary when the OSS optimizer is created. The assumption is that parameters should be frozen at this point. Any chance parameters are modified after the optimizer was created?
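
(A minimal illustration of that failure mode - this is not fairscale’s actual code, just a toy dict keyed by parameter identity, the way should_bucket_param is: replacing a Parameter object after the dict is built reproduces exactly the KeyError above.)

import torch

model = torch.nn.Linear(2, 2)

# Bookkeeping built at optimizer-construction time, keyed by the Parameter
# objects that exist at that moment (tensors hash by identity).
should_bucket_param = {p: True for p in model.parameters()}

# If a parameter is later replaced rather than updated in place (e.g. the
# weight is re-assigned or the module is rebuilt), the new object is no
# longer a key in the dict.
model.weight = torch.nn.Parameter(model.weight.data.clone())

for p in model.parameters():
    should_bucket_param[p]   # KeyError on the replaced weight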

This is just a brief log of the two distinct errors mentioned in the OP:

w/ --fp16 the failure is:

  File "./finetune_trainer.py", line 379, in <module>
    main()
  File "./finetune_trainer.py", line 315, in main
    trainer.train(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 818, in train
    self.scaler.step(self.optimizer)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 330, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.

w/o --fp16 the failure is:

  File "./finetune_trainer.py", line 379, in <module>
    main()
  File "./finetune_trainer.py", line 315, in main
    trainer.train(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 821, in train
    self.optimizer.step()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/fairscale/optim/oss.py", line 210, in step
    self._broadcast_params()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/fairscale/optim/oss.py", line 522, in _broadcast_params
    if self.should_bucket_param[param]:
KeyError: Parameter containing:
tensor([[ ...]], device='cuda:1')

It’s the very first parameter model.shared.weight in the case of mbart for example.
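
(A hypothetical helper for mapping the tensor printed in the KeyError back to a parameter name - name_of_param below is not part of transformers, just a debugging convenience:)

import torch
from transformers import AutoModelForSeq2SeqLM

def name_of_param(model, tensor):
    # Match either by object identity or by underlying storage pointer.
    for name, p in model.named_parameters():
        if p is tensor or p.data_ptr() == tensor.data_ptr():
            return name
    return "<not a registered parameter>"

model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/tiny-mbart")
print(name_of_param(model, model.model.shared.weight))  # -> "model.shared.weight"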

To test with t5 (same errors), run:

export BS=4; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 \
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path patrickvonplaten/t5-tiny-random --output_dir output_dir --adam_eps 1e-06 \
--data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler \
--task translation_en_XX_to_ro_RO --val_max_target_length 128 --warmup_steps 500 \
--n_train 500 --sharded_ddp