transformers: Sharded DDP training fails with seq2seq models
Information
Model I am using (Bert, XLNet …): T5/BART/mBART/Marian
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: seq2seq
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
Run
python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/finetune_trainer.py \
--model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir \
~/Downloads/wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler \
--src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 \
--n_train 500 --sharded_ddp
will fail with
Traceback (most recent call last):
File "examples/seq2seq/finetune_trainer.py", line 379, in <module>
main()
File "examples/seq2seq/finetune_trainer.py", line 316, in main
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/home/sgugger/git/transformers/src/transformers/trainer.py", line 821, in train
self.optimizer.step()
File "/home/sgugger/.pyenv/versions/base/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
return wrapped(*args, **kwargs)
File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 210, in step
self._broadcast_params()
File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 522, in _broadcast_params
if self.should_bucket_param[param]:
KeyError: Parameter containing:
tensor([[-0.0296, 0.0038],
[ 0.0000, 0.0000],
[ 0.0298, 0.0385],
...,
[-0.0161, -0.0024],
[ 0.0022, -0.0576],
[ 0.0053, 0.0256]], device='cuda:1')
Using FP16 also fails.
Expected behavior
The script should run to completion.
About this issue
- State: closed
- Created 4 years ago
- Comments: 46 (21 by maintainers)
cc @msbaines, I don't actually have rights on https://pypi.org/project/fairscale/, and I don't think it's automatically tied to our GitHub releases.
I'm trying to find a better solution to the other issue you were seeing with the bucketing (now fixed on fairscale master for a single node), and it just occurred to me that the two could be tied: could it be that the models change devices after the sharded optimizer is built?
edit: just checked that, it's not the case; rank/device match at construction time and during the first step
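For anyone who wants to run the same sanity check, here is a minimal sketch (hypothetical helper names, not the actual debugging code used in this thread) that snapshots each parameter's device when the sharded optimizer is built and re-verifies it at the first step:

```python
import torch

def snapshot_param_devices(model: torch.nn.Module) -> dict:
    """Record the device of every parameter, keyed by parameter name."""
    return {name: p.device for name, p in model.named_parameters()}

def check_param_devices(model: torch.nn.Module, snapshot: dict) -> None:
    """Raise if any parameter is now on a different device than in the snapshot."""
    for name, p in model.named_parameters():
        if p.device != snapshot[name]:
            raise RuntimeError(f"{name} moved from {snapshot[name]} to {p.device}")

# Usage: take the snapshot right after the OSS optimizer is constructed,
# then call check_param_devices(model, snapshot) inside the first training step.
```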
https://github.com/huggingface/transformers/blob/dc9f24544291b25b44c9e87239a0ef4355396a4c/src/transformers/training_args.py#L472-L474
Thanks for the backtrace! So, for one, the fact that the grads are not all the same is expected with this method: the grads are sharded across the ranks (i.e. partitioned), depending on which parameters each rank will optimize. The ShardedGradScaler should be aware of that and sync between the ranks to make sure they all get the same knowledge, so it looks like that sync fails somehow. Having a quick look right now.
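For readers following along, the pattern under discussion looks roughly like this (a minimal sketch, not the actual Trainer integration; it assumes the torch.distributed process group is already initialized, and import paths/signatures may differ between fairscale versions):

```python
import torch
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim import OSS
from fairscale.optim.grad_scaler import ShardedGradScaler

# Assumes torch.distributed.init_process_group() has already been called and
# this process owns one GPU.
model = torch.nn.Linear(8, 8).cuda()

# OSS shards the optimizer state across ranks; each rank only updates its shard.
optimizer = OSS(model.parameters(), optim=torch.optim.AdamW, lr=3e-5)
model = ShardedDataParallel(model, optimizer)

# ShardedGradScaler replaces torch.cuda.amp.GradScaler so the inf/nan checks
# are synced across ranks, since each rank only sees its own gradient shard.
scaler = ShardedGradScaler()

inputs = torch.randn(4, 8, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(inputs).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```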
(Well done on the zombie process destruction! Now, somehow, if a zombie process is still around, the next run “works”.)
I think the questions you're asking about are all in this training_step code: https://github.com/huggingface/transformers/blob/dc9f24544291b25b44c9e87239a0ef4355396a4c/src/transformers/trainer.py#L1126-L1146
I didn’t write it, but from a quick read it appears that it’s a yes to all of your suggestions.
self.use_amp = native amp, use_apex = apex - so we are talking native amp here, that is, the branches with use_amp = True. I'll step through with a debugger to check that this is actually so.
We are using it already: https://github.com/huggingface/transformers/blob/dc9f24544291b25b44c9e87239a0ef4355396a4c/src/transformers/trainer.py#L315
If I print the object just before it fails in self.scaler.step(self.optimizer), I get: <fairscale.optim.grad_scaler.ShardedGradScaler object at 0x7ff27034bac0>
https://github.com/huggingface/transformers/blob/dc9f24544291b25b44c9e87239a0ef4355396a4c/src/transformers/trainer.py#L818
FWIW, I experience the exact same issue with deepspeed if I leave the trainer's --fp16 code in - if I remove it and let deepspeed handle fp16, the failure goes away. So the common denominator is our code.
Thank you so much @blefaudeux and @msbaines for your follow-up.
To reproduce:
To reproduce the 2nd failure, run without --fp16; to get the first one, just add --fp16.
This is a tiny model that is good enough for testing the mechanics, so no good results are to be expected. It's also very quick to download and load. To see real results, swap sshleifer/tiny-mbart for sshleifer/distill-mbart-en-ro-12-4.
We initialize the should_bucket_param dictionary when the OSS optimizer is created. The assumption is that parameters should be frozen at this point. Any chance parameters are modified after the optimizer was created?
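To make that failure mode concrete, here is a toy illustration (an illustration of the assumed mechanism, not fairscale's actual code) of how a dict keyed by Parameter objects, like should_bucket_param, starts raising KeyError once a parameter is re-created after the dict was built:

```python
import torch

# Toy stand-in for should_bucket_param: keyed by the Parameter objects that
# existed when the optimizer was created.
layer = torch.nn.Linear(2, 2)
should_bucket_param = {p: True for p in layer.parameters()}

# If something later replaces a Parameter object (rather than updating it
# in place), e.g. by re-creating or re-tying a weight...
layer.weight = torch.nn.Parameter(layer.weight.detach().clone())

# ...lookups with the new object fail, even though its values and shape are
# identical, which matches the shape of the KeyError in the traceback above.
for p in layer.parameters():
    print(should_bucket_param[p])  # raises KeyError: Parameter containing: tensor(...)
```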
This is just a brief log of the 2 distinct errors mentioned in the OP:
w/ --fp16 the failure is:
w/o --fp16 the failure is:
It's the very first parameter, model.shared.weight in the case of mbart for example.
To test with t5 (same errors), run: