pytorch-lightning: distributed training: ModelCheckpoint is receiving bad data
You can reproduce in 4 minutes on 0.9.0. I tried master and got an unrelated wandb error and gave up trying to reproduce there.
You must be on a machine with multiple GPUs.
git clone git@github.com:huggingface/transformers.git
cd transformers
pip install -e .
pip install -e .[examples] # installs pytorch-lightning==0.8.5
git checkout pl-checkpoint-bug
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export MAX_LEN=128
export m=sshleifer/student_marian_en_ro_6_3
python finetune.py \
--learning_rate=3e-4 \
--do_train \
--do_predict \
--fp16 \
--val_check_interval 0.25 \
--data_dir wmt_en_ro \
--max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
--freeze_encoder --freeze_embeds \
--train_batch_size=64 --eval_batch_size=64 \
--tokenizer_name $m --model_name_or_path $m \
--warmup_steps 500 --sortish_sampler --logger_name wandb \
--fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 \
--gpus 8 --num_train_epochs=1 \
--output_dir dmar_pl_only_v3 --save_top_k=10
Results
ls dmar_pl_only_v3/*.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:58 dmar_pl_only_v3/val_avg_bleu=23.3951-step_count=5.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:57 dmar_pl_only_v3/val_avg_bleu=23.2619-step_count=4.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:56 dmar_pl_only_v3/val_avg_bleu=22.6724-step_count=3.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:56 dmar_pl_only_v3/val_avg_bleu=22.2664-step_count=2.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:55 dmar_pl_only_v3/val_avg_bleu=23.2263-step_count=1.ckpt
There are 5 checkpoints with much lower scores. PL thinks the best checkpoint is from step 5, but
cat dmar_pl_only_v3/metrics.json | grep bleu
"val_avg_bleu": 26.4513,
"val_avg_bleu": 25.5289,
"val_avg_bleu": 25.6942,
"val_avg_bleu": 26.2227,
"val_avg_bleu": 25.8546,
(the best checkpoint is step 1)
When I evaluate offline on the best checkpoint without truncation, I get val_bleu = 27+, which makes me nearly certain that the numbers in metrics.json (which I create and save in finetune.py) are correct and the numbers in the saved checkpoint paths are incorrect.
Is this a known issue with a workaround? How can I fix it? This should be high priority because suboptimal checkpoint saving is a huge productivity drain.
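One possible workaround (a sketch only, assuming the value each rank hands to ModelCheckpoint really is rank-local): all-reduce the metric inside validation_epoch_end so every process returns the global mean. The "bleu" key and the shape of outputs are illustrative, not exactly what finetune.py does.

# Sketch of a possible workaround, not what finetune.py currently does:
# average the rank-local BLEU across all DDP processes so the value that
# reaches ModelCheckpoint (and the checkpoint filename) is the global mean.
import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # "bleu" is an illustrative per-batch key
    local_bleu = torch.tensor(
        sum(float(x["bleu"]) for x in outputs) / len(outputs),
        device=next(self.parameters()).device,
    )
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(local_bleu, op=dist.ReduceOp.SUM)
        local_bleu /= dist.get_world_size()
    avg_bleu = local_bleu.item()
    # ModelCheckpoint monitors val_avg_bleu, so make sure the reduced value
    # is what gets returned/logged here
    return {"val_avg_bleu": avg_bleu, "log": {"val_avg_bleu": avg_bleu}}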
Additional Notes:
- The numbers logged to wandb are also the low/wrong ones.
- On 1 or 2 GPUs the numbers are identical! (See the diagnostic sketch below.)
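A throwaway diagnostic for checking this (my own sketch; DumpCallbackMetrics is not a PL API): print what each DDP process has in trainer.callback_metrics right after validation, i.e. exactly what ModelCheckpoint and the wandb logger get to see.

import torch.distributed as dist
from pytorch_lightning.callbacks import Callback

class DumpCallbackMetrics(Callback):
    # Throwaway diagnostic: show the monitored value each DDP rank sees
    # in trainer.callback_metrics after every validation run.
    def on_validation_end(self, trainer, pl_module):
        rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
        print(f"rank={rank} sees val_avg_bleu={trainer.callback_metrics.get('val_avg_bleu')}")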
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (20 by maintainers)
I can confirm that on pytorch-lightning master and transformers master the metrics/ckpts are in sync. I modified the cmd slightly by disabling wandb and fixing the data_dir path.
When using --do_predict there is an unrelated issue in Transformers which needs to be fixed in finetune.py; I think main gets run again.
As @Borda said, I'll get a test in place to ensure the file path metrics are correct!
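Roughly, such a check could look like this (a sketch only, not the actual test that will land in either repo): parse the metric embedded in each checkpoint filename and assert it also appears in metrics.json.

import re
from pathlib import Path

def check_ckpt_names_match_metrics(output_dir, metric="val_avg_bleu"):
    # Values finetune.py wrote to metrics.json
    blob = Path(output_dir, "metrics.json").read_text()
    logged = {round(float(v), 4) for v in re.findall(rf'"{metric}": ([0-9.]+)', blob)}
    # Values ModelCheckpoint embedded in the checkpoint filenames
    for ckpt in Path(output_dir).glob("*.ckpt"):
        match = re.search(rf"{metric}=([0-9.]+)", ckpt.name)
        if match is not None:
            value = round(float(match.group(1)), 4)
            assert value in logged, f"{ckpt.name}: {value} never appears in metrics.json"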
@justusschock @ananyahjha93 just wanted to clarify that this has nothing to do with the metrics package, since this calculates the BLEU score using another package.
Where in the PL code does trainer.callback_metrics gather data from all nodes?
@sshleifer I'm verifying this, but you might be right; I too felt in one of my previous training runs that the epoch with the minimum validation loss wasn't saved.
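A quick way to check (hedged sketch; call it from a Callback.on_validation_end hook once the DDP process group is initialized): explicitly all_gather the value from trainer.callback_metrics on every rank. If callback_metrics were already reduced for us, all ranks would print the same number.

import torch
import torch.distributed as dist

def compare_local_vs_gathered(trainer, pl_module, key="val_avg_bleu"):
    # Gather the rank-local value of `key` from every DDP process and print
    # both the local value and the full list, so any divergence is obvious.
    local = torch.as_tensor(
        float(trainer.callback_metrics[key]),
        device=next(pl_module.parameters()).device,
    )
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    print(f"rank {dist.get_rank()}: local={local.item():.4f}, "
          f"all ranks={[round(g.item(), 4) for g in gathered]}")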