pytorch-lightning: distributed training: ModelCheckpoint is receiving bad data

You can reproduce this in 4 minutes on 0.9.0. I tried master, hit an unrelated wandb error, and gave up trying to reproduce there.

You must be on a machine with multiple GPUs.

git clone git@github.com:huggingface/transformers.git
cd transformers
pip install -e .
pip install -e .[examples]  # installs pytorch-lightning==0.8.5
git checkout pl-checkpoint-bug
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz

export MAX_LEN=128
export m=sshleifer/student_marian_en_ro_6_3

python finetune.py \
  --learning_rate=3e-4 \
  --do_train \
  --do_predict \
  --fp16 \
  --val_check_interval 0.25 \
  --data_dir wmt_en_ro \
  --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
  --freeze_encoder --freeze_embeds \
  --train_batch_size=64 --eval_batch_size=64 \
  --tokenizer_name $m --model_name_or_path $m \
  --warmup_steps 500 --sortish_sampler --logger_name wandb \
  --fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 \
  --gpus 8 --num_train_epochs=1 \
  --output_dir dmar_pl_only_v3 --save_top_k=10
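
For context on the filenames that show up under Results below: ModelCheckpoint formats whatever value it sees for the monitored metric straight into the checkpoint name. A rough 0.9-style sketch of a callback configured that way (illustrative only, not necessarily the exact callback finetune.py builds):

# Sketch of a PL 0.9-style checkpoint callback whose filenames embed the
# monitored metric, e.g. "val_avg_bleu=23.3951-step_count=5.ckpt".
# Illustrative only; finetune.py's actual callback may be configured differently.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    filepath="dmar_pl_only_v3/{val_avg_bleu:.4f}-{step_count}",
    monitor="val_avg_bleu",
    mode="max",
    save_top_k=10,
)
# If the "val_avg_bleu" the callback receives differs from the value written
# to metrics.json, the filename will reflect the callback's value, not the file's.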

Results

ls -l dmar_pl_only_v3/*.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:58 dmar_pl_only_v3/val_avg_bleu=23.3951-step_count=5.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:57 dmar_pl_only_v3/val_avg_bleu=23.2619-step_count=4.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:56 dmar_pl_only_v3/val_avg_bleu=22.6724-step_count=3.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:56 dmar_pl_only_v3/val_avg_bleu=22.2664-step_count=2.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:55 dmar_pl_only_v3/val_avg_bleu=23.2263-step_count=1.ckpt

There are 5 checkpoints, with much lower scores. PL thinks the best checkpoint is from step 5, but

cat dmar_pl_only_v3/metrics.json | grep bleu
            "val_avg_bleu": 26.4513,
            "val_avg_bleu": 25.5289,
            "val_avg_bleu": 25.6942,
            "val_avg_bleu": 26.2227,
            "val_avg_bleu": 25.8546,

(according to metrics.json, the best checkpoint is from step 1)

When I evaluate the best checkpoint offline without truncation, I get val_bleu = 27+, which makes me nearly certain that the numbers in metrics.json (which I create and save in finetune.py) are correct and the numbers in the saved paths are incorrect.

Is this a known issue with a workaround? How can I fix it? This should be high priority, because saving suboptimal checkpoints is a huge productivity drain.

Additional Notes:

  • The numbers logged to wandb are also the low/wrong ones.
  • On 1 or 2 GPUs the numbers are identical! (see the sketch below)
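
The fact that 1 or 2 GPUs look fine while 8 do not makes me suspect the value being formatted into the filename is a single process's number rather than the cross-process average. The sketch below shows the kind of sync I mean, assuming DDP is initialized; the helper name and the validation_epoch_end wiring are illustrative, not what finetune.py currently does.

# Sketch only: average a per-rank scalar over all DDP processes before it is
# logged, so every rank (and the checkpoint callback) sees the same number.
import torch
import torch.distributed as dist

def average_across_ranks(value: float, device: torch.device) -> float:
    """No-op on a single process; otherwise returns the mean over all ranks."""
    if not (dist.is_available() and dist.is_initialized()):
        return value
    t = torch.tensor(value, dtype=torch.float32, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t / dist.get_world_size()).item()

# Inside validation_epoch_end (hypothetical wiring):
#     bleu = calculate_bleu(preds, targets)            # this rank's shard only
#     bleu = average_across_ranks(bleu, self.device)   # identical on every rank
#     return {"log": {"val_avg_bleu": bleu}, "val_avg_bleu": bleu}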

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

I can confirm that on pytorch-lightning master and transformers master the metrics and checkpoints are in sync. I modified the command slightly by disabling wandb and fixing the data_dir path.

python finetune.py \
  --learning_rate=3e-4 \
  --do_train \
  --fp16 \
  --val_check_interval 0.25 \
  --data_dir wmt_en_ro \
  --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
  --freeze_encoder --freeze_embeds \
  --train_batch_size=64 --eval_batch_size=64 \
  --tokenizer_name $m --model_name_or_path $m \
  --warmup_steps 500 --sortish_sampler  \
  --fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 \
  --gpus 8 --num_train_epochs=1 \
  --output_dir dmar_pl_only_v3 --save_top_k=10

~/transformers/examples/seq2seq$ ls dmar_pl_only_v3/*.ckpt
'dmar_pl_only_v3/val_avg_bleu=21.3473-step_count=1.ckpt'
'dmar_pl_only_v3/val_avg_bleu=21.5114-step_count=2.ckpt'
'dmar_pl_only_v3/val_avg_bleu=23.1029-step_count=3.ckpt'
'dmar_pl_only_v3/val_avg_bleu=23.2499-step_count=4.ckpt'
cat dmar_pl_only_v3/metrics.json | grep bleu
            "val_avg_bleu": 21.3473,
            "val_avg_bleu": 21.51145,
            "val_avg_bleu": 23.10295,
            "val_avg_bleu": 23.24995,

When using --do_predict there is an unrelated issue in Transformers that needs to be fixed in finetune.py; I think main() gets run again:

Traceback (most recent call last):
  File "/home/jovyan/transformers/examples/seq2seq/finetune.py", line 440, in <module>
    main(args)
  File "/home/jovyan/transformers/examples/seq2seq/finetune.py", line 376, in main
    raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
ValueError: Output directory (dmar_pl_only_v3) already exists and is not empty.
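
A possible guard for that (just a sketch of what I mean; the real fix may look different) is to only reject a non-empty output_dir when a fresh training run is requested:

# Hypothetical guard for finetune.py's main(): only refuse a non-empty
# output_dir when we are about to train, so a follow-up --do_predict pass
# over the same directory does not raise the ValueError above.
import os

def check_output_dir(args) -> None:
    if args.do_train and os.path.isdir(args.output_dir) and os.listdir(args.output_dir):
        raise ValueError(
            "Output directory ({}) already exists and is not empty.".format(args.output_dir)
        )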

As @Borda said, I'll get a test in place to ensure the metrics in the file paths are correct!
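
Something along these lines would do it, as a rough sketch: it assumes the Nth "val_avg_bleu" entry in metrics.json corresponds to step_count=N, and scans the raw file the same way the grep above does.

# Sketch of a test that the BLEU values baked into checkpoint filenames match
# the values recorded in metrics.json (paths/names taken from the run above).
import re
from pathlib import Path

FILENAME_PAT = re.compile(r"val_avg_bleu=([\d.]+)-step_count=(\d+)\.ckpt")

def bleu_from_filenames(output_dir):
    """Map step_count -> bleu parsed out of the checkpoint filenames."""
    scores = {}
    for path in Path(output_dir).glob("*.ckpt"):
        match = FILENAME_PAT.search(path.name)
        if match:
            scores[int(match.group(2))] = float(match.group(1))
    return scores

def bleu_from_metrics_json(output_dir):
    """All val_avg_bleu values in metrics.json, in the order they were written."""
    text = Path(output_dir, "metrics.json").read_text()
    return [float(x) for x in re.findall(r'"val_avg_bleu":\s*([\d.]+)', text)]

def test_filename_metrics_match(output_dir="dmar_pl_only_v3", tol=1e-3):
    from_json = bleu_from_metrics_json(output_dir)
    for step, bleu in bleu_from_filenames(output_dir).items():
        assert abs(bleu - from_json[step - 1]) < tol, (step, bleu, from_json[step - 1])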

@justusschock @ananyahjha93 Just wanted to clarify that this has nothing to do with the metrics package, since the BLEU score here is calculated with another package.

Where in the PL code does trainer.callback_metrics gather data from all nodes?
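
For reference, a quick way to see what each rank actually hands to the callback (a debugging sketch, assuming the 0.9-era Callback hooks and a trainer.global_rank attribute):

# Debugging sketch: print the metric every rank exposes to callbacks right
# after validation, to check whether the ranks disagree.
from pytorch_lightning.callbacks import Callback

class PrintCallbackMetrics(Callback):
    def on_validation_end(self, trainer, pl_module):
        bleu = trainer.callback_metrics.get("val_avg_bleu")
        print(f"rank={trainer.global_rank} val_avg_bleu={bleu}")

# Pass callbacks=[PrintCallbackMetrics()] to the Trainer; if the printed
# values differ per rank, the checkpoint filename just reflects whichever
# rank did the saving.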

@sshleifer I'm verifying this, but you might be right; I also had the feeling in one of my previous training runs that the epoch with the best validation score wasn't saved.