speechbrain: [Bug]: Incorrect LR reloading when training is interrupted

Describe the bug

Hello,

I need help! I am currently training speech translation models based on my recipe (https://github.com/speechbrain/speechbrain/blob/develop/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml). Although the dataset is not the same, all relevant model and checkpointing variables are kept identical to that yaml.

The problem I am currently having is that once training is interrupted (e.g. a server timeout) and I have to relaunch the model, the Adam learning rate is recovered with a wildly different value, which ruins my model’s learning.

Here is an example from my train_log.txt:

epoch: 94, **lr_adam: 3.73e-13**, lr_wav2vec: 1.29e-09 - train loss: 41.66 - valid loss: 85.96, valid ACC: 6.38e-01, valid BLEU: 26.33,
epoch: 95, **lr_adam: 3.73e-13,** lr_wav2vec: 1.16e-09 - train loss: 41.67 - valid loss: 85.96, valid ACC: 6.38e-01, valid BLEU: 26.35

<Model is interrupted before the end of epoch 96 and recovers from an intermediate checkpoint CKPT+2023-01-22+16-41-09+00 (I save one every 15min)>

epoch: 96, **lr_adam: 1.86e-13**, lr_wav2vec: 1.04e-09 - train loss: 41.98 - valid loss: 98.48, valid ACC: 5.53e-01, valid BLEU: 11.66

However, “lr_adam” is saved by the checkpointer defined in the yaml:

lr_annealing_adam: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.5
    patient: 2

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        wav2vec2: !ref <wav2vec2>
        lr_annealing_adam: !ref <lr_annealing_adam>
        lr_annealing_wav2vec: !ref <lr_annealing_wav2vec>
        counter: !ref <epoch_counter>
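
For context, the checkpointer only saves and restores the objects listed under recoverables; here is a minimal sketch of that mechanism (the toy module and directory name are made up for illustration, this is not the recipe’s code):

import torch
from speechbrain.utils.checkpoints import Checkpointer

# Toy example: only what is registered as a recoverable ends up in the
# checkpoint and gets restored on recovery.
toy_model = torch.nn.Linear(4, 4)
checkpointer = Checkpointer("example_save_dir", recoverables={"model": toy_model})
checkpointer.save_checkpoint()      # writes a CKPT+... directory for the recoverables
checkpointer.recover_if_possible()  # reloads the most recent checkpoint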

This issue didn’t happen before (up to about six months ago I was able to relaunch my models normally). The first time I noticed this “new” behavior was around one month ago. I installed the latest SpeechBrain develop branch a couple of weeks ago, and the behavior persists.

Thanks a lot!!

Expected behaviour

The learning rate is correctly restored by the training function when resuming from a checkpoint.

To Reproduce

You can use the recipe here: https://github.com/speechbrain/speechbrain/tree/develop/recipes/IWSLT22_lowresource

I am not using the same dataset, but I only changed the data_prepare function, so it shouldn’t matter.

Versions

SpeechBrain develop version after the padding fix for wav2vec 2.0 fine-tuning: HEAD is now at 0423bda (Merge pull request #1805 from Adel-Moumen/1794-bug-m1-gpu-mps-support)

Running on NVIDIA A100 80GB, python 3.9.4, torch=1.9.0+cu111

Relevant log output

Above.

Additional context

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 48 (21 by maintainers)

Most upvoted comments

Hi all. Yes, Ha’s fix works! Thanks a lot for the sharp eye on this. I’m gonna close this issue, and I’ll make a PR in the future to fix this in the IWSLT2022 recipe.

Hi, like I said, the IWSLT recipe lacks the registration of these two optimizers in init_optimizers(). I tested it, and it worked fine once they were added there.

I’ll try to get back to you later this week with results for this fix!! Thanks Ha!

Hi,

It seems to me that indeed the learning rates are wrongly initialized!

First of all, I ran the original recipe for some epochs without interruption; the last two epochs look like this:

epoch: 89, lr_adam: 3.81e-09, new_lr_adam: 3.81e-09, lr_wav2vec: 3.04e-08, new_lr_wav2vec: 3.04e-08 - train loss: 1.19e+02 - valid loss: 2.56e+02, valid ACC: 4.12e-01, valid BLEU: 6.64, valid BLEU_extensive: {'BLEU': 6.6440818048508685, 'BP': 0.640629921999702, 'ratio': 0.6918962800248317, 'hyp_len': 14489, 'ref_len': 20941, 'precisions': [40.734350196700944, 14.134732906751024, 6.437092054917848, 3.1215686274509804], 'bleu_score': 6.6440818048508685}
epoch: 90, lr_adam: 3.81e-09, new_lr_adam: 3.81e-09, lr_wav2vec: 3.04e-08, new_lr_wav2vec: 2.74e-08 - train loss: 1.19e+02 - valid loss: 2.56e+02, valid ACC: 4.12e-01, valid BLEU: 6.64, valid BLEU_extensive: {'BLEU': 6.635718112245129, 'BP': 0.6419713655577016, 'ratio': 0.6928990974643044, 'hyp_len': 14510, 'ref_len': 20946.61, 'precisions': [40.702963473466575, 14.07035175879397, 6.411985018726591, 3.1086054341868294], 'bleu_score': 6.635718112245129}

I then killed the training, removed epoch 90, and reran the training, epoch 90 would look like this: epoch: 90, lr_adam: 3.81e-09, new_lr_adam: 3.81e-09, lr_wav2vec: 3.04e-08, new_lr_wav2vec: 3.04e-08 - train loss: 1.67e+02 - valid loss: 2.53e+02, valid ACC: 3.73e-01, valid BLEU: 4.53, valid BLEU_extensive: {'BLEU': 4.5311322252579895, 'BP': 0.4968432953062323, 'ratio': 0.588415070913519, 'hyp_len': 12322, 'ref_len': 20941, 'precisions': [39.40107125466645, 12.741674474065242, 5.439068100358423, 2.5333207297476132], 'bleu_score': 4.5311322252579895}

So BLEU drops by about 2 points, as expected: 6.6 -> 4.5. We can also see that the logged learning rates look correct. But in fact, this is not the case!

In Marcely’s recipe, only the schedulers get saved in the checkpoint. As spotted by @anautsch, these schedulers are recovered without any problem. But this does not affect the actual learning rates the model is trained with until this method is called: sb.nnet.schedulers.update_learning_rate().

So in fact the learning rates are always initialized with the same values (the ones defined in the yaml).
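
To make the gap concrete, here is a stripped-down sketch (the toy model and the hard-coded value are illustrative, and hyperparam_value is an assumption about NewBobScheduler’s internals, not verified here):

import torch
import speechbrain as sb

# Hypothetical illustration, not the recipe's code: the scheduler's recovered
# state and the optimizer's actual lr are two different things until
# update_learning_rate() synchronizes them.
model = torch.nn.Linear(4, 4)
adam_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # the yaml <lr>
scheduler = sb.nnet.schedulers.NewBobScheduler(
    initial_value=1e-3, annealing_factor=0.5, patient=2
)

# Pretend the checkpointer just restored this scheduler from a long run:
scheduler.hyperparam_value = 3.81e-09  # attribute name assumed

print(adam_optimizer.param_groups[0]["lr"])  # still 1e-3, the yaml value
sb.nnet.schedulers.update_learning_rate(adam_optimizer, scheduler.hyperparam_value)
print(adam_optimizer.param_groups[0]["lr"])  # now 3.81e-09, following the scheduler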

To verify my suspicion, I added these lines to the end of init_optimizers():

sb.nnet.schedulers.update_learning_rate(
    self.wav2vec_optimizer, 3.043252722170457e-08
)
sb.nnet.schedulers.update_learning_rate(
    self.adam_optimizer, 3.814697265625e-09
)

in order to update the learning rates to the new_lr_* of epoch 89, and I got this: epoch: 90, lr_adam: 3.81e-09, new_lr_adam: 3.81e-09, lr_wav2vec: 3.04e-08, new_lr_wav2vec: 2.74e-08 - train loss: 1.18e+02 - valid loss: 2.56e+02, valid ACC: 4.12e-01, valid BLEU: 6.65, valid BLEU_extensive: {'BLEU': 6.652833641777302, 'BP': 0.646243978314534, 'ratio': 0.6960985626283368, 'hyp_len': 14577, 'ref_len': 20941, 'precisions': [40.62564313644783, 14.004001143183768, 6.36600819977637, 3.10113760324139], 'bleu_score': 6.652833641777302}

Comparing Marcely’s recipe to the CV recipe, it lacks one important block in init_optimizers() where add_recoverable() is used to also save the optimizers in the checkpoint: https://github.com/speechbrain/speechbrain/blob/41583e09932baa314a92c30f62ed843c4b7b3049/recipes/CommonVoice/ASR/seq2seq/train_with_wav2vec.py#L225

I expect that using the same init_optimizers would solve the problem.
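
For reference, here is a sketch of what that missing block could look like in this recipe’s init_optimizers(); the class and attribute names (adam_opt_class, wav2vec_opt_class, self.modules.wav2vec2) follow the recipes discussed in this thread but should be treated as assumptions rather than a tested patch:

    def init_optimizers(self):
        """Initializes the wav2vec2 optimizer and the main (Adam) optimizer."""
        self.adam_optimizer = self.hparams.adam_opt_class(
            self.hparams.model.parameters()
        )
        self.wav2vec_optimizer = self.hparams.wav2vec_opt_class(
            self.modules.wav2vec2.parameters()
        )

        # The missing piece: register both optimizers with the checkpointer so
        # their state (including the current learning rate) is saved and
        # recovered together with the schedulers.
        if self.checkpointer is not None:
            self.checkpointer.add_recoverable("adam_opt", self.adam_optimizer)
            self.checkpointer.add_recoverable(
                "wav2vec_opt", self.wav2vec_optimizer
            )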

@Adel-Moumen gentle ping after our discussion at ICASSP 😃

Hello Andreas,

Thanks for the reminder, and sorry for the delay in answering. I haven’t had time to test this yet, but I intend to do so in the coming days!

It would be great if you could try to reproduce the same behavior using the recipe. It should be straightforward, and the corpus is only 17h, so it should also be pretty fast. Just run it for ~10ish epochs (30min, 1h max on a v100), kill it, remove the last checkpoint, relaunch and check what happens.

If you reproduce the error, then it’s something on SB’s end; if not, then it’s my mess to figure out. It is true that the main difference between the time when I didn’t have this issue and now is my place of employment. So it could still be an environment issue, but I’m unable to check this.

Hi Andreas,

Sorry! My message was misleading. When I say it doesn’t work, I mean that I started training from scratch for 11 epochs, stopped training, removed the checkpoint for epoch 11 (thus restarting from epoch 10), and relaunched the model. I noticed a BLEU degradation from 5.37 to 4.55 in this case. The difference is smaller (compared to the 14 BLEU points I lose in some other setups), but I think this is mainly because the BLEU was not high in the first place.

btw - with PR 1600, I edited your recipe a bit so we can test it automatically; I hope these changes were ok.

Hi @mzboito

  1. thank you for being so patient in hunting down the possible sources of the error
  2. wtf

Please note that there were some changes we needed to implement for the latest PyTorch versions, e.g. the aforementioned PR 1683.

With this new environment, does the same issue re-occur when re-loading after a few epochs? (maybe running 2-3 epochs is enough; and then loading the 1st epoch’s checkpoint - I hope that’s not too time/GPU-intensive)

edit: I mean with a fresh results folder & fresh pretrained model fetching, etc. (just move the existing data, no need to delete it, I hope)

I feel a bit sorry that I haven’t run a check on my side yet - you gave all the information needed so that we could - there’s been a PR on my desk for a few months which I just want to get done with, or at least sync with the latest develop version before the weekend.

@Gastron has a point.

lr_adam has a 0.5 annealing factor: 3.73e-13 * 0.5 ~ 1.86e-13
lr_wav2vec has a 0.9 annealing factor: 1.29e-09 * 0.9 ~ 1.16e-09 and 1.16e-09 * 0.9 ~ 1.04e-09
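
A quick sanity check of that arithmetic (plain Python, nothing recipe-specific):

lr_adam, lr_wav2vec = 3.73e-13, 1.29e-09

# One annealing step per scheduler, using the configured factors:
print(lr_adam * 0.5)           # ~1.86e-13, matches epoch 96's lr_adam
print(lr_wav2vec * 0.9)        # ~1.16e-09, matches epoch 95's lr_wav2vec
print(lr_wav2vec * 0.9 * 0.9)  # ~1.04e-09, matches epoch 96's lr_wav2vec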

Also, in the train script, as pointed out:

        if stage == sb.Stage.VALID and sb.utils.distributed.if_main_process():
            current_epoch = self.hparams.epoch_counter.current
            old_lr_adam, new_lr_adam = self.hparams.lr_annealing_adam(
                stage_stats["BLEU"]
            )
...
                (
                    old_lr_wav2vec,
                    new_lr_wav2vec,
                ) = self.hparams.lr_annealing_wav2vec(stage_stats["BLEU"])
                sb.nnet.schedulers.update_learning_rate(
                    self.wav2vec_optimizer, new_lr_wav2vec
                )
                self.hparams.train_logger.log_stats(
                    stats_meta={
                        "epoch": current_epoch,
                        "lr_adam": old_lr_adam,
                        "lr_wav2vec": old_lr_wav2vec,
                    },
                    train_stats={"loss": self.train_stats},
                    valid_stats=stage_stats,
                )
...
           self.checkpointer.save_and_keep_only(
                meta=meta, name=name, num_to_keep=10, max_keys=["BLEU"]
            )

This looks like expected behaviour. The logging reports the old LRs, which relate to the old/current performance figures (the valid BLEU is one of them). Then, after reloading the CKPT, the new learning rates are used.

The logging could include both the old and the new learning rates.
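
For example, a small tweak to the log_stats call quoted above would do it (the new_lr_* fields mirror the extended log format shown earlier in this thread; treat this as a sketch, not a tested patch):

                self.hparams.train_logger.log_stats(
                    stats_meta={
                        "epoch": current_epoch,
                        "lr_adam": old_lr_adam,
                        "new_lr_adam": new_lr_adam,
                        "lr_wav2vec": old_lr_wav2vec,
                        "new_lr_wav2vec": new_lr_wav2vec,
                    },
                    train_stats={"loss": self.train_stats},
                    valid_stats=stage_stats,
                )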

As @mzboito reported, this did not happen this way before with other CKPTs. The reason might be, as @Gastron pointed out indirectly, that in previous runs there simply was no such change to observe. In that regard, a change in BLEU is also what you would expect when the learning rates are due for a change. Yet, the max_keys=["BLEU"] mechanism should prevent low-BLEU CKPTs from impacting the overall result.

@mzboito please respond, if your recipe still works, and if this issue can be closed.