pytorch-lightning: After resuming training, scheduler.step() will not update the optimizer's learning rate
I found a bug: when I resume training from a checkpoint, the learning rate always equals the `init_lr` I set. After debugging, I found that `scheduler.step()` does not change the optimizer's learning rate, so I set it manually to work around the bug:
```python
def on_epoch_start(self) -> None:
    # copy the scheduler's current LR back into the optimizer by hand
    self.optimizers().param_groups[0]['lr'] = self.lr_schedulers().get_lr()[0]
```
cc @borda
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 4
- Comments: 20 (4 by maintainers)
I had the same issue and looked into it a bit. It turns out that by default `self.optimizers()` returns from `trainer.strategy._lightning_optimizers`, and `LightningOptimizer` maintains a copy of the `param_groups` field. The parameters are all stored as references to the actual parameters, but the learning rate is not. This behaviour traces back to `load_state_dict` of the PyTorch optimizer, which overwrites the `param_groups` list with a list from the state dict, plugging only the 'params' values back in. At that point the copy of `param_groups` maintained by `LightningOptimizer` is no longer kept up to date.
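The mechanism above can be reproduced with plain PyTorch alone, no Lightning needed (a minimal sketch; `stale_groups` is my stand-in for the copy that `LightningOptimizer` keeps):

```python
import torch

# A reference to the optimizer's param_groups list taken *before* a restore,
# playing the role of the copy LightningOptimizer holds on to.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=1e-4)
stale_groups = opt.param_groups

# load_state_dict rebuilds param_groups from the saved state (re-inserting
# only the 'params' entries), so opt.param_groups becomes a new list ...
opt.load_state_dict(opt.state_dict())

# ... and a scheduler-style LR update after the restore is invisible
# through the stale reference.
opt.param_groups[0]["lr"] = 5e-5
print(stale_groups[0]["lr"])      # still 1e-4
print(opt.param_groups[0]["lr"])  # 5e-5
```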
I think a simple solution would be for the strategy to create/update its `_lightning_optimizers` after a restore from checkpoint. As a user, you can call `self.optimizers(use_pl_optimizer=False).param_groups[0]['lr']` instead to fix the issue for now, though I don't know whether bypassing the `LightningOptimizer` wrapper has side effects under the various training strategies. Little example: after a `fit()` which restored from a checkpoint, it looks like this (with an LR of 1e-4, and a scheduler starting at factor 1e-3):
same issue for me as well
I have checked that the scheduler and the optimizer have different learning rates. The scheduler's learning rate is correct, but the optimizer's learning rate is not updated by the scheduler.
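This divergence is easy to reproduce in plain PyTorch: after a `load_state_dict()`, the scheduler keeps reporting the right LR, while any reference taken from `param_groups` before the restore (here `stale_view`, my stand-in for the wrapper's copy) stops updating:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=0.1)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.5)

stale_view = opt.param_groups          # view taken before the "restore"
opt.load_state_dict(opt.state_dict())  # replaces opt.param_groups

opt.step()
sched.step()
print(sched.get_last_lr()[0])  # 0.05 -- the scheduler's LR is correct
print(stale_view[0]["lr"])     # 0.1  -- the pre-restore view never updates
```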
[Reply sent by email] In reply to Rohit's comment of April 20, 2022:
Did you check the actual learning rate here?

`self.optimizers().param_groups[0]['lr']`

Since while resuming, the optimizer's state is also restored, which includes the learning rate as well.
Any updates on this issue?
I ran into this issue as well, and making the two `param_groups` point to the same object seems to have fixed it for me. I would appreciate it if someone could comment on whether there are any pitfalls with this approach.
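The snippet from that comment isn't shown above, but the re-aliasing idea it describes can be sketched in plain PyTorch (hypothetical names; `wrapper_groups` stands in for the `LightningOptimizer` copy):

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=1e-4)
wrapper_groups = opt.param_groups      # stand-in for the wrapper's copy

opt.load_state_dict(opt.state_dict())  # checkpoint restore replaces the list

# The fix: point the copy at the optimizer's (new) param_groups again, so
# subsequent scheduler updates are visible through both references.
wrapper_groups = opt.param_groups

opt.param_groups[0]["lr"] = 5e-5       # what scheduler.step() would write
print(wrapper_groups[0]["lr"])         # 5e-5 -- both views agree again
```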
Same issue here. The ability to manually adjust the learning rate seems pretty key for me, especially for long-running jobs.
It seems that in `pytorch_lightning.core.optimizer` the strategy is passed `_optimizer` with the correctly loaded learning rate, so training should not be affected by the resume as long as all changes to the learning rate happen through the scheduler rather than manually. Still, it would be nice to have a fix for this.
`pytorch_lightning.core.optimizer`, line 169: `step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)`
Very same issue here.