pytorch-lightning: After resuming training, scheduler.step() will not update the optimizer's learning rate
I found a bug: when I resume training from a checkpoint, the learning rate always equals the `init_lr` I set. After debugging, I found that `scheduler.step()` does not change the optimizer's learning rate, so I set it manually to work around the bug:
```python
def on_epoch_start(self) -> None:
    # copy the scheduler's current LR back into the optimizer by hand
    self.optimizers().param_groups[0]['lr'] = self.lr_schedulers().get_lr()[0]
```
cc @borda
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 4
- Comments: 20 (4 by maintainers)
I had the same issue and looked into it a bit. It turns out that by default `self.optimizers()` returns from `trainer.strategy._lightning_optimizers`, and `LightningOptimizer` maintains a copy of the `param_groups` field. The parameters are all stored as references to the actual parameters, but the learning rate is not. This behaviour traces back to `load_state_dict` of the PyTorch optimizer, which overwrites the `param_groups` list with a list from the state dict, plugging only the 'params' values back in. At that point the copy of `param_groups` maintained by `LightningOptimizer` is no longer kept up to date.
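The mechanism above can be reproduced with plain PyTorch alone, no Lightning needed (a minimal sketch; `stale_groups` is my stand-in for the copy that `LightningOptimizer` keeps):

```python
import torch

# A reference to the optimizer's param_groups list taken *before* a restore,
# playing the role of the copy LightningOptimizer holds on to.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=1e-4)
stale_groups = opt.param_groups

# load_state_dict rebuilds param_groups from the saved state (re-inserting
# only the 'params' entries), so opt.param_groups becomes a new list ...
opt.load_state_dict(opt.state_dict())

# ... and a scheduler-style LR update after the restore is invisible
# through the stale reference.
opt.param_groups[0]["lr"] = 5e-5
print(stale_groups[0]["lr"])      # still 1e-4
print(opt.param_groups[0]["lr"])  # 5e-5
```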
I think a simple solution would be for the strategy to create/update its `_lightning_optimizers` after a restore from checkpoint. As a user, you can call `self.optimizers(use_pl_optimizer=False).param_groups[0]['lr']` instead to fix the issue for now, though I don't know whether bypassing the `LightningOptimizer` wrapper has side effects under the various training strategies. Little example: after a `fit()` which restored from a checkpoint, it looks like this (with an LR of 1e-4, and a scheduler starting at factor 1e-3):
same issue for me as well
I have checked that the scheduler and the optimizer have different learning rates. The scheduler's learning rate is correct, but the optimizer's learning rate is not updated by the scheduler.
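This divergence is easy to reproduce in plain PyTorch: after a `load_state_dict()`, the scheduler keeps reporting the right LR, while any reference taken from `param_groups` before the restore (here `stale_view`, my stand-in for the wrapper's copy) stops updating:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=0.1)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.5)

stale_view = opt.param_groups          # view taken before the "restore"
opt.load_state_dict(opt.state_dict())  # replaces opt.param_groups

opt.step()
sched.step()
print(sched.get_last_lr()[0])  # 0.05 -- the scheduler's LR is correct
print(stale_view[0]["lr"])     # 0.1  -- the pre-restore view never updates
```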
[Reply sent by email] In reply to Rohit's comment of April 20, 2022:
Did you check the actual learning rate here?

`self.optimizers().param_groups[0]['lr']`

Since while resuming, the optimizer's state is also restored, which includes the learning rate as well.
Any updates on this issue?
I ran into this issue as well, and making the two `param_groups` point to the same object seems to have fixed it for me. I would appreciate it if someone could comment on whether there are any pitfalls with this approach.
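The snippet from that comment isn't shown above, but the re-aliasing idea it describes can be sketched in plain PyTorch (hypothetical names; `wrapper_groups` stands in for the `LightningOptimizer` copy):

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=1e-4)
wrapper_groups = opt.param_groups      # stand-in for the wrapper's copy

opt.load_state_dict(opt.state_dict())  # checkpoint restore replaces the list

# The fix: point the copy at the optimizer's (new) param_groups again, so
# subsequent scheduler updates are visible through both references.
wrapper_groups = opt.param_groups

opt.param_groups[0]["lr"] = 5e-5       # what scheduler.step() would write
print(wrapper_groups[0]["lr"])         # 5e-5 -- both views agree again
```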
Same issue here. The ability to manually adjust the learning rate seems pretty key for me, especially for long-running jobs.
It seems that in `pytorch_lightning.core.optimizer` the strategy is passed `_optimizer` with the correctly loaded learning rate, so training should not be affected by the resume as long as all changes to the learning rate happen through the scheduler rather than manually. Still, it would be nice to have a fix for this.
`pytorch_lightning.core.optimizer`, line 169: `step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)`
Very same issue here.