pytorch-lightning: Can't reload from checkpoint when using SWA

πŸ› Bug

My model trained and resumed from checkpoints just fine until I tried Stochastic Weight Averaging (SWA).

from pytorch_lightning.callbacks import StochasticWeightAveraging

weighting = StochasticWeightAveraging()
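
Roughly how it is wired up in my script; module and data_module are my LightningModule and LightningDataModule, and the Trainer arguments below are illustrative rather than my exact settings:

import pytorch_lightning as pl

# The only change from the previously working setup is adding the SWA
# callback to the Trainer.
trainer = pl.Trainer(
    max_epochs=20,          # illustrative value
    gpus=1,
    callbacks=[weighting],  # weighting = StochasticWeightAveraging() from above
)

# Resuming from a checkpoint written by an earlier run is what fails.
trainer.fit(module, data_module, ckpt_path="./checkpoints/best-checkpoint.ckpt")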

The error itself is not easy to interpret:

KeyError                                  Traceback (most recent call last)
<ipython-input-20-2d36fa4eaad0> in <module>()
     16 
     17 
---> 18 trainer.fit(module, data_module, ckpt_path="./checkpoints/best-checkpoint.ckpt")
     19 
     20 wandb.finish()

7 frames
/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py in load_state_dict(self, state_dict)
    233         """
    234 
--> 235         lr_lambdas = state_dict.pop('lr_lambdas')
    236         self.__dict__.update(state_dict)
    237         # Restore state_dict keys in order to prevent side effects

KeyError: 'lr_lambdas'
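
As far as I can tell, the checkpoint holds the state of the SWALR scheduler that the SWA callback swapped in, while on resume my configure_optimizers builds a LambdaLR again, so the saved state has no 'lr_lambdas' entry to restore. The mismatch can be reproduced with plain torch, independent of Lightning (a minimal sketch):

import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.optim.swa_utils import SWALR

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# State as it would be saved while SWA is active: no 'lr_lambdas' key.
swa_state = SWALR(opt, swa_lr=0.05).state_dict()

# Restoring it into a freshly built LambdaLR raises the same error.
LambdaLR(opt, lr_lambda=lambda epoch: 1.0).load_state_dict(swa_state)  # KeyError: 'lr_lambdas'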

To Reproduce

https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report/bug_report_model.ipynb

Expected behavior

Training resumes from the checkpoint with the SWA callback enabled.

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.5.9
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12
    • version: #1 SMP Tue Dec 7 09:58:10 PST 2021

cc @tchaton @rohitgr7 @akihironitta @carmocca

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

For the fix, I think we need to create states for this callback that can be stored and reloaded from the checkpoint while resuming the training

This is correct. Saving and loading is not implemented.
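
Something along these lines could work, using the 1.5-era Callback checkpoint hooks; the attribute names below are guesses about the callback's internals, not its actual API:

from pytorch_lightning.callbacks import StochasticWeightAveraging

class StatefulSWA(StochasticWeightAveraging):
    # Illustrative only: persist the SWA state that is currently lost
    # when a run is resumed from a checkpoint.

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Returning a dict stores it under this callback's entry in the checkpoint.
        return {
            "n_averaged": getattr(self, "n_averaged", None),
            "average_model": self._average_model.state_dict()
            if getattr(self, "_average_model", None) is not None
            else None,
        }

    def on_load_checkpoint(self, trainer, pl_module, callback_state):
        # Stash the restored state; the real fix would re-apply it once the
        # averaged model and SWALR scheduler are recreated during setup.
        self._restored_state = callback_state or {}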

Should I change my scheduler in the plModel from LambdaLR to SWALR?

This is done by the callback automatically.

For the fix, I think we need to create states for this callback that can be stored and reloaded from the checkpoint while resuming the training.

Actually, I was going to suggest that, but I don't know what held me back πŸ˜… I will keep the issue open for further investigation (it would be helpful if you could mention other members).

thanks a lot!

Hi! Can I take this issue?