pytorch-lightning: Can't reload from checkpoint when using SWA

πŸ› Bug

My model trained and resumed from checkpoints just fine until I tried Stochastic Weight Averaging (SWA).

from pytorch_lightning.callbacks import StochasticWeightAveraging

weighting = StochasticWeightAveraging()
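
Roughly how it is wired up in my script; module and data_module are my LightningModule and LightningDataModule, and the Trainer arguments below are illustrative rather than my exact settings:

import pytorch_lightning as pl

# The only change from the previously working setup is adding the SWA
# callback to the Trainer.
trainer = pl.Trainer(
    max_epochs=20,          # illustrative value
    gpus=1,
    callbacks=[weighting],  # weighting = StochasticWeightAveraging() from above
)

# Resuming from a checkpoint written by an earlier run is what fails.
trainer.fit(module, data_module, ckpt_path="./checkpoints/best-checkpoint.ckpt")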

The error itself is not easy to interpret:

KeyError                                  Traceback (most recent call last)
<ipython-input-20-2d36fa4eaad0> in <module>()
     16 
     17 
---> 18 trainer.fit(module, data_module, ckpt_path="./checkpoints/best-checkpoint.ckpt")
     19 
     20 wandb.finish()

7 frames
/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py in load_state_dict(self, state_dict)
    233         """
    234 
--> 235         lr_lambdas = state_dict.pop('lr_lambdas')
    236         self.__dict__.update(state_dict)
    237         # Restore state_dict keys in order to prevent side effects

KeyError: 'lr_lambdas'
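
As far as I can tell, the checkpoint holds the state of the SWALR scheduler that the SWA callback swapped in, while on resume my configure_optimizers builds a LambdaLR again, so the saved state has no 'lr_lambdas' entry to restore. The mismatch can be reproduced with plain torch, independent of Lightning (a minimal sketch):

import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.optim.swa_utils import SWALR

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# State as it would be saved while SWA is active: no 'lr_lambdas' key.
swa_state = SWALR(opt, swa_lr=0.05).state_dict()

# Restoring it into a freshly built LambdaLR raises the same error.
LambdaLR(opt, lr_lambda=lambda epoch: 1.0).load_state_dict(swa_state)  # KeyError: 'lr_lambdas'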

To Reproduce

https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report/bug_report_model.ipynb

Expected behavior

Training resumes from the checkpoint with the SWA callback enabled.

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.5.9
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12
    • version: #1 SMP Tue Dec 7 09:58:10 PST 2021

cc @tchaton @rohitgr7 @akihironitta @carmocca

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

For the fix, I think we need to create states for this callback that can be stored and reloaded from the checkpoint while resuming the training

This is correct. Saving and loading is not implemented.
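
Something along these lines could work, using the 1.5-era Callback checkpoint hooks; the attribute names below are guesses about the callback's internals, not its actual API:

from pytorch_lightning.callbacks import StochasticWeightAveraging

class StatefulSWA(StochasticWeightAveraging):
    # Illustrative only: persist the SWA state that is currently lost
    # when a run is resumed from a checkpoint.

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Returning a dict stores it under this callback's entry in the checkpoint.
        return {
            "n_averaged": getattr(self, "n_averaged", None),
            "average_model": self._average_model.state_dict()
            if getattr(self, "_average_model", None) is not None
            else None,
        }

    def on_load_checkpoint(self, trainer, pl_module, callback_state):
        # Stash the restored state; the real fix would re-apply it once the
        # averaged model and SWALR scheduler are recreated during setup.
        self._restored_state = callback_state or {}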

Should I change my scheduler in the plModel from LambdaLR to SWALR?

This is done by the callback automatically.

For the fix, I think we need to create states for this callback that can be stored and reloaded from the checkpoint while resuming the training.

Actually, I was going to suggest that, but I don't know what held me back πŸ˜… I will keep the issue open for further investigation (it would be helpful if you could mention other members).

thanks a lot!

Hi! Can I take this issue?