pytorch-lightning: `lr_finder` fails when called after training for 1 or more epochs

🐛 Bug

Calling lr_finder on the model after trainer.fit() has been called will fail with:

LR finder stopped early due to diverging loss.
Failed to compute suggesting for `lr`. There might not be enough points.

This happens even when min_lr is lowered from its default of 1e-08 to 1e-30.
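The second message follows from the first: once the finder stops early, very few losses have been recorded, so no suggestion can be computed from them. Below is a minimal sketch under stated assumptions. It assumes the finder sweeps the learning rate geometrically from min_lr to max_lr, and that it suggests the learning rate at the steepest loss decrease after dropping the first skip_begin and last skip_end points. The helper names exponential_lrs and suggest_lr, the defaults skip_begin=10 and skip_end=1, and the loss values are all hypothetical. This is not Lightning's actual code.

```python
from typing import List, Optional

import numpy as np


def exponential_lrs(min_lr: float, max_lr: float, num_training: int) -> List[float]:
    """Sweep learning rates geometrically from min_lr to max_lr (assumed schedule).

    Assumes num_training >= 2.
    """
    ratio = max_lr / min_lr
    return [min_lr * ratio ** (i / (num_training - 1)) for i in range(num_training)]


def suggest_lr(lrs: List[float], losses: List[float],
               skip_begin: int = 10, skip_end: int = 1) -> Optional[float]:
    """Return the lr where the loss falls most steeply, or None if too few points."""
    # Only the losses actually recorded are used; an early stop leaves few of them.
    window = np.asarray(losses[skip_begin:len(losses) - skip_end], dtype=float)
    if window.size < 2:
        # np.gradient needs at least two points: the "not enough points" case.
        return None
    steepest = int(np.gradient(window).argmin())
    return lrs[skip_begin + steepest]


lrs = exponential_lrs(1e-8, 1.0, 100)
# Hypothetical run that diverged early: only 3 losses recorded before stopping.
print(suggest_lr(lrs, [0.05, 0.04, 0.30]))  # None
# Hypothetical full run: the loss falls steadily, then blows up at high lr.
losses = [2.0 - 0.015 * i for i in range(80)] + [0.8 + 0.5 * i for i in range(20)]
print(suggest_lr(lrs, losses) is not None)  # True
```

Under this assumed schedule, lowering min_lr to 1e-30 only spreads the same 100 steps over 30 decades instead of 8. Each step is then about 0.3 decades (exponential_lrs(1e-30, 1.0, 100)). That does not give the suggestion any more points once the finder has already stopped after a few batches.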

Reproductions using the BoringModel:

Reproduced using a callback: https://colab.research.google.com/drive/1sbOPs8edyFi_idJNnd6gr3etyv7V57YU?usp=sharing
Reproduced by calling the Trainer twice: https://colab.research.google.com/drive/1WxUvayBBg_163nu8fjv-jsvPk6pUrSrK?usp=sharing

To Reproduce

Add the following callback (as demonstrated with the BoringModel):

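from pytorch_lightning.callbacks import Callback


# Call the learning rate finder at the start of epoch `epoch`, i.e. after `epoch` full epochs of training
class LRFinderXEpoch(Callback):
    def __init__(self, epoch=1):
        super().__init__()
        self.epoch = epoch

    def on_train_epoch_start(self, trainer, pl_module):
        if trainer.current_epoch == self.epoch:
            print("Calling learning rate finder!")
            # tune() runs the LR finder when the Trainer was created with auto_lr_find=True
            trainer.tune(pl_module)
            # Alternatively, call the LR finder directly with a lower min_lr:
            # trainer.tuner.lr_find(pl_module, min_lr=1e-30)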

Expected behavior

The LR finder should still suggest a learning rate after the model has already been trained for a few epochs (e.g. in transfer learning).

Environment

* CUDA:
	- GPU:
		- Tesla T4
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.18.5
	- pyTorch_debug:     True
	- pyTorch_version:   1.7.0+cu101
	- pytorch-lightning: 1.0.8
	- tqdm:              4.41.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
	- processor:         x86_64
	- python:            3.6.9
	- version:           #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

Issue came from the following discussion: https://forums.pytorchlightning.ai/t/train-2-epochs-head-unfreeze-learning-rate-finder-continue-training-fit-one-cycle/366/4

Potentially related issues:

About this issue

State: open
Created 4 years ago
Comments: 15 (6 by maintainers)

Most upvoted comments

For my project, I found that setting early_stop_threshold=None fixes this. Thanks to @NumesSanguis for the suggestion.


jbohnslav on Dec 10, 2020

The early stopping threshold of the learning rate finder seems to be too small. On a standard image classification task, I regularly see the check at https://github.com/PyTorchLightning/pytorch-lightning/blob/f2fa3c82567ade0aa5bea3aa298542fc5040e4b7/pytorch_lightning/tuner/lr_finder.py#L419 trigger right after the first batch.

maxjeblick on Nov 25, 2020
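Here is a minimal sketch of the kind of divergence check discussed above. It assumes the finder compares a bias-corrected, exponentially smoothed loss (beta=0.98, assumed) against early_stop_threshold times the best smoothed loss seen so far. The function name steps_before_stop and all loss values are hypothetical and illustrative, not measured. The example shows how one noisy batch after a very low starting loss, as an already-trained model might produce, can exceed 4x the best loss immediately.

```python
from typing import List, Optional


def steps_before_stop(batch_losses: List[float],
                      early_stop_threshold: Optional[float] = 4.0,
                      beta: float = 0.98) -> int:
    """Count finder steps taken before the divergence check fires (assumed logic)."""
    avg_loss = 0.0
    best_loss = float("inf")
    for step, loss in enumerate(batch_losses, start=1):
        # Bias-corrected exponential moving average of the batch loss.
        avg_loss = beta * avg_loss + (1 - beta) * loss
        smoothed = avg_loss / (1 - beta ** step)
        # With early_stop_threshold=None the divergence check is skipped entirely.
        if (early_stop_threshold is not None and step > 1
                and smoothed > early_stop_threshold * best_loss):
            return step  # the finder would stop here
        if smoothed < best_loss:
            best_loss = smoothed
    return len(batch_losses)


# Hypothetical already-trained model: very low loss, then one noisy batch.
trained = [0.01, 0.2] + [0.02] * 98
print(steps_before_stop(trained))  # 2
print(steps_before_stop(trained, early_stop_threshold=None))  # 100
```

In this sketch, the smoothed loss after the second batch is about 0.106, above 4 x 0.01. Passing early_stop_threshold=None lets all steps run, which matches the workaround reported earlier in this thread.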