pytorch-lr-finder: LR Finder doesn't restore original model weights?

Hey! I love this repo, thanks for making it 💯

Everything works well except for one thing; after some digging around and experimenting, here's what I've found:

Below are some figures for the training loss and training accuracy (on MNIST, using a resnet18).

Problem:

  1. Using LRFinder on a model and then training with that same model afterwards appears to hurt the model's learning (see pink curve below).

Solution:

  1. Using LRFinder on a model and manually restoring the weights afterwards appears to train the model optimally (see green curve below; a minimal sketch is given after the README example).
  2. Using LRFinder on a clone of the model, and then using the original model for training, appears to train the model optimally (see green curve below).

Regarding the figure below: both runs used the same hyperparameters.

An in-code example of the problematic approach would be similar to what is given in the README.md:

import torch.nn as nn
import torch.optim as optim
from torch_lr_finder import LRFinder

model = ...
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-7, weight_decay=1e-2)
lr_finder = LRFinder(model, optimizer, criterion, device="cuda")
lr_finder.range_test(trainloader, end_lr=100, num_iter=100)
lr_finder.plot()

# Then use `model` for training
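
For completeness, a minimal sketch of Solution 1 (manually restoring the weights) could look like the following; the copy.deepcopy snapshot is my own addition, not something the library does for you:

import copy

from torch_lr_finder import LRFinder

model = ...
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-7, weight_decay=1e-2)

# Snapshot the original weights before running the range test
initial_state = copy.deepcopy(model.state_dict())

lr_finder = LRFinder(model, optimizer, criterion, device="cuda")
lr_finder.range_test(trainloader, end_lr=100, num_iter=100)
lr_finder.plot()

# Manually restore the original weights
model.load_state_dict(initial_state)

# Then use `model` for training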

An in-code example of Solution 2 (running LRFinder on a clone) would be:

from torch_lr_finder import LRFinder

model = ...
temp_model = ...  # create a model with the same architecture

# Copy the weights over
temp_model.load_state_dict(model.state_dict())

criterion = nn.CrossEntropyLoss()
# Note: the optimizer must be built over temp_model's parameters
optimizer = optim.Adam(temp_model.parameters(), lr=1e-7, weight_decay=1e-2)

# Use the temp model in lr_finder
lr_finder = LRFinder(temp_model, optimizer, criterion, device="cuda")
lr_finder.range_test(trainloader, end_lr=100, num_iter=100)
lr_finder.plot()

# Then use the untouched `model` for training

[Figure: training loss and training accuracy on MNIST with a resnet18; pink = model reused after LRFinder ran on it, green = weights restored or clone used for LRFinder]

Most upvoted comments

So I ran some experiments too; check out my project page: Optimizer Benchmarks

The Jupyter notebooks are in the GitHub repo; you can view them with the built-in notebook viewer!

Main conclusions from the project page:

  • OneCycle LR > Constant LR
  • Making a new optimizer vs. preserving state and re-using the same optimizer achieve very similar performance, i.e. discarding an optimizer's state didn't really hurt the model's performance, with or without an LR scheduler.

Note: these conclusions are based on the Adam optimizer and the OneCycle LR scheduler (a minimal setup sketch follows). I haven't experimented with other optimizers to see if dropping their state is more impactful.
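
For reference, a minimal sketch of the OneCycle setup used (the max_lr and num_epochs values here are illustrative, not the ones from my experiments):

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,  # illustrative; pick this from the LR finder plot
    total_steps=len(trainloader) * num_epochs,
)

for epoch in range(num_epochs):
    for inputs, targets in trainloader:
        ...  # forward pass, loss, backward pass
        optimizer.step()
        scheduler.step()  # OneCycle steps once per batch, not per epoch
        optimizer.zero_grad()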

Oh nice!

Let me know if you’re looking for any help, I’d be more than happy to contribute on any part of this lovely repo! 😃

Thanks for all the info 😃

And no worries, it's not too bad to reset the optimizer. As for the concern that this is troublesome because different optimizers have different hyperparameters, here is how I would get around it:

optim_for_lr_find = ...   # any optimizer, with any settings
optim_for_training = ...  # the optimizer being used for training

temp_dict = {}
temp_dict['state'] = optim_for_training.state_dict()['state']
temp_dict['param_groups'] = optim_for_lr_find.state_dict()['param_groups']

optim_for_lr_find.load_state_dict(temp_dict)

# Run your LR find with optim_for_lr_find

After plotting the LR find results, we can pick a new optimal lr and put it in the optimizer being used for training, without changing its state:

optim_with_new_lr = ...   # an optimizer constructed with the optimal lr
optim_for_training = ...  # the optimizer being used for training

temp_dict = {}
temp_dict['state'] = optim_for_training.state_dict()['state']
temp_dict['param_groups'] = optim_with_new_lr.state_dict()['param_groups']

optim_for_training.load_state_dict(temp_dict)

# Train with optim_for_training

Note: an optimizer's state dict has a 'state' entry and a 'param_groups' entry. Building the dictionary manually lets you control which optimizer to take the state from, and which optimizer to take the (default) hyperparameters from (we need a low learning rate for the LR find).
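
To make that concrete, here is what the structure looks like on a real torch optimizer (the keys are standard; the lr value is whatever the optimizer was constructed with):

sd = optim_for_training.state_dict()
print(sd.keys())                    # dict_keys(['state', 'param_groups'])
print(sd['param_groups'][0]['lr'])  # the current learning rate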

Thanks again for all your help!