pytorch-lr-finder: LR Finder doesn't restore original model weights?
Hey! I love this repo, thanks for making it 💯
Everything works well except for one thing. After some digging around and experimenting, here's what I've found:
Below are some figures for the training loss and training accuracy (on MNIST, using a resnet18).
Problem:
- Using LRFinder on a model and then training with it afterwards appears to hurt the model's learning (see the pink curve below).
Solutions:
- Using LRFinder on a model and manually restoring the weights afterwards appears to train the model optimally (see the green curve below).
- Using LRFinder on a clone of the model and then using the original model for training appears to train the model optimally (see the green curve below).
Regarding the graphs below: both models used the same hyperparameters.
An in-code example of option 1) would be similar to what was given in the README.md:
from torch_lr_finder import LRFinder
model = ...
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-7, weight_decay=1e-2)
lr_finder = LRFinder(model, optimizer, criterion, device="cuda")
lr_finder.range_test(trainloader, end_lr=100, num_iter=100)
lr_finder.plot()
# Then use "model" for training
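An in-code example of option 2) (manually restoring the weights) could look like the sketch below; saving a deep copy of the state dict before the range test is the only addition on top of the README example, and the name original_state is just illustrative:
import copy
from torch_lr_finder import LRFinder
model = ...
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-7, weight_decay=1e-2)
# save a copy of the original weights before running the range test
original_state = copy.deepcopy(model.state_dict())
lr_finder = LRFinder(model, optimizer, criterion, device="cuda")
lr_finder.range_test(trainloader, end_lr=100, num_iter=100)
lr_finder.plot()
# manually restore the original weights, then use "model" for training
model.load_state_dict(original_state)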
An in-code example of option 3) would be:
from torch_lr_finder import LRFinder
model = ...
temp_model = ...  # a model with the same architecture as "model"
# copy the weights over
temp_model.load_state_dict(model.state_dict())
criterion = nn.CrossEntropyLoss()
# note: the optimizer should wrap temp_model's parameters so the range test only touches the clone
optimizer = optim.Adam(temp_model.parameters(), lr=1e-7, weight_decay=1e-2)
# use the temp model in lr_finder
lr_finder = LRFinder(temp_model, optimizer, criterion, device="cuda")
lr_finder.range_test(trainloader, end_lr=100, num_iter=100)
lr_finder.plot()

So I ran some experiments too; check out my project page: Optimizer Benchmarks.
The Jupyter notebooks are in the GitHub repo, and you can view them with the built-in notebook viewer!
Main conclusion from the project page:
Resetting the optimizer's state and re-using the same optimizer both achieve very similar performance, i.e. discarding an optimizer's state didn't really hurt the model's performance, with or without an LR Scheduler. Note: these conclusions are based on the Adam optimizer and the OneCycle LR Scheduler; I haven't experimented with other optimizers to see if dropping their state is more impactful.
Oh nice!
Let me know if you’re looking for any help, I’d be more than happy to contribute on any part of this lovely repo! 😃
Thanks for all the info 😃
And no worries, it's not too bad to reset the optimizer. As for it being a hassle because different optimizers have different hyperparameters, here is how I would get around it:
After plotting lr find, we can find a new optimal lr and put it into the optimizer currently being used for training, without changing its state:
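Something along these lines (just a sketch; optimizer is the one already used for training, and new_lr and fresh_optimizer are illustrative names):
# overwrite the learning rate in place; the optimizer's state is untouched
for param_group in optimizer.param_groups:
    param_group["lr"] = new_lr
# or build the state dict manually, taking the state from the training optimizer
# and the param_groups (hyperparameters) from a freshly constructed optimizer
fresh_optimizer = optim.Adam(model.parameters(), lr=new_lr, weight_decay=1e-2)
optimizer.load_state_dict({
    "state": optimizer.state_dict()["state"],
    "param_groups": fresh_optimizer.state_dict()["param_groups"],
})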
Note: an optimizer's state dict has a state entry and a param_groups entry. Making the dictionaries manually lets you control which optimizer to take the state from, and which optimizer to take the (default) hyperparameters from (which, for LR Find, need to include a low learning rate).
Thanks again for all your help!