pytorch-lightning: LR finder broken 2: not sure why (and other tiny bugs)
🐛 Bug
LR finder doesn't seem to work. The model doesn't train while trainer.lr_find(model) is running (the loss metric oscillates around its initial value). Looking at the figure from lr_finder.plot(), I suspected the learning rate wasn't being changed, but internally it is. So I rebuilt a custom LR finder to rule out the rest of my code. It seems lr_find is broken, but I'm not sure why, since the implementation is too complex for me to debug. People might get wrong results if they don't check lr_finder.plot() before using lr_finder.suggestion().
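For reference, one quick way to confirm that the lr really is being swept internally is to read it back from the finder object returned by trainer.lr_find (trainer and model here are the ones from the repo script). This assumes lr_finder.results is a dict holding the swept 'lr' and 'loss' values that lr_finder.plot() draws, so treat the exact keys as an assumption for this version:
# sanity check: inspect the lrs that lr_find claims to have used
lr_finder = trainer.lr_find(model)
swept_lrs = lr_finder.results['lr']   # assumed key, same data lr_finder.plot() uses
print(f'lr swept from {swept_lrs[0]:.2e} to {swept_lrs[-1]:.2e} '
      f'over {len(swept_lrs)} steps')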
To Reproduce
Steps to reproduce the behavior:
- Clone this test repository
- Run the corresponding script (run.bat or run.sh)
- Compare the plot results for the LR finder and the custom LR finder (lr_finder.png and custom_lr_finder.png)
Edit: I made a new branch called lr_bug, so please refer to that code instead
Code sample
The sample code is available on this repo. It trains ResNet-s on CIFAR10 with 10% of the train/val set for 10 epochs with initial learning_rate=1e-5 and end_lr=1.
Following is a stripped-down version of it:
# -----------------------------
# 3 FIND INITIAL LEARNING RATE
# -----------------------------
# Run learning rate finder
lr_finder = trainer.lr_find(
    model,
    num_training=hparams.epochs * model.batches_per_epoch,
    min_lr=hparams.learning_rate,
    mode='exponential')
# Plot
import matplotlib.pyplot as plt
fig = lr_finder.plot(suggest=True)
fig.tight_layout()
fig.savefig('lr_finder.png', dpi=300, format='png')
# Pick point based on plot, or get suggestion
new_lr = lr_finder.suggestion()
# -------------------------------------
# 4 FIND INITIAL LEARNING RATE (CUSTOM)
# -------------------------------------
# the scheduler is already configured as a LR sweeper
trainer.fit(model)
# get metrics from a custom CSV logger callback
metrics = trainer.callbacks[1].batch_metrics
loss = metrics['loss'].values
# Same as lr_finder.suggestion(), but with a moving average filter
index, lrs, loss = lr_suggestion(metrics, model.batches_per_epoch)
custom_lr = lrs[index]
# Plot
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(metrics['lr'], metrics['loss'], ':', label='Per Batch')
ax.plot(lrs, loss, label='Filtered ("Per Epoch")')
ax.plot(lrs[index], loss[index], 'ro', label='Suggestion')
ax.set_xscale('log')
ax.set_xlabel('Learning Rate')
ax.set_ylabel('Loss')
ax.legend()
fig.tight_layout()
fig.savefig('custom_lr_finder.png', dpi=300, format='png')
The "custom" learning rate finder is supposed to replicate lr_finder: it uses the same scheduler (lr_finder._ExponentialLR) plus a custom CSV logger callback that logs the lr collected from inside the training loop:
from collections import OrderedDict
from torch import optim
# _ExponentialLR lives in pytorch_lightning/trainer/lr_finder.py in this version
from pytorch_lightning.trainer.lr_finder import _ExponentialLR

def training_step(self, batch, batch_idx):
    # forward pass
    x, y = batch
    y_hat = self.forward(x)
    # calculate loss
    loss_val = self.loss(y_hat, y)
    # acc
    acc = ...
    # lr of the single configured scheduler, as set for the current step
    lr = self.trainer.lr_schedulers[0]['scheduler']._last_lr[0]
    tqdm_dict = {'acc': acc, 'lr': lr}
    output = OrderedDict({
        'loss': loss_val,
        'progress_bar': tqdm_dict,
        'log': tqdm_dict
    })
    # can also return just a scalar instead of a dict (return loss_val)
    return output

def configure_optimizers(self):
    optimizer = optim.SGD(self.parameters(),
                          self.hparams.learning_rate,
                          momentum=self.hparams.momentum,
                          weight_decay=self.hparams.weight_decay)
    customlr = _ExponentialLR
    # customlr = _LinearLR
    clr = customlr(
        optimizer,
        end_lr=1,
        num_iter=self.hparams.epochs * self.batches_per_epoch,
        last_epoch=-1
    )
    scheduler = dict(scheduler=clr,
                     interval='step')
    return [optimizer], [scheduler]
When calculating the learning rate suggestion, a moving average filter of size batches_per_epoch is applied. This prevents np.gradient() from amplifying the noise and returning a wrong result because of a "lucky batch". scipy.signal.filtfilt is used so the filter does not introduce a delay (phase shift) in the loss array. I removed the line with loss = loss[np.isfinite(loss)] for simplicity (and because of a potential bug when loss contains NaNs).
import numpy as np
from scipy import signal

def lr_suggestion(metrics, filter_size=100, skip_begin=10, skip_end=1):
    loss = metrics['loss'].values
    lrs = metrics['lr'].values
    # if loss had any NaN values, lrs.size != loss.size,
    # which would result in the wrong index for lrs,
    # so this code assumes there are no NaNs in loss
    # loss = loss[np.isfinite(loss)]
    # Zero-phase moving average before taking the "gradient"
    coef = np.ones(filter_size) / filter_size
    loss = signal.filtfilt(coef, 1, loss)
    # steepest descent of the filtered loss gives the suggested index
    index = np.gradient(loss[skip_begin:-skip_end]).argmin() + skip_begin
    return index, lrs, loss
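A minimal way to exercise lr_suggestion outside a training run is to feed it a synthetic metrics DataFrame with the same 'lr'/'loss' columns the custom CSV logger produces; the shape of the fake loss curve and its parameters below are only illustrative:
import numpy as np
import pandas as pd

# fake exponential sweep from 1e-5 to 1 with a typical "dip then flatten" loss curve
lrs = np.logspace(-5, 0, 1000)
loss = 2.3 - 1.5 * np.exp(-((np.log10(lrs) + 2.0) ** 2))   # dip around lr = 1e-2
loss += 0.05 * np.random.randn(lrs.size)                    # batch-level noise
metrics = pd.DataFrame({'lr': lrs, 'loss': loss})

index, lrs_out, filtered = lr_suggestion(metrics, filter_size=100)
print(f'suggested lr: {lrs_out[index]:.2e}')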
Expected behavior
LR finder plot results (not expected):

Custom LR finder (blue line is the expected behavior):
Environment
- CUDA:
  - GPU: GeForce GTX 950M
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.1
  - pyTorch_debug: False
  - pyTorch_version: 1.6.0
  - pytorch-lightning: 0.8.5
  - tensorboard: 2.2.1
  - tqdm: 4.47.0
- System:
  - OS: Windows
  - architecture: 64bit, WindowsPE
  - processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
  - python: 3.7.7
  - version: 10.0.19041
Additional context
PS: When debugging this problem, I noticed that LearningRateLogger only supports 'steps' and 'epoch' as an interval, not logging the lr when interval == 'batch'. The sample code has a simple fix which changes 2 lines of code (L68 and L82) to latest_stat = self._extract_lr(trainer, ['step', 'batch']) and if scheduler['interval'] in interval:, respectively.
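For clarity, here is a sketch of what that two-line fix looks like in context. Only the two quoted lines come from the report; the class and method structure, helper names, and logging keys below are a simplified stand-in written for illustration, not the actual LearningRateLogger source:
from pytorch_lightning.callbacks import Callback

class PatchedLearningRateLogger(Callback):
    def __init__(self):
        self.lrs = {}

    def on_batch_start(self, trainer, pl_module):
        # fix for L68: also extract lrs from schedulers stepped per batch
        latest_stat = self._extract_lr(trainer, ['step', 'batch'])
        if trainer.logger is not None and latest_stat:
            trainer.logger.log_metrics(latest_stat, step=trainer.global_step)

    def _extract_lr(self, trainer, interval):
        latest_stat = {}
        for i, scheduler in enumerate(trainer.lr_schedulers):
            # fix for L82: membership test, so a list of intervals is accepted
            if scheduler['interval'] in interval:
                for j, pg in enumerate(scheduler['scheduler'].optimizer.param_groups):
                    key = f'lr-scheduler-{i}/pg{j}'        # illustrative key name
                    self.lrs.setdefault(key, []).append(pg['lr'])
                    latest_stat[key] = pg['lr']
        return latest_stat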
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (8 by maintainers)
IMPORTANT: Please update the documentation with a warning about this feature
It took too much time to make sure this wasn't a mistake on my part, and I feel like people probably won't notice it's broken if they use it in a production setting. The only way to check is to look at lr_finder.plot(), but even then the problem might not be obvious. Since this is a "stable" feature, it's expected to work out of the box. I think this makes it necessary to warn users while the breaking bug is not fixed.
PS: You can find one of the possible "breaking lines" in pytorch_lightning/trainer/lr_finder.py#L181.