pytorch-lightning: ReduceLROnPlateau does not recognise val_loss despite progress_bar dict
🐛 Bug
When training my model, I get the following message:
File "C:\Users\Luc\Miniconda3\envs\pytorch\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 371, in train
raise MisconfigurationException(m)
pytorch_lightning.utilities.debugging.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: loss
This is similar to #321, for instance, but I definitely return a progress_bar dict with a val_loss key in it (see code below).
Code sample
def training_step(self, batch, batch_idx):
    z, y_true = batch
    y_pred = self.forward(z)
    loss_val = self.loss_function(y_pred, y_true)
    return {'loss': loss_val.sqrt()}

def validation_step(self, batch, batch_idx):
    z, y_true = batch
    lr = torch.tensor(self.optim.param_groups[0]['lr'])
    y_pred = self.forward(z)
    loss_val = self.loss_function(y_pred, y_true)
    return {'val_loss': loss_val.sqrt(), 'lr': lr}

def validation_epoch_end(self, outputs):
    val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()
    lr = outputs[-1]['lr']
    logs = {'val_loss': val_loss_mean, 'lr': lr}
    return {'val_loss': val_loss_mean, 'progress_bar': logs, 'log': logs}
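The scheduler itself is not shown in the report; it was presumably wired up in configure_optimizers roughly as below. This is a minimal sketch: the Adam optimizer, its hyperparameters, and the bare-scheduler return format are assumptions, relying on Lightning of this era conditioning a plain ReduceLROnPlateau on val_loss by default.

def configure_optimizers(self):
    # Assumed setup (not in the original report): keep a handle on the
    # optimizer so validation_step can read its learning rate, and let
    # Lightning condition the bare ReduceLROnPlateau on 'val_loss'.
    self.optim = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(self.optim)
    return [self.optim], [scheduler]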
Expected behavior
The val_loss value returned in the progress_bar dict should be picked up and made available to the ReduceLROnPlateau scheduler.
Environment
- PyTorch Version (e.g., 1.0): 1.4.0
- OS (e.g., Linux): Windows 10
- How you installed PyTorch (conda, pip, source): pip
- Python version: 3.6.10
- CUDA/cuDNN version: 10
- GPU models and configuration: 1070Ti x 1
- Any other relevant information:
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (12 by maintainers)
I do not think it is possible just out of the box. However, if you configure your scheduler correctly, then it should be possible. For example, if I initialize my Trainer as trainer = Trainer(val_check_interval=50) and initialize my scheduler with 'interval': 'step' and a frequency of 100, it should work (not tested), since val_loss will be created every 50 steps but the scheduler will first be called after 100 steps.
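A sketch of that configuration (untested, as the comment says; only the val_check_interval, 'interval', and frequency values come from the thread, while the optimizer and scheduler construction are assumptions):

# Validate every 50 training steps...
trainer = Trainer(val_check_interval=50)

# ...but only step the scheduler every 100 steps, so a val_loss
# already exists by the time ReduceLROnPlateau first reads it.
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    return [optimizer], [{'scheduler': scheduler, 'interval': 'step', 'frequency': 100}]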
Okay, after looking at your code @alexeykarnachev, this does not seem to be a bug. When you set 'interval': 'step' you are calling the .step() method of ReduceLROnPlateau after each batch, and it therefore makes complete sense that no val_loss has been calculated yet. If you really want to do something like this, you need to set val_check_interval in the Trainer construction to a number lower than frequency in the scheduler construction. In this way val_loss will be calculated before .step() is called.

@SkafteNicki, btw, the trick:
doesn't help in case of warm start from checkpoint, because in warm start, the global_step is never equal to 0.

For now I did it like this:
@LucFrachon sorry, I must have misinterpreted what you asked about. Yes, PL automatically calls the scheduler you define. Sorry for the confusion.
@alexeykarnachev I think that indicates that there is a bug here, since you need to set up a placeholder for the first step. I will see if I can come up with a more permanent solution.
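The placeholder trick being discussed might look something like the following (a reconstruction, not the actual snippet from the thread; the on_train_start hook and writing into trainer.callback_metrics are assumptions about the Lightning internals of that era):

def on_train_start(self):
    # Hypothetical placeholder: seed the monitored metric with a dummy
    # value so ReduceLROnPlateau has something to read before any
    # validation has run.
    if self.trainer.global_step == 0:
        self.trainer.callback_metrics['val_loss'] = torch.tensor(float('inf'))

This also makes the warm-start complaint above concrete: after resuming from a checkpoint, global_step is never 0, so the guard never fires and the placeholder is never set.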