pytorch-lightning: ReduceLROnPlateau does not recognise val_loss despite progress_bar dict

πŸ› Bug

When training my model, I get the following message:

  File "C:\Users\Luc\Miniconda3\envs\pytorch\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 371, in train
    raise MisconfigurationException(m)
pytorch_lightning.utilities.debugging.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: loss

This is similar to #321, for instance, but I definitely return a progress_bar dict with a val_loss key in it (see code below).

Code sample

    def training_step(self, batch, batch_idx):
        z, y_true = batch
        y_pred = self.forward(z)
        loss_val = self.loss_function(y_pred, y_true)
        return {'loss': loss_val.sqrt()}

    def validation_step(self, batch, batch_idx):
        z, y_true = batch
        # Current learning rate, read from the optimizer (stored as self.optim elsewhere)
        lr = torch.tensor(self.optim.param_groups[0]['lr'])
        y_pred = self.forward(z)
        loss_val = self.loss_function(y_pred, y_true)
        return {'val_loss': loss_val.sqrt(), 'lr': lr}

    def validation_epoch_end(self, outputs):
        # Average the per-batch RMSE and expose it to both the progress bar and the logger
        val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()
        lr = outputs[-1]['lr']
        logs = {'val_loss': val_loss_mean, 'lr': lr}
        return {'val_loss': val_loss_mean, 'progress_bar': logs, 'log': logs}
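
The issue does not show configure_optimizers, so here is a plausible sketch of how it might look; only self.optim (used in validation_step above) and ReduceLROnPlateau (from the error message) come from the issue itself, while the optimizer type and all scheduler arguments are placeholders:

    def configure_optimizers(self):
        # Hypothetical reconstruction: the Adam choice and the
        # mode/factor/patience values are guesses.
        self.optim = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            self.optim, mode='min', factor=0.1, patience=10)
        # Returned bare like this, the scheduler is stepped once per epoch
        # and, per the error above, conditioned on 'val_loss' by default.
        return [self.optim], [scheduler]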

Expected behavior

The val_loss value returned in the progress_bar dict should be picked up by the ReduceLROnPlateau scheduler instead of raising a MisconfigurationException.

Environment

  • PyTorch Version (e.g., 1.0): 1.4.0
  • OS (e.g., Linux): Windows 10
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.6.10
  • CUDA/cuDNN version: 10
  • GPU models and configuration: 1070Ti x 1
  • Any other relevant information:

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

I do not think it is possible out of the box. However, if you configure your scheduler correctly, then it should be possible. For example, if I initialize my Trainer as trainer = Trainer(val_check_interval=50) and initialize my scheduler as

scheduler = {
    'scheduler': ReduceLROnPlateau(optimizer, mode, factor, patience),
    'interval': 'step',
    'frequency': 100
}

it should work (not tested), since val_loss will be created every 50 steps but the scheduler will first be called after 100 steps.
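
Concretely, that suggestion (untested, as noted above; the Adam optimizer and the mode/factor/patience values are placeholders) would look something like:

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = {
            'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(
                optimizer, mode='min', factor=0.1, patience=10),
            'interval': 'step',  # step per batch rather than per epoch
            'frequency': 100,    # but only every 100 batches
        }
        return [optimizer], [scheduler]

Combined with trainer = Trainer(val_check_interval=50), validation runs twice before each scheduler step, so val_loss already exists by the time .step() is called.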

Okay, after looking at your code @alexeykarnachev, this does not seem to be a bug. When you set 'interval': 'step', you are calling ReduceLROnPlateau's .step() method after each batch, so it makes complete sense that no val_loss has been calculated yet. If you really want to do something like this, you need to set val_check_interval in the Trainer constructor to a number lower than frequency in the scheduler configuration. That way, val_loss will be calculated before .step() is called.

@SkafteNicki, btw, the trick:

        if self.trainer.global_step == 0:
            log.update({'Loss/valid': np.inf})

doesn’t help when warm-starting from a checkpoint, because in that case global_step is never equal to 0 πŸ˜ƒ

For now I did it like this:

        # Set up placeholders for valid metrics.
        if not self._valid_metrics_patched:
            log.update({'Loss/valid': np.inf})
            self._valid_metrics_patched = True
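
For context, a minimal sketch of where that patch might live, assuming it sits in the LightningModule's __init__ and training_step; the flag name and the 'Loss/valid' key come from the snippets above, while everything else (the module name, the loss_function/forward helpers) is a placeholder:

    import numpy as np
    import pytorch_lightning as pl

    class WarmStartSafeModule(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # Set once per run; unlike a global_step == 0 check, this
            # also works when resuming training from a checkpoint.
            self._valid_metrics_patched = False

        def training_step(self, batch, batch_idx):
            z, y_true = batch
            loss = self.loss_function(self.forward(z), y_true)
            log = {'Loss/train': loss}
            if not self._valid_metrics_patched:
                # Placeholder so schedulers/loggers find a validation
                # metric before the first validation loop has run.
                log.update({'Loss/valid': np.inf})
                self._valid_metrics_patched = True
            return {'loss': loss, 'log': log}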

@LucFrachon sorry, I must have misinterpreted what you asked about. Yes, PL automatically calls the scheduler you define. Sorry for the confusion.

@alexeykarnachev I think that indicates that there is some bug here, since you need to set up a placeholder for the first step. I will see if I can come up with a more permanent solution.