pytorch-lightning: Validation Loss: Metric not available

šŸ› Bug

The metric val_loss is not found for ReduceLROnPlateau or for the progress bar display, but calling print(val_loss) in validation_step and validation_epoch_end works fine (it displays Tensor(value)).

Code sample

class MyModel(pl.LightningModule):    
    def __init__(self, train_df, val_df, test_df, hparams = Namespace(lr = 0.02)):
        # Initialization
        super(MyModel, self).__init__()
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df
        self.hparams = hparams
        
        # Model Structure
        backbone = models.resnet18(pretrained=False)
        self.features_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
        self.fc = torch.nn.Sequential(*[
            torch.nn.Linear(backbone.fc.in_features, 256, bias=True),
            torch.nn.Linear(256, 32, bias=True),
            torch.nn.Linear(32, 4, bias=True)
        ])
        
        # Loss (note: `weight` is a class-weight tensor defined elsewhere in the reporter's script)
        self._loss = torch.nn.CrossEntropyLoss(weight=weight.float())
    
    def forward(self, x):
        x = self.features_extractor(x)
        x = x.squeeze(-1).squeeze(-1)
        x = self.fc(x)
        return x
    
    def loss(self, logits, y):
        return self._loss(logits, y)
    
    def training_step(self, batch, batch_idx):
        # 1. Inference
        x, y = batch
        y_hat = self.forward(x)
        
        # 2. Loss
        loss = self.loss(y_hat, y)
        
        # 3. Output
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss(y_hat, y)
        return {'val_loss': loss}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr, weight_decay=0.01)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        return [optimizer], [scheduler]

    def prepare_data(self):
        self.train_ds = ClassificationDataset(self.train_df, 'data/images')
        self.val_ds = ClassificationDataset(self.val_df, 'data/images')

    def train_dataloader(self):
        # note: `train_sampler` is defined elsewhere in the reporter's script
        return torch.utils.data.DataLoader(self.train_ds, batch_size=256, num_workers=4, sampler=train_sampler)

    def val_dataloader(self):
        return torch.utils.data.DataLoader(self.val_ds, batch_size=64, num_workers=4)
    

Error

model = MyModel(train_df, val_df, test_df, hparams=Namespace(lr=0.001))
trainer = pl.Trainer(gpus=1, max_epochs=2, train_percent_check=0.01, weights_summary='top')
trainer.fit(model)
---------------------------------------------------------------------------
MisconfigurationException                 Traceback (most recent call last)
<ipython-input-412-55f3b29fc11e> in <module>
      4 # Trainer
      5 trainer = pl.Trainer(gpus=1, max_epochs=2, train_percent_check=0.01, weights_summary='top')
----> 6 trainer.fit(model)

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, test_dataloaders)
    702 
    703         elif self.single_gpu:
--> 704             self.single_gpu_train(model)
    705 
    706         elif self.use_tpu:  # pragma: no-cover

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py in single_gpu_train(self, model)
    475             self.optimizers = optimizers
    476 
--> 477         self.run_pretrain_routine(model)
    478 
    479     def tpu_train(self, tpu_core_idx, model):

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
    862 
    863         # CORE TRAINING LOOP
--> 864         self.train()
    865 
    866     def test(self, model: Optional[LightningModule] = None):

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
    364 
    365                 # update LR schedulers
--> 366                 self.update_learning_rates(interval='epoch')
    367 
    368                 if self.max_steps and self.max_steps == self.global_step:

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in update_learning_rates(self, interval)
    779                         avail_metrics = ','.join(list(self.callback_metrics.keys()))
    780                         raise MisconfigurationException(
--> 781                             f'ReduceLROnPlateau conditioned on metric {monitor_key}'
    782                             f' which is not available. Available metrics are: {avail_metrics}.'
    783                             ' Condition can be set using `monitor` key in lr scheduler dict'

MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: . Condition can be set using `monitor` key in lr scheduler dict

Environment

  • CUDA:
    • GPU:
      • Tesla P100-PCIE-16GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.4.0
    • pytorch-lightning: 0.7.3
    • tensorboard: 2.2.1
    • tqdm: 4.43.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor:
    • python: 3.7.6
    • version: #1 SMP Debian 4.9.210-1 (2020-01-20)

Additional context

Dataset

class ClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, df: pd.DataFrame, root_dir: pathlib.Path, test=False):
        self.df = df
        self.test = test
        self.root_dir = root_dir
        self.transforms = transforms.Compose([
            transforms.Resize(size=(224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
        self.number_of_categories = len(self.df.time_cat.cat.categories)
    
    def __getitem__(self, index):
        if torch.is_tensor(index):
            index = index.tolist()
        sample = datasets.folder.default_loader(pathlib.Path(self.root_dir) / pathlib.Path(self.df.iloc[index]['filename']))
        sample = self.transforms(sample)
        y = int(self.df.time_cat.cat.codes.iloc[index])
        return (sample, y)
    
    def __len__(self):
        return self.df.shape[0]

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 30 (5 by maintainers)

Most upvoted comments

Ran into the same problem. Is there any chance to get this fixed soon?

I found it!

PL Version: 1.6.0

    def validation_epoch_end(self, outputs):
        avg_loss = 0.0
        # ... do something to compute avg_loss ...
        self.log("val_loss", avg_loss)

Please invoke self.log("val_loss", avg_loss) in your LightningModule when overriding validation_epoch_end in your subclass.

That works for me.
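For reference, on more recent Lightning versions the same metric can usually be made available by logging directly from validation_step with on_epoch=True instead of overriding validation_epoch_end; a minimal sketch under that assumption (the exact self.log arguments are illustrative, not taken from this thread):

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss(self(x), y)
        # on_epoch=True aggregates the per-batch values into an epoch-level
        # "val_loss" that ReduceLROnPlateau / ModelCheckpoint can monitor
        self.log("val_loss", loss, on_step=False, on_epoch=True, prog_bar=True)
        return loss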

I’m encountering this issue as well. My current workaround is setting check_val_every_n_epoch=1 in my pl.Trainer. From some tests, it appears that if this is >1, the validation loop does not run after the first epoch (as expected), the metrics from that loop are not logged, and thus our error occurs.
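A minimal sketch of that workaround (gpus and max_epochs are placeholders, not values from this thread):

    import pytorch_lightning as pl

    # `model` is any LightningModule that logs "val_loss" during validation
    trainer = pl.Trainer(gpus=1, max_epochs=10, check_val_every_n_epoch=1)
    trainer.fit(model)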

I tried to outsmart PTL by adding the following code to my module.

    def on_train_start(self):
        if self.init_val_loss:
            print("initing val loss to 1000 for metric tracking")
            self.log("val_total_loss", 1000)
            self.init_val_loss = False

where self.init_val_loss is set in the init method of the model. TensorBoard found and reported a logged value of 1000 for val_total_loss on every epoch, even though my print statement only ran once. This seems like a separate issue, or a bug on my end, but my point is that it didn’t work and I am clearly not as smart as I thought I was.

Additionally, I’d like to note that the ModelCheckpoint callback throws a warning that it cannot find the correct metric, whereas the LR scheduler actually errors out. Seems that these should not have different behaviors.

How the heck do I make val_loss available for the LR scheduler?

This is my error:

pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: val_early_stop_on,val_checkpoint_on,checkpoint_on. Condition can be set using `monitor` key in lr scheduler dict

I return val_loss from the validation step like this:

    def validation_step(self, batch, batch_idx):
        ...
        loss = self.loss_funciton(masks_pred, masks)
        result = pl.EvalResult(loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True, prog_bar=True)
        ...
        return result

From a quick read of the code, it looks like only training metrics can be used with ReduceLROnPlateau.

As a test, adding the following code to your MyModel class should make the error disappear

def training_epoch_end(self, outputs):
    return {"val_loss": 1}

Same problem here, with torch==1.9.0 and pytorch-lightning==1.3.8.

@HuviX On PL version 1.2.3 I didn’t have this issue, but when I switched to the new environment and installed version 1.2.8, the issue appeared. So switching back to 1.2.3 worked for me.

@swd543 it is not correct that learning rate schedulers in lightning cannot be conditioned on specific values. Take this example from the docs:

def configure_optimizers(self):
    optimizers = [Adam(...), SGD(...)]
    schedulers = [
        {
            'scheduler': ReduceLROnPlateau(optimizers[0], ...),
            'monitor': 'val_loss',  # Default: val_loss
            'interval': 'epoch',
            'frequency': 1
        },
        LambdaLR(optimizers[1], ...)
    ]
    return optimizers, schedulers

here the scheduler gets conditioned on the monitor value which is set to be the validation loss.

@jovenwayfarer & @swd543 I don’t remember where I got it from but it does exist in the docs somewhere. See the comment over the monitor key in the scheduler dict. 😃

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(
            self.parameters(), lr=(self.lr or self.learning_rate)
        )
        lr_scheduler = ReduceLROnPlateau(optimizer, "min")
        scheduler = {
            "scheduler": lr_scheduler,
            "reduce_on_plateau": True,
            # val_checkpoint_on is val_loss passed in as checkpoint_on
            "monitor": "val_checkpoint_on",
            "patience": 5,
            "mode": "min",
            "factor": 0.1,
            "verbose": True,
            "min_lr": 1e-8,
        }
        return [optimizer], [scheduler]

Is this even a bug? It seems to me that you won’t be able to reduce the learning rate based on a metric that hasn’t been evaluated yet. Thus, if you want to condition on validation metrics, the frequency in the lr_scheduler_config returned by configure_optimizers must be at least as large as the check_val_every_n_epoch parameter of Trainer, or you can set "strict": False in the lr_scheduler_config.
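For concreteness, a minimal sketch of that interplay, assuming a recent Lightning version with the dict-style return from configure_optimizers (the hyperparameter values are placeholders):

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "monitor": "val_loss",   # must actually be logged, e.g. via self.log("val_loss", ...)
                "interval": "epoch",
                "frequency": 2,          # should line up with Trainer(check_val_every_n_epoch=2)
                "strict": False,         # warn instead of raising if "val_loss" is missing
            },
        }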

I am guessing the reason is that the step function of ReduceLROnPlateau is invoked before train_epoch_end(...) and hence an error happens at the end of epoch 0 when no val_loss has yet been logged?

I am facing the same issue as of lightning 0.9.0. Are there no plans to improve upon this? As I see it, learning rate schedulers that do not work with validation losses make me look towards other libraries.

I could get your code to work on dummy data. If you look closely at the error message you will see the key information: Available metrics are: .. Since not even your training loss is available as a metric, this means that your training_step has not been evaluated yet.

Your problem seems to be a combination of a small dataset (?), your choice of train_percent_check=0.01, and a large batch size, which means that your number of training batches gets rounded down to 0. You can see how the number of batches is calculated here: https://github.com/PyTorchLightning/pytorch-lightning/blob/f531ab957b05c97630d98fed18f9349b7e97046b/pytorch_lightning/trainer/data_loading.py#L165-L170 If I am correct that num_batches=0 in your case, this means that nothing is evaluated, and no metrics are available for your learning rate scheduler.
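To illustrate the rounding with made-up numbers (the dataset size is hypothetical; only batch_size and train_percent_check come from the report):

    import math

    n_samples = 10_000            # hypothetical number of training rows
    batch_size = 256              # from the reporter's train_dataloader
    train_percent_check = 0.01    # from the reporter's Trainer call

    num_batches = math.ceil(n_samples / batch_size)                  # 40
    num_training_batches = int(num_batches * train_percent_check)    # int(0.4) == 0
    # With 0 training batches nothing is evaluated, callback_metrics stays empty,
    # and the ReduceLROnPlateau monitor cannot be resolved.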