pytorch-lightning: Validation Loss: Metric not available

šŸ› Bug

The metric val_loss is not found for ReduceLROnPlateau or for the progress bar display, but calling print(val_loss) in validation_step and validation_epoch_end works fine (it displays Tensor(value)).

Code sample

class MyModel(pl.LightningModule):    
    def __init__(self, train_df, val_df, test_df, hparams = Namespace(lr = 0.02)):
        # Initialization
        super(MyModel, self).__init__()
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df
        self.hparams = hparams
        
        # Model Structure
        backbone = models.resnet18(pretrained=False)
        self.features_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
        self.fc = torch.nn.Sequential(*[
            torch.nn.Linear(backbone.fc.in_features, 256, bias=True),
            torch.nn.Linear(256, 32, bias=True),
            torch.nn.Linear(32, 4, bias=True)
        ])
        
        # Loss (note: `weight` is a class-weight tensor defined elsewhere in the reporter's script)
        self._loss = torch.nn.CrossEntropyLoss(weight=weight.float())
    
    def forward(self, x):
        x = self.features_extractor(x)
        x = x.squeeze(-1).squeeze(-1)
        x = self.fc(x)
        return x
    
    def loss(self, logits, y):
        return self._loss(logits, y)
    
    def training_step(self, batch, batch_idx):
        # 1. Inference
        x, y = batch
        y_hat = self.forward(x)
        
        # 2. Loss
        loss = self.loss(y_hat, y)
        
        # 3. Output
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss(y_hat, y)
        return {'val_loss': loss}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr, weight_decay=0.01)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        return [optimizer], [scheduler]

    def prepare_data(self):
        self.train_ds = ClassificationDataset(self.train_df, 'data/images')
        self.val_ds = ClassificationDataset(self.val_df, 'data/images')

    def train_dataloader(self):
        # note: `train_sampler` is defined elsewhere in the reporter's script
        return torch.utils.data.DataLoader(self.train_ds, batch_size=256, num_workers=4, sampler=train_sampler)

    def val_dataloader(self):
        return torch.utils.data.DataLoader(self.val_ds, batch_size=64, num_workers=4)
    

Error

model = MyModel(train_df, val_df, test_df, hparams=Namespace(lr=0.001))
trainer = pl.Trainer(gpus=1, max_epochs=2, train_percent_check=0.01, weights_summary='top')
trainer.fit(model)
---------------------------------------------------------------------------
MisconfigurationException                 Traceback (most recent call last)
<ipython-input-412-55f3b29fc11e> in <module>
      4 # Trainer
      5 trainer = pl.Trainer(gpus=1, max_epochs=2, train_percent_check=0.01, weights_summary='top')
----> 6 trainer.fit(model)

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, test_dataloaders)
    702 
    703         elif self.single_gpu:
--> 704             self.single_gpu_train(model)
    705 
    706         elif self.use_tpu:  # pragma: no-cover

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py in single_gpu_train(self, model)
    475             self.optimizers = optimizers
    476 
--> 477         self.run_pretrain_routine(model)
    478 
    479     def tpu_train(self, tpu_core_idx, model):

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
    862 
    863         # CORE TRAINING LOOP
--> 864         self.train()
    865 
    866     def test(self, model: Optional[LightningModule] = None):

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
    364 
    365                 # update LR schedulers
--> 366                 self.update_learning_rates(interval='epoch')
    367 
    368                 if self.max_steps and self.max_steps == self.global_step:

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in update_learning_rates(self, interval)
    779                         avail_metrics = ','.join(list(self.callback_metrics.keys()))
    780                         raise MisconfigurationException(
--> 781                             f'ReduceLROnPlateau conditioned on metric {monitor_key}'
    782                             f' which is not available. Available metrics are: {avail_metrics}.'
    783                             ' Condition can be set using `monitor` key in lr scheduler dict'

MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: . Condition can be set using `monitor` key in lr scheduler dict

Environment

  • CUDA:
    • GPU:
      • Tesla P100-PCIE-16GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.4.0
    • pytorch-lightning: 0.7.3
    • tensorboard: 2.2.1
    • tqdm: 4.43.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor:
    • python: 3.7.6
    • version: #1 SMP Debian 4.9.210-1 (2020-01-20)

Additional context

Dataset

class ClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, df: pd.DataFrame, root_dir: pathlib.Path, test=False):
        self.df = df
        self.test = test
        self.root_dir = root_dir
        self.transforms = transforms.Compose([
            transforms.Resize(size=(224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
        self.number_of_categories = len(self.df.time_cat.cat.categories)
    
    def __getitem__(self, index):
        if torch.is_tensor(index):
            index = index.tolist()
        sample = datasets.folder.default_loader(pathlib.Path(self.root_dir) / pathlib.Path(self.df.iloc[index]['filename']))
        sample = self.transforms(sample)
        y = int(self.df.time_cat.cat.codes.iloc[index])
        return (sample, y)
    
    def __len__(self):
        return self.df.shape[0]

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 30 (5 by maintainers)

Most upvoted comments

Ran into the same problem. Is there any chance to get this fixed soon?

I found it!

PL Version: 1.6.0

    def validation_epoch_end(self, outputs):
        avg_loss = 0.0
        # ... do something to compute avg_loss ...
        self.log("val_loss", avg_loss)

Please invoke self.log("val_loss", avg_loss) in your LightningModule when overriding validation_epoch_end in your subclass.

That works for me.
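For reference, on more recent Lightning versions the same metric can usually be made available by logging directly from validation_step with on_epoch=True instead of overriding validation_epoch_end; a minimal sketch under that assumption (the exact self.log arguments are illustrative, not taken from this thread):

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss(self(x), y)
        # on_epoch=True aggregates the per-batch values into an epoch-level
        # "val_loss" that ReduceLROnPlateau / ModelCheckpoint can monitor
        self.log("val_loss", loss, on_step=False, on_epoch=True, prog_bar=True)
        return loss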

I’m encountering this issue as well. My current workaround is setting check_val_every_n_epoch=1 in my pl.Trainer. From some tests, it appears that if this is >1, the validation loop does not run after the first epoch (as expected), the metrics from that loop are not logged, and thus our error occurs.
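A minimal sketch of that workaround (gpus and max_epochs are placeholders, not values from this thread):

    import pytorch_lightning as pl

    # `model` is any LightningModule that logs "val_loss" during validation
    trainer = pl.Trainer(gpus=1, max_epochs=10, check_val_every_n_epoch=1)
    trainer.fit(model)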

I tried to outsmart PTL by adding the following code to my module.

    def on_train_start(self):
        if self.init_val_loss:
            print("initing val loss to 1000 for metric tracking")
            self.log("val_total_loss", 1000)
            self.init_val_loss = False

where self.init_val_loss is set in the init method of the model. TensorBoard found and reported a logged value of 1000 for val_total_loss on every epoch, even though my print statement only ran once. This seems like a separate issue, or a bug on my end, but my point is that it didn’t work and I am clearly not as smart as I thought I was.

Additionally, I’d like to note that the ModelCheckpoint callback throws a warning that it cannot find the correct metric, whereas the LR scheduler actually errors out. Seems that these should not have different behaviors.

How the heck do I make val_loss available for the LR scheduler?

This is my error:

pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: val_early_stop_on,val_checkpoint_on,checkpoint_on. Condition can be set using `monitor` key in lr scheduler dict

I return val_loss from the validation step like this:

    def validation_step(self, batch, batch_idx):
        ...
        loss = self.loss_funciton(masks_pred, masks)
        result = pl.EvalResult(loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True, prog_bar=True)
        ...
        return result

From a quick read of the code, it looks like only training metrics can be used with ReduceLROnPlateau.

As a test, adding the following code to your MyModel class should make the error disappear

def training_epoch_end(self, outputs):
    return {"val_loss": 1}

Same problem here, with torch==1.9.0 and pytorch-lightning==1.3.8.

@HuviX On PL version 1.2.3 I didn’t have this issue, but when I switched to the new environment and installed version 1.2.8, the issue appeared. So switching back to 1.2.3 worked for me.

@swd543 it is not correct that learning rate schedulers in lightning cannot be conditioned on specific values. Take this example from the docs:

def configure_optimizers(self):
    optimizers = [Adam(...), SGD(...)]
    schedulers = [
        {
            'scheduler': ReduceLROnPlateau(optimizers[0], ...),
            'monitor': 'val_loss',  # Default: val_loss
            'interval': 'epoch',
            'frequency': 1
        },
        LambdaLR(optimizers[1], ...)
    ]
    return optimizers, schedulers

here the scheduler gets conditioned on the monitor value which is set to be the validation loss.

@jovenwayfarer & @swd543 I don’t remember where I got it from but it does exist in the docs somewhere. See the comment over the monitor key in the scheduler dict. 😃

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(
            self.parameters(), lr=(self.lr or self.learning_rate)
        )
        lr_scheduler = ReduceLROnPlateau(optimizer, "min")
        scheduler = {
            "scheduler": lr_scheduler,
            "reduce_on_plateau": True,
            # val_checkpoint_on is val_loss passed in as checkpoint_on
            "monitor": "val_checkpoint_on",
            "patience": 5,
            "mode": "min",
            "factor": 0.1,
            "verbose": True,
            "min_lr": 1e-8,
        }
        return [optimizer], [scheduler]

Is this even a bug? It seems to me that you won’t be able to reduce the learning rate based on a metric that hasn’t been evaluated yet. Thus, if you want to condition on validation metrics, the frequency in the lr_scheduler_config returned by configure_optimizers must be at least as large as the check_val_every_n_epoch parameter of Trainer, or you can set "strict": False in the lr_scheduler_config.
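For concreteness, a minimal sketch of that interplay, assuming a recent Lightning version with the dict-style return from configure_optimizers (the hyperparameter values are placeholders):

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "monitor": "val_loss",   # must actually be logged, e.g. via self.log("val_loss", ...)
                "interval": "epoch",
                "frequency": 2,          # should line up with Trainer(check_val_every_n_epoch=2)
                "strict": False,         # warn instead of raising if "val_loss" is missing
            },
        }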

I am guessing the reason is that the step function of ReduceLROnPlateau is invoked before train_epoch_end(...) and hence an error happens at the end of epoch 0 when no val_loss has yet been logged?

I am facing the same issue as of lightning 0.9.0. Are there no plans to improve upon this? As I see it, learning rate schedulers that do not work with validation losses make me look towards other libraries.

I could get your code to work on dummy data. If you look closely at the error message you will see the key information: Available metrics are: .. Since not even your training loss is available as a metric, this means that your training_step has not been evaluated yet.

Your problem seems to be a combination of a small dataset (?), your choice of train_percent_check=0.01, and a large batch size, which means that your number of training batches gets rounded down to 0. You can see how the number of batches is calculated here: https://github.com/PyTorchLightning/pytorch-lightning/blob/f531ab957b05c97630d98fed18f9349b7e97046b/pytorch_lightning/trainer/data_loading.py#L165-L170 If I am correct that num_batches=0 in your case, this means that nothing is evaluated, and no metrics are available for your learning rate scheduler.
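To illustrate the rounding with made-up numbers (the dataset size is hypothetical; only batch_size and train_percent_check come from the report):

    import math

    n_samples = 10_000            # hypothetical number of training rows
    batch_size = 256              # from the reporter's train_dataloader
    train_percent_check = 0.01    # from the reporter's Trainer call

    num_batches = math.ceil(n_samples / batch_size)                  # 40
    num_training_batches = int(num_batches * train_percent_check)    # int(0.4) == 0
    # With 0 training batches nothing is evaluated, callback_metrics stays empty,
    # and the ReduceLROnPlateau monitor cannot be resolved.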