pytorch-lightning: Validation Loss: Metric not available
🐛 Bug
The metric `val_loss` is not found by `ReduceLROnPlateau` or by the progress bar display. However, calling `print(val_loss)` inside `validation_step` and `validation_epoch_end` works fine (it displays `Tensor(value)`).
Code sample
```python
from argparse import Namespace

import torch
import pytorch_lightning as pl
from torchvision import models


class MyModel(pl.LightningModule):
    def __init__(self, train_df, val_df, test_df, hparams=Namespace(lr=0.02)):
        # Initialization
        super(MyModel, self).__init__()
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df
        self.hparams = hparams
        # Model Structure
        backbone = models.resnet18(pretrained=False)
        self.features_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
        self.fc = torch.nn.Sequential(*[
            torch.nn.Linear(backbone.fc.in_features, 256, bias=True),
            torch.nn.Linear(256, 32, bias=True),
            torch.nn.Linear(32, 4, bias=True)
        ])
        # Loss (`weight` is defined elsewhere in the reporter's notebook)
        self._loss = torch.nn.CrossEntropyLoss(weight=weight.float())

    def forward(self, x):
        x = self.features_extractor(x)
        x = x.squeeze(-1).squeeze(-1)
        x = self.fc(x)
        return x

    def loss(self, logits, y):
        return self._loss(logits, y)

    def training_step(self, batch, batch_idx):
        # 1. Inference
        x, y = batch
        y_hat = self.forward(x)
        # 2. Loss
        loss = self.loss(y_hat, y)
        # 3. Output
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss(y_hat, y)
        return {'val_loss': loss}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr, weight_decay=0.01)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        return [optimizer], [scheduler]

    def prepare_data(self):
        self.train_ds = ClassificationDataset(self.train_df, 'data/images')
        self.val_ds = ClassificationDataset(self.val_df, 'data/images')

    def train_dataloader(self):
        # `train_sampler` is defined elsewhere in the reporter's notebook
        return torch.utils.data.DataLoader(self.train_ds, batch_size=256, num_workers=4, sampler=train_sampler)

    def val_dataloader(self):
        return torch.utils.data.DataLoader(self.val_ds, batch_size=64, num_workers=4)
```
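The exception under "Error" below hints at a fix: the condition can be set using a `monitor` key in the lr scheduler dict. A sketch of `configure_optimizers` in that form (the dict-in-list shape is an assumption based on the error text, not the reporter's code):

```python
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    # pass the scheduler as a dict so the trainer knows which logged
    # metric to condition ReduceLROnPlateau on
    return [optimizer], [{'scheduler': scheduler, 'monitor': 'val_loss'}]
```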
Error
```python
model = MyModel(train_df, val_df, test_df, hparams=Namespace(lr=0.001))
trainer = pl.Trainer(gpus=1, max_epochs=2, train_percent_check=0.01, weights_summary='top')
trainer.fit(model)
```
```
---------------------------------------------------------------------------
MisconfigurationException Traceback (most recent call last)
<ipython-input-412-55f3b29fc11e> in <module>
4 # Trainer
5 trainer = pl.Trainer(gpus=1, max_epochs=2, train_percent_check=0.01, weights_summary='top')
----> 6 trainer.fit(model)
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, test_dataloaders)
702
703 elif self.single_gpu:
--> 704 self.single_gpu_train(model)
705
706 elif self.use_tpu: # pragma: no-cover
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py in single_gpu_train(self, model)
475 self.optimizers = optimizers
476
--> 477 self.run_pretrain_routine(model)
478
479 def tpu_train(self, tpu_core_idx, model):
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
862
863 # CORE TRAINING LOOP
--> 864 self.train()
865
866 def test(self, model: Optional[LightningModule] = None):
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
364
365 # update LR schedulers
--> 366 self.update_learning_rates(interval='epoch')
367
368 if self.max_steps and self.max_steps == self.global_step:
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in update_learning_rates(self, interval)
779 avail_metrics = ','.join(list(self.callback_metrics.keys()))
780 raise MisconfigurationException(
--> 781 f'ReduceLROnPlateau conditioned on metric {monitor_key}'
782 f' which is not available. Available metrics are: {avail_metrics}.'
783 ' Condition can be set using `monitor` key in lr scheduler dict'
MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: . Condition can be set using `monitor` key in lr scheduler dict
```
Environment
- CUDA:
  - GPU:
    - Tesla P100-PCIE-16GB
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.18.1
  - pyTorch_debug: False
  - pyTorch_version: 1.4.0
  - pytorch-lightning: 0.7.3
  - tensorboard: 2.2.1
  - tqdm: 4.43.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor:
  - python: 3.7.6
  - version: #1 SMP Debian 4.9.210-1 (2020-01-20)
Additional context
Dataset
```python
import pathlib

import pandas as pd
import torch
from torchvision import datasets, transforms


class ClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, df: pd.DataFrame, root_dir: pathlib.Path, test=False):
        self.df = df
        self.test = test
        self.root_dir = root_dir
        self.transforms = transforms.Compose([
            transforms.Resize(size=(224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
        self.number_of_categories = len(self.df.time_cat.cat.categories)

    def __getitem__(self, index):
        if torch.is_tensor(index):
            index = index.tolist()
        sample = datasets.folder.default_loader(pathlib.Path(self.root_dir) / pathlib.Path(self.df.iloc[index]['filename']))
        sample = self.transforms(sample)
        y = int(self.df.time_cat.cat.codes.iloc[index])
        return (sample, y)

    def __len__(self):
        return self.df.shape[0]
```
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 30 (5 by maintainers)
Ran into the same problem. Is there any chance to get this fixed soon?
I found it!
PL Version: 1.6.0

```python
def validation_epoch_end(self, outputs):
    avg_loss = 0.0
    # ... do something for avg_loss ...
    self.log('val_loss', avg_loss)
```

Please invoke `self.log('val_loss', avg_loss)` in your `LightningModule` when overriding `validation_epoch_end` in your subclass.
That works for me.
I'm encountering this issue as well. My current workaround is putting `check_val_every_n_epochs=1` in my `pl.Trainer`. From some tests, it appears that if this is >1, it won't run through the validation loop after the first epoch (as expected), the metrics in said loop will not be logged, and thus our error occurs. I tried to outsmart PTL by adding the following code to my module:
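A plausible sketch of that snippet, reconstructed from the description that follows (the `on_train_start` hook and the logging call are assumptions; the original code was not preserved):

```python
# hypothetical reconstruction: pre-log a sentinel validation loss so the
# monitored key exists before the first real validation loop runs
def on_train_start(self):
    print('logging initial val_total_loss')         # the print mentioned below
    self.log('val_total_loss', self.init_val_loss)  # e.g. set to 1000 in __init__
```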
where `self.init_val_loss` is set in the init method of the model. Tensorboard found and reported a logged value of 1000 for `val_total_loss` on every epoch, regardless of the fact that my print statement only ran once. This seems like a separate issue, or a bug on my end, but my point is it didn't work and I am clearly not as smart as I thought I was. Additionally, I'd like to note that the `ModelCheckpoint` callback throws a warning that it cannot find the correct metric, whereas the LR scheduler actually errors out. Seems that these should not have different behaviors.

How the heck do I make `val_loss` available for the LR scheduler? This is my error:
I return `val_loss` from the validation step like this:

A quick read of the code suggests that only training metrics can be used with `ReduceLROnPlateau`. As a test, adding the following code to your `MyModel` class should make the error disappear:
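A sketch of what that test might have looked like (hypothetical; the original snippet was not preserved), exposing a `val_loss` key from the training loop so the scheduler can find it in `callback_metrics`:

```python
# hypothetical test: log the training loss under the 'val_loss' key so
# ReduceLROnPlateau has a metric to monitor even without validation
def training_epoch_end(self, outputs):
    avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
    return {'log': {'val_loss': avg_loss}}
```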
Same problem here, with torch==1.9.0 and pytorch-lightning==1.3.8.
@HuviX On PL version 1.2.3 I didn't have this issue, but when I switched to the new environment and installed version 1.2.8, the issue appeared. So switching back to 1.2.3 worked for me.
@swd543 it is not correct that learning rate schedulers in lightning cannot be conditioned on specific values. Take this example from the docs:
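A sketch of the pattern that comment points at (the exact docs snippet was not preserved in this thread; the `monitor` key is the part of interest):

```python
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    return {
        'optimizer': optimizer,
        'lr_scheduler': {
            'scheduler': scheduler,
            # metric to condition the scheduler on; must be logged by the model
            'monitor': 'val_loss',
        },
    }
```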
here the scheduler gets conditioned on the `monitor` value, which is set to be the validation loss.

@jovenwayfarer & @swd543 I don't remember where I got it from, but it does exist in the docs somewhere. See the comment over the `monitor` key in the `scheduler` dict. 🙂

Is this even a bug? Seems to me that you won't be able to reduce the learning rate based on a metric that hasn't been evaluated yet. Thus, the `frequency` in the `lr_scheduler_config` returned by `configure_optimizers` must always be greater than the `check_val_every_n_epoch` parameter of `Trainer` if you want to use validation metrics, or you can set `"strict": False` in the `lr_scheduler_config`.
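For reference, a sketch of such an `lr_scheduler_config` (the keys match recent PL versions; the `frequency` value is illustrative):

```python
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    return {
        'optimizer': optimizer,
        'lr_scheduler': {
            'scheduler': scheduler,
            'monitor': 'val_loss',
            'strict': False,  # warn instead of raising when the metric is missing
            'frequency': 1,   # illustrative; align with check_val_every_n_epoch
        },
    }
```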
ReduceLROnPlateauis invoked beforetrain_epoch_end(...)and hence an error happens at the end of epoch 0 when noval_losshas yet been logged?I am facing the same issue as mentioned as of lightning 0.9.0. Are there no plans to improve upon this? As I see it, learning rate schedulers that do not work with validation losses makes me look towards other libraries.
I could get your code to work on dummy data. If you look closely at the error message you will see the info: `Available metrics are: .`. Since not even your `loss` is available as a metric, this means that your `train_step` has not been evaluated yet.

Your problem seems to be a combination of a small dataset size (?) with your choice of `train_percent_check=0.01` and a large batch size, which means that your number of batches gets rounded down to 0. You can see here how the number of batches is calculated: https://github.com/PyTorchLightning/pytorch-lightning/blob/f531ab957b05c97630d98fed18f9349b7e97046b/pytorch_lightning/trainer/data_loading.py#L165-L170 If I am correct that `num_batches=0` in your case, this means that nothing is evaluated, and no metrics are available for your learning rate scheduler.