pytorch-lightning: Validation Loss: Metric not available
🐛 Bug
The metric val_loss is not found by ReduceLROnPlateau or by the progress bar display. However, calling print(val_loss) inside validation_step and validation_epoch_end works fine (it prints Tensor(value)).
Code sample
from argparse import Namespace

import torch
import pytorch_lightning as pl
from torchvision import models

class MyModel(pl.LightningModule):
    def __init__(self, train_df, val_df, test_df, hparams=Namespace(lr=0.02)):
        # Initialization
        super(MyModel, self).__init__()
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df
        self.hparams = hparams

        # Model structure
        backbone = models.resnet18(pretrained=False)
        self.features_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
        self.fc = torch.nn.Sequential(*[
            torch.nn.Linear(backbone.fc.in_features, 256, bias=True),
            torch.nn.Linear(256, 32, bias=True),
            torch.nn.Linear(32, 4, bias=True)
        ])

        # Loss (`weight` is a class-weight tensor assumed to be defined elsewhere in the notebook)
        self._loss = torch.nn.CrossEntropyLoss(weight=weight.float())

    def forward(self, x):
        x = self.features_extractor(x)
        x = x.squeeze(-1).squeeze(-1)
        x = self.fc(x)
        return x

    def loss(self, logits, y):
        return self._loss(logits, y)

    def training_step(self, batch, batch_idx):
        # 1. Inference
        x, y = batch
        y_hat = self.forward(x)
        # 2. Loss
        loss = self.loss(y_hat, y)
        # 3. Output
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss(y_hat, y)
        return {'val_loss': loss}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr, weight_decay=0.01)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        return [optimizer], [scheduler]

    def prepare_data(self):
        self.train_ds = ClassificationDataset(self.train_df, 'data/images')
        self.val_ds = ClassificationDataset(self.val_df, 'data/images')

    def train_dataloader(self):
        # `train_sampler` is assumed to be defined elsewhere in the notebook
        return torch.utils.data.DataLoader(self.train_ds, batch_size=256, num_workers=4, sampler=train_sampler)

    def val_dataloader(self):
        return torch.utils.data.DataLoader(self.val_ds, batch_size=64, num_workers=4)
Error
model = MyModel(train_df, val_df, test_df, hparams=Namespace(lr=0.001))
trainer = pl.Trainer(gpus=1, max_epochs=2, train_percent_check=0.01, weights_summary='top')
trainer.fit(model)
---------------------------------------------------------------------------
MisconfigurationException Traceback (most recent call last)
<ipython-input-412-55f3b29fc11e> in <module>
4 # Trainer
5 trainer = pl.Trainer(gpus=1, max_epochs=2, train_percent_check=0.01, weights_summary='top')
----> 6 trainer.fit(model)
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, test_dataloaders)
702
703 elif self.single_gpu:
--> 704 self.single_gpu_train(model)
705
706 elif self.use_tpu: # pragma: no-cover
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py in single_gpu_train(self, model)
475 self.optimizers = optimizers
476
--> 477 self.run_pretrain_routine(model)
478
479 def tpu_train(self, tpu_core_idx, model):
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
862
863 # CORE TRAINING LOOP
--> 864 self.train()
865
866 def test(self, model: Optional[LightningModule] = None):
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
364
365 # update LR schedulers
--> 366 self.update_learning_rates(interval='epoch')
367
368 if self.max_steps and self.max_steps == self.global_step:
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in update_learning_rates(self, interval)
779 avail_metrics = ','.join(list(self.callback_metrics.keys()))
780 raise MisconfigurationException(
--> 781 f'ReduceLROnPlateau conditioned on metric {monitor_key}'
782 f' which is not available. Available metrics are: {avail_metrics}.'
783 ' Condition can be set using `monitor` key in lr scheduler dict'
MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: . Condition can be set using `monitor` key in lr scheduler dict
Environment
- CUDA:
  - GPU:
    - Tesla P100-PCIE-16GB
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.18.1
  - pyTorch_debug: False
  - pyTorch_version: 1.4.0
  - pytorch-lightning: 0.7.3
  - tensorboard: 2.2.1
  - tqdm: 4.43.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor:
  - python: 3.7.6
  - version: #1 SMP Debian 4.9.210-1 (2020-01-20)
Additional context
Dataset
import pathlib

import pandas as pd
import torch
from torchvision import datasets, transforms

class ClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, df: pd.DataFrame, root_dir: pathlib.Path, test=False):
        self.df = df
        self.test = test
        self.root_dir = root_dir
        self.transforms = transforms.Compose([
            transforms.Resize(size=(224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
        self.number_of_categories = len(self.df.time_cat.cat.categories)

    def __getitem__(self, index):
        if torch.is_tensor(index):
            index = index.tolist()
        sample = datasets.folder.default_loader(pathlib.Path(self.root_dir) / pathlib.Path(self.df.iloc[index]['filename']))
        sample = self.transforms(sample)
        y = int(self.df.time_cat.cat.codes.iloc[index])
        return (sample, y)

    def __len__(self):
        return self.df.shape[0]
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 30 (5 by maintainers)
Ran into the same problem. Is there any chance to get this fixed soon?
I found it!
PL version: 1.6.0

def validation_epoch_end(self, outputs):
    avg_loss = 0.0
    # ... do something for avg_loss ...
    self.log('val_loss', avg_loss)

Please invoke self.log('val_loss', avg_loss) in your LightningModule when you override validation_epoch_end in your subclass. That works for me.
I'm encountering this issue as well. My current workaround is putting check_val_every_n_epoch=1 in my pl.Trainer. From some tests, it appears that if this is >1, it won't run through the validation loop after the first epoch (as expected), the metrics in said loop will not be logged, and thus our error occurs. I tried to outsmart PTL by adding the following code to my module.
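(The code block of this comment was not preserved in the issue text. Based only on the description below, it presumably looked something like the sketch here; init_val_loss, the value 1000, and the val_total_loss key come from the comment, everything else is guessed.)

# inside the LightningModule subclass (hypothetical reconstruction)
def validation_epoch_end(self, outputs):
    # fall back to a placeholder loss when the validation loop did not run
    if len(outputs) == 0:
        print("validation loop skipped, logging placeholder val_total_loss")
        return {'log': {'val_total_loss': self.init_val_loss}}  # init_val_loss = 1000, set in __init__
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    return {'avg_val_loss': avg_loss, 'log': {'val_total_loss': avg_loss}}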
where self.init_val_loss is set in the init method of the model. Tensorboard found and reported a logged value of 1000 for val_total_loss on every epoch, regardless of the fact that my print statement only ran once. This seems like a separate issue, or a bug on my end, but my point is it didn't work and I am clearly not as smart as I thought I was.

Additionally, I'd like to note that the ModelCheckpoint callback throws a warning that it cannot find the correct metric, whereas the LR scheduler actually errors out. Seems that these should not have different behaviors.

How the heck do I make val_loss available for the LR scheduler? This is my error:
I return val_loss from the validation step like this:
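(The snippet referenced here did not survive the scrape; it presumably resembled the validation_step in the code sample above, roughly:)

# inside the commenter's LightningModule (assumed shape, not the original code)
def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = self.loss(self(x), y)
    return {'val_loss': loss}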
A quick read of the code suggests that only training metrics can be used with ReduceLROnPlateau.

As a test, adding the following code to your MyModel class should make the error disappear.
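(The suggested snippet is missing from the scraped issue. A plausible sketch in the spirit of the comment is to condition ReduceLROnPlateau on a training metric via the scheduler dict's monitor key; the exact code is an assumption, not the original.)

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr, weight_decay=0.01)
    scheduler = {
        'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
        # condition on the training loss, which is always available, instead of 'val_loss'
        'monitor': 'loss',
    }
    return [optimizer], [scheduler]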
Same problem here, with torch==1.9.0 and pytorch-lightning==1.3.8.
@HuviX On PL version 1.2.3 I didn't have this issue, but when I switched to the new environment and installed version 1.2.8, the issue appeared. So switching back to 1.2.3 worked for me.
@swd543 it is not correct that learning rate schedulers in lightning cannot be conditioned on specific values. Take this example from the docs:
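(The docs snippet was not preserved here; it was roughly along these lines, with monitor set to the validation loss. Details may differ from the actual docs example.)

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    return {
        'optimizer': optimizer,
        'lr_scheduler': {
            'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
            # the scheduler steps on this logged metric, so it must be produced
            # by the validation loop (e.g. via self.log('val_loss', ...))
            'monitor': 'val_loss',
        },
    }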
Here the scheduler gets conditioned on the monitor value, which is set to be the validation loss.

@jovenwayfarer & @swd543 I don't remember where I got it from, but it does exist in the docs somewhere. See the comment over the monitor key in the scheduler dict.

Is this even a bug? Seems to me that you won't be able to reduce the learning rate based on a metric that hasn't been evaluated yet. Thus, the frequency in the lr_scheduler_config returned by configure_optimizers must always be greater than the check_val_every_n_epoch parameter of Trainer if you want to use validation metrics, or you must set "strict": False in the lr_scheduler_config.
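A minimal sketch of what this comment describes, assuming a PL 1.x-style configure_optimizers (key names follow the lr_scheduler_config documented there; they are not taken from this thread):

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    lr_scheduler_config = {
        'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
        'monitor': 'val_loss',
        # step once per epoch; keep this compatible with how often the
        # monitored metric is actually produced
        'interval': 'epoch',
        'frequency': 1,
        # don't error out if 'val_loss' has not been logged yet
        'strict': False,
    }
    return {'optimizer': optimizer, 'lr_scheduler': lr_scheduler_config}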
I am guessing the reason is that the step function of ReduceLROnPlateau is invoked before train_epoch_end(...), and hence an error happens at the end of epoch 0 when no val_loss has yet been logged?

I am facing the same issue as mentioned, as of lightning 0.9.0. Are there no plans to improve upon this? As I see it, learning rate schedulers that do not work with validation losses make me look towards other libraries.
I could get your code to work on dummy data. If you look closely at the error message you will see the hint: "Available metrics are: .". Since not even your loss is available as a metric, this means that your training_step has not been evaluated yet.

Your problem seems to be a combination of a small data size(?) and your choice of train_percent_check=0.01 together with a large batch size, which means that your number of batches gets rounded down to 0. You can see here how the number of batches is calculated: https://github.com/PyTorchLightning/pytorch-lightning/blob/f531ab957b05c97630d98fed18f9349b7e97046b/pytorch_lightning/trainer/data_loading.py#L165-L170 If I am correct that num_batches=0 in your case, this means that nothing is evaluated, and no metrics are available for your learning rate scheduler.
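A back-of-the-envelope illustration of that rounding (the dataset size here is made up; the int() truncation mirrors the linked code):

num_train_batches = 5000 // 256                                    # ~19 batches for a hypothetical 5,000-image dataset
train_percent_check = 0.01
num_train_batches = int(num_train_batches * train_percent_check)  # int(0.19) -> 0
# With 0 batches, training_step never runs, callback_metrics stays empty,
# and ReduceLROnPlateau has no metric (not even 'loss') to condition on.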