pytorch-lightning: default EarlyStopping callback should not fail on missing val_loss data

Describe the bug

My training script failed overnight — this is the last thing I see in the logs before the instance shut down:

python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: avg_val_loss_total,avg_val_jacc10,avg_val_ce
  RuntimeWarning)

It seems like this was intended to be a warning, but it appears to have interrupted my training script. Do you think that's possible, or could it be something else? I had two training scripts running on two different instances last night, and both shut down this way, with this RuntimeWarning as the last line in the logs.

Is it possible that the default EarlyStopping callback killed my script because I didn't log a val_loss tensor somewhere it could find it? To be clear, it is not my intention to use EarlyStopping at all, so I was quite surprised to wake up today to a shut-down instance and interrupted training, with no clear sign of a bug on my end. Did you intend this to interrupt the trainer? If so, how do we feel about changing that so the default EarlyStopping callback has no effect when it can't find a val_loss metric?
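For reference, this is roughly what I expected to be able to do to opt out of early stopping entirely. A minimal sketch, assuming the 0.x-era Trainer API where early_stop_callback could be set to False to disable the default callback (the exact argument may differ across versions):

# Hedged sketch: assumes the older Trainer signature in which
# early_stop_callback=False disables the default EarlyStopping callback.
from pytorch_lightning import Trainer

# No early stopping at all; training runs for the configured number of epochs.
trainer = Trainer(early_stop_callback=False)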

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20 (14 by maintainers)

Most upvoted comments

I agree that it is quite unpleasant when the default early stopping callback unexpectedly stops training because it can't find val_loss. It is also unpleasant that you only find out the required metric is missing at the end of the first full training epoch (and on top of that, training stops). So I would separate this into two problems:

  1. The default early stopping should not stop training. We should either disable it when no val_loss is found, or disable it entirely by default (if early stopping is wanted, it can be configured explicitly; see the sketch after this list).
  2. We should check at the very beginning of training that the metric required by early stopping will be available after the validation loop. Right now it is only checked at the end of the first training epoch, and if the metric is missing, training stops.
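If early stopping is actually wanted, the opt-in route would be to configure the callback explicitly so it monitors a metric that the validation loop really produces. A hedged sketch, assuming the 0.x-era API where the callback is passed via Trainer(early_stop_callback=...); avg_val_loss_total is taken from the warning in this issue and stands in for whatever your model logs:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# The monitored key must match something returned from the validation loop,
# otherwise the callback hits the same "metric not available" condition.
early_stop = EarlyStopping(
    monitor="avg_val_loss_total",
    patience=3,
    mode="min",
)

# Assumes the older API where the callback object is passed to the Trainer.
trainer = Trainer(early_stop_callback=early_stop)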

I'm guessing you were using check_val_every_n_epoch > 1. This happens because callback_metrics is what early stopping reads, and it is cleared and re-filled at every training-step logging call. A hacky workaround I have found is to save the last val_loss as a model attribute, self.val_loss, and return it from every training step, e.g. output = {'loss': loss, 'log': log_dict, 'progress_bar': prog_dict, 'val_loss': self.val_loss}. A fuller sketch of this workaround is below.
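A minimal sketch of that workaround, assuming the 0.x-era dict-returning hooks (training_step / validation_epoch_end); MyModel, the layer, and the loss computation are placeholders:

import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    # Placeholder module illustrating the cached-val_loss workaround.

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        # Cache the latest validation loss so it can be re-exposed to
        # callback_metrics on every training step.
        self.val_loss = torch.tensor(float("inf"))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Returning the cached val_loss keeps it in callback_metrics between
        # validation runs, so early stopping can still find it.
        return {
            "loss": loss,
            "log": {"train_loss": loss},
            "progress_bar": {"train_loss": loss},
            "val_loss": self.val_loss,
        }

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {"val_loss": torch.nn.functional.mse_loss(self.layer(x), y)}

    def validation_epoch_end(self, outputs):
        avg = torch.stack([o["val_loss"] for o in outputs]).mean()
        self.val_loss = avg  # refresh the cache for the next training steps
        return {"val_loss": avg, "log": {"val_loss": avg}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

This keeps val_loss present in callback_metrics at every training step, at the cost of duplicating a value the validation loop already produced.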

Thanks for the quick update. pytorch-lightning-1.5.0.dev0 (current master branch) works.

Hello. I am still getting a similar problem. Has this been confirmed as solved?