pytorch-lightning: default EarlyStopping callback should not fail on missing val_loss data

Describe the bug

My training script failed overnight — this is the last thing I see in the logs before the instance shut down:

python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: avg_val_loss_total,avg_val_jacc10,avg_val_ce
  RuntimeWarning)

It seems like this was intended to be a warning, but it appears to have interrupted my training script. Do you think that's possible, or could it be something else? I had two training scripts running on two different instances last night, and both shut down this way, with this RuntimeWarning as the last line in the logs.

Is it possible that the default EarlyStopping callback killed my script because I didn't log a val_loss tensor somewhere it could find it? To be clear, it is not my intention to use EarlyStopping at all, so I was quite surprised to wake up today to a shut-down instance and interrupted training, with no clear sign of a bug on my end. Did you intend this to interrupt the trainer? If so, how do we feel about changing that so the default EarlyStopping callback has no effect when it can't find a val_loss metric?
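For reference, this is roughly what I expected to be able to do to opt out of early stopping entirely. A minimal sketch, assuming the 0.x-era Trainer API where early_stop_callback could be set to False to disable the default callback (the exact argument may differ across versions):

# Hedged sketch: assumes the older Trainer signature in which
# early_stop_callback=False disables the default EarlyStopping callback.
from pytorch_lightning import Trainer

# No early stopping at all; training runs for the configured number of epochs.
trainer = Trainer(early_stop_callback=False)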

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20 (14 by maintainers)

Most upvoted comments

I agree that it is quite unpleasant when the default early stopping callback unexpectedly stops training because it can't find val_loss. It is also unpleasant that you only find out the required metric is missing at the end of the first full training epoch (and on top of that, training stops). So I would separate this into two problems:

  1. The default early stopping should not stop training. We should either disable it when no val_loss is found, or disable it entirely by default (if early stopping is wanted, it can be configured explicitly; see the sketch after this list).
  2. We should check at the very beginning of training that the metric required by early stopping will be available after the validation loop. Right now it is only checked at the end of the first training epoch, and if the metric is missing, training stops.
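If early stopping is actually wanted, the opt-in route would be to configure the callback explicitly so it monitors a metric that the validation loop really produces. A hedged sketch, assuming the 0.x-era API where the callback is passed via Trainer(early_stop_callback=...); avg_val_loss_total is taken from the warning in this issue and stands in for whatever your model logs:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# The monitored key must match something returned from the validation loop,
# otherwise the callback hits the same "metric not available" condition.
early_stop = EarlyStopping(
    monitor="avg_val_loss_total",
    patience=3,
    mode="min",
)

# Assumes the older API where the callback object is passed to the Trainer.
trainer = Trainer(early_stop_callback=early_stop)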

I'm guessing you were using check_val_every_n_epoch > 1. This happens because callback_metrics is what early stopping reads, and it is cleared and re-filled at every training-step logging call. A hacky workaround I have found is to save the last val_loss as a model attribute, self.val_loss, and return it from every training step, e.g. output = {'loss': loss, 'log': log_dict, 'progress_bar': prog_dict, 'val_loss': self.val_loss}. A fuller sketch of this workaround is below.
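A minimal sketch of that workaround, assuming the 0.x-era dict-returning hooks (training_step / validation_epoch_end); MyModel, the layer, and the loss computation are placeholders:

import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    # Placeholder module illustrating the cached-val_loss workaround.

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        # Cache the latest validation loss so it can be re-exposed to
        # callback_metrics on every training step.
        self.val_loss = torch.tensor(float("inf"))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Returning the cached val_loss keeps it in callback_metrics between
        # validation runs, so early stopping can still find it.
        return {
            "loss": loss,
            "log": {"train_loss": loss},
            "progress_bar": {"train_loss": loss},
            "val_loss": self.val_loss,
        }

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {"val_loss": torch.nn.functional.mse_loss(self.layer(x), y)}

    def validation_epoch_end(self, outputs):
        avg = torch.stack([o["val_loss"] for o in outputs]).mean()
        self.val_loss = avg  # refresh the cache for the next training steps
        return {"val_loss": avg, "log": {"val_loss": avg}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

This keeps val_loss present in callback_metrics at every training step, at the cost of duplicating a value the validation loop already produced.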

Thanks for the quick update. pytorch-lightning-1.5.0.dev0 (current master branch) works.

Hello. I am still getting a similar problem. Has this been confirmed as solved?