pytorch-lightning: ModelCheckpoint is not saving top k models

🐛 Bug

ModelCheckpoint is not correctly monitoring metric values.

To Reproduce

https://colab.research.google.com/drive/1onBmED7dngP_VwFxcFBMsnQi82KbizSk?usp=sharing

Expected behavior

ModelCheckpoint should save the top k models based on the monitored metric x, but instead it prints Epoch XXX, step XXX: x was not in top 2 for every epoch, and no top-k checkpoints are saved.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.2.0
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

The documentation doesn’t explain how to set the metric that ModelCheckpoint should monitor. I tried monitoring both x and the loss value, but ModelCheckpoint prints the same message in both cases. The message should also be clearer: when the chosen value doesn’t exist, it should say that ModelCheckpoint couldn’t find the value to monitor, rather than that it was not in the top k, since the same message is displayed when I monitor a value that doesn’t exist at all.
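To illustrate the ambiguity, here is a small standalone sketch (not Lightning's actual implementation) of how the two failure modes could yield distinct messages instead of one:

```python
def checkpoint_message(logged_metrics, monitor, top_k_values, k):
    """Return a status message for a monitored metric.

    logged_metrics: dict of metric name -> value from the last epoch.
    top_k_values: the best k metric values saved so far.
    Illustrative sketch only, assuming "min" mode (smaller is better).
    """
    if monitor not in logged_metrics:
        # Distinct message when the key was never logged at all.
        return f"{monitor} was not found in the logged metrics"
    value = logged_metrics[monitor]
    if len(top_k_values) < k or value < max(top_k_values):
        return f"{monitor}={value} reached top {k}"
    return f"{monitor} was not in top {k}"

# Monitoring a key that was never logged gives its own message:
print(checkpoint_message({"loss": 0.5}, "x", [0.4, 0.6], 2))
# -> x was not found in the logged metrics
```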

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

Yes, I can send the PR for the doc. What do you think about the following wording?

Every metric logged with self.log or self.log_dict in LightningModule is a candidate for the monitor key. For more information, see checkpoint saving.