pytorch-lightning: ModelCheckpoint: save_top_k > 1 cannot recognize ordering of models from ckpt names
🐛 Bug
Short: the ModelCheckpoint callback with save_top_k does not encode any semantic meaning (the ordering of the models) in the names of the saved files.
Say for save_top_k=5 it saves the 5 best models into 5 files, but the user cannot tell which is which! (The ModelCheckpoint knows, but only within the training session, not if we want to resume.)
My motivation: I want to access the "2nd best" model.
To Reproduce
Reproducible code: https://github.com/PyTorchLightning/pytorch-lightning/issues/9944#issuecomment-944610705
Using this ModelCheckpoint config for the example:
model_checkpoint:
  _target_: pytorch_lightning.callbacks.ModelCheckpoint
  monitor: 'val/loss'   # name of the logged metric
  save_top_k: 5         # save the k best models (-1 = save all, 0 = don't save)
  save_last: True       # always save the model from the last epoch
  verbose: True         # show more detailed info during training
  mode: min             # can be "max" or "min"
  dirpath: 'xxx'
  filename: 'best'      # the current epoch number and metrics may also be used for naming
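For reference, a rough Python equivalent of the Hydra config above (a minimal sketch; the Trainer arguments and the model/datamodule are placeholders, not taken from the original example):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the 5 best checkpoints by val/loss, all under the fixed stem "best" in "xxx".
checkpoint_callback = ModelCheckpoint(
    monitor="val/loss",
    save_top_k=5,
    save_last=True,
    verbose=True,
    mode="min",
    dirpath="xxx",
    filename="best",
)

trainer = Trainer(callbacks=[checkpoint_callback], max_epochs=10)
# trainer.fit(model, datamodule=datamodule)  # placeholder model/datamodule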
The ModelCheckpoint just cyclically saves the new best model as best-v{current_epoch % k}.ckpt, as seen from the following training log:
Training: -1it [00:00, ?it/s]
EPOCH 0
Epoch 0: 100%|██████████| 2100/2100 [00:21<00:00, 96.04it/s]
train error: 0.3088467121
validation error: 0.1557712555
Epoch 0, global step 2055: val/loss reached 0.15577 (best 0.15577), saving model to "xxx/best.ckpt" as top 5
EPOCH 1
Epoch 1: 100%|██████████| 2100/2100 [00:20<00:00, 100.72it/s]
train error: 0.3018136621
validation error: 0.1548990458
Epoch 1, global step 4111: val/loss reached 0.15490 (best 0.15490), saving model to "xxx/best-v1.ckpt" as top 5
EPOCH 2
Epoch 2: 100%|██████████| 2100/2100 [00:21<00:00, 97.58it/s]
train error: 0.2986578345
validation error: 0.1544228494
Epoch 2, global step 6167: val/loss reached 0.15442 (best 0.15442), saving model to "xxx/best-v2.ckpt" as top 5
EPOCH 3
Epoch 3: 100%|██████████| 2100/2100 [00:21<00:00, 98.33it/s]
train error: 0.2965526581
validation error: 0.1539662182
Epoch 3, global step 8223: val/loss reached 0.15397 (best 0.15397), saving model to "xxx/best-v3.ckpt" as top 5
EPOCH 4
Epoch 4: 100%|██████████| 2100/2100 [00:21<00:00, 99.14it/s]
train error: 0.2950037122
validation error: 0.1536256075
Epoch 4, global step 10279: val/loss reached 0.15363 (best 0.15363), saving model to "xxx/best-v4.ckpt" as top 5
EPOCH 5
Epoch 5: 100%|██████████| 2100/2100 [00:21<00:00, 97.18it/s]
train error: 0.2937672734
validation error: 0.1534034163
Epoch 5, global step 12335: val/loss reached 0.15340 (best 0.15340), saving model to "xxx/best.ckpt" as top 5
EPOCH 6
Epoch 6: 100%|██████████| 2100/2100 [00:21<00:00, 98.24it/s]
train error: 0.2926785052
validation error: 0.1531589478
Epoch 6, global step 14391: val/loss reached 0.15316 (best 0.15316), saving model to "xxx/best-v1.ckpt" as top 5
EPOCH 7
Epoch 7: 100%|██████████| 2100/2100 [00:21<00:00, 96.33it/s]
train error: 0.2916747034
validation error: 0.1529426873
Epoch 7, global step 16447: val/loss reached 0.15294 (best 0.15294), saving model to "xxx/best-v2.ckpt" as top 5
EPOCH 8
Epoch 8: 100%|██████████| 2100/2100 [00:22<00:00, 92.43it/s]
train error: 0.2907347977
validation error: 0.1527983993
Epoch 8, global step 18503: val/loss reached 0.15280 (best 0.15280), saving model to "xxx/best-v3.ckpt" as top 5
EPOCH 9
Epoch 9: 100%|██████████| 2100/2100 [00:20<00:00, 101.61it/s]
train error: 0.2898018062
validation error: 0.1526378989
Epoch 9, global step 20559: val/loss reached 0.15264 (best 0.15264), saving model to "xxx/best-v4.ckpt" as top 5
Epoch 9: 100%|██████████| 2100/2100 [00:20<00:00, 101.42it/s, loss=0.289, v_num=gdir]
Saving latest checkpoint...
Setting period from checkpoint test_set
In this example, the real best model is:
Epoch 9, global step 20559: val/loss reached 0.15264 (best 0.15264), saving model to "xxx/best-v4.ckpt"
Now the trick is that in trainer.test(..., ckpt_path="best") (same for validate() and predict())
the "best" is not "best.ckpt" but the best filename that only the callback knows, as seen from the following code used by the above methods:
def __load_ckpt_weights(self, ckpt_path: Optional[str]) -> Optional[str]:
    if ckpt_path is None:
        return
    fn = self.state.fn.value
    if ckpt_path == "best":
        # if user requests the best checkpoint but we don't have it, error
        if not self.checkpoint_callback.best_model_path:
            if self.fast_dev_run:
                raise MisconfigurationException(
                    f"You cannot execute `.{fn}()` with `fast_dev_run=True` unless you do"
                    f" `.{fn}(ckpt_path=PATH)` as no checkpoint path was generated during fitting."
                )
            raise MisconfigurationException(
                f'`.{fn}(ckpt_path="best")` is set but `ModelCheckpoint` is not configured to save the best model.'
            )
        # load best weights
        ckpt_path = self.checkpoint_callback.best_model_path
    ...
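Within the same session the callback does expose the mapping from files to monitored values, so the ordering can still be recovered programmatically; a minimal sketch (assuming the best_model_path / best_k_models attributes of ModelCheckpoint in PL 1.4.x and mode="min"):

# After trainer.fit(...) in the same session, the callback still knows the ordering:
ckpt_cb = trainer.checkpoint_callback

print(ckpt_cb.best_model_path)  # e.g. "xxx/best-v4.ckpt" in the log above

# best_k_models maps filepath -> monitored value; sort ascending for mode="min"
ranked = sorted(ckpt_cb.best_k_models.items(), key=lambda kv: float(kv[1]))
for rank, (path, score) in enumerate(ranked, start=1):
    print(f"top {rank}: {path} (val/loss={float(score):.5f})")

Once the process exits, this mapping is gone, which is exactly the problem described below.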
Problems:
- The ordering might not be recoverable after the callback is finished (end of the training session). Then the info "which saved checkpoint file is really the best" is lost.
- The filenames carry no semantics (ordering).
- best.ckpt is not always the best model. That is confusing.
Expected behavior
The ModelCheckpoint callback would keep an ordering of the top_k best models and re-save them accordingly on each change.
So best.ckpt = always the best model, best_v4.ckpt = the 5th best model. The filenames (the semantics) would remain the same during training; the contents (the models) of the files would be updated.
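A rough sketch of what that re-saving could look like, outside of Lightning (a hypothetical helper, assuming mode="min" and the best_k_models mapping mentioned above; it only renames existing files rather than re-serializing models):

import os

def rerank_checkpoints(best_k_models, dirpath, stem="best"):
    """Rename the currently saved top-k checkpoints so the filename always
    reflects the rank: best.ckpt = 1st, best_v1.ckpt = 2nd, ..."""
    ranked = sorted(best_k_models.items(), key=lambda kv: float(kv[1]))  # best first for mode="min"

    # Two passes so no file is overwritten by another one that is being renamed.
    tmp_paths = []
    for rank, (old_path, score) in enumerate(ranked):
        tmp = os.path.join(dirpath, f".rerank_tmp_{rank}.ckpt")
        os.replace(old_path, tmp)
        tmp_paths.append((tmp, score))

    new_mapping = {}
    for rank, (tmp, score) in enumerate(tmp_paths):
        suffix = "" if rank == 0 else f"_v{rank}"
        final_path = os.path.join(dirpath, f"{stem}{suffix}.ckpt")
        os.replace(tmp, final_path)
        new_mapping[final_path] = score
    return new_mapping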
Benefits:
- best.ckpt is the best model, so users can manually load it for use-cases other than test(ckpt_path="best")
- we can access the Nth best model
Environment
latest PL 1.4.9
About this issue
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 18 (13 by maintainers)
Hi @carmocca,
exactly, this is the idea.

oh, I see. So the ModelCheckpoint actually does not work correctly with a static {filename} ("best"), i.e. without modifiers such as {epoch}. It was always trying to override the same file. What I was seeing as new filenames was the "avoid filename clashes" mechanism being misused there.

I'd say no, as this seems to be a bug-fix for broken functionality with save_top_k > 1, and the naming of "new versions" as {filename}-v{N}.ckpt should remain and still be used for that purpose. So I'd keep the -vN suffix and, in addition, newly add the _top{k} suffix. Both can be combined, so you could have best_top2-v1.ckpt (1st copy of the 2nd best model).

I'd suggest {filename}_top{k}.ckpt (which can be easily parsed, unlike "2nd, 3rd"), and additionally {filename}.ckpt -> {filename}_top1.ckpt for consistency and ease of use.

Furthermore (this might be a separate issue, but we might as well solve it here): I find this a security risk from a deployment perspective! You'd expect your latest model to be saved in {filename}.ckpt, but it might be saved to {filename}-vN.ckpt with just a warning somewhere in the logs. Then you might easily deploy a different (old) model to production, causing real problems. For this I suggest adding an argument automatic_version_increment_filename: bool (can default to True for consistency). If set to False, either assert or override the file if it exists.

Thank you for analyzing this issue,
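A rough sketch of generating and parsing the proposed names (the _top{k} and -v{N} pieces follow the suggestion above; the helper names are hypothetical and nothing here exists in Lightning):

import re

def top_k_filename(stem, rank, version=0):
    """Proposed scheme: '{stem}_top{rank}.ckpt', with '-v{N}' appended only to
    avoid clashes, e.g. 'best_top2-v1.ckpt' = 1st copy of the 2nd best model."""
    name = f"{stem}_top{rank}"
    if version > 0:
        name += f"-v{version}"
    return name + ".ckpt"

_PATTERN = re.compile(r"^(?P<stem>.+)_top(?P<rank>\d+)(?:-v(?P<version>\d+))?\.ckpt$")

def parse_top_k_filename(filename):
    """Inverse of the above: recover (stem, rank, version) from a checkpoint name."""
    m = _PATTERN.match(filename)
    if m is None:
        return None
    return m["stem"], int(m["rank"]), int(m["version"] or 0)

assert parse_top_k_filename(top_k_filename("best", 2, 1)) == ("best", 2, 1)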
Hi @breznak! Thanks for the thoughtful write-up.
Correct me if I'm wrong, you are saying that the filename of the "best" checkpoints does not indicate which is the best in any way. This information is only available during runtime or by looking at the logs.
And you want to be able to know which was the best checkpoint by looking at a directory of checkpoints, without loading them or checking the logs.
Our internal saving mechanism is very simple:
https://github.com/PyTorchLightning/pytorch-lightning/blob/7b4df7bf919acfd7f7b39d780faeadb54aec9ade/pytorch_lightning/callbacks/model_checkpoint.py#L696-L719
Where the -vN suffix is only used to avoid filename clashes, not to indicate any sort of order.

I like the idea.
We could have a fixed-size heap as the data structure to manage this (with heapq.heappushpop and checking the size). If we want to avoid re-naming (up to) all checkpoints on every heap update, we could pass a tuple (monitor_value, current_filepath) as the items of the heap, and on_train_end get the ordered list of values and convert the current filepaths into the expected filepaths that do indicate the top-K order.

Open to more ideas.
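A minimal sketch of that heap idea (not Lightning's actual implementation; it assumes mode="min", so the k smallest monitored values are kept by storing negated scores in Python's min-heap):

import heapq
import os

class TopKTracker:
    """Keeps the k best (lowest) monitored values together with their checkpoint
    paths, and renames the survivors to rank-ordered names at the end of training."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # items are (-monitor_value, filepath); acts as a max-heap on the value

    def update(self, monitor_value, filepath):
        """Return True if this checkpoint made it into the top k."""
        item = (-monitor_value, filepath)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
            return True
        # heappushpop evicts the current worst entry, or the new item itself if it is worse
        evicted = heapq.heappushpop(self._heap, item)
        return evicted != item

    def on_train_end(self, dirpath, stem="best"):
        """Rename the surviving checkpoints so the filenames encode the top-K order."""
        ranked = sorted(self._heap, reverse=True)  # best (smallest monitored value) first
        for rank, (_neg_value, old_path) in enumerate(ranked, start=1):
            new_path = os.path.join(dirpath, f"{stem}_top{rank}.ckpt")
            if old_path != new_path:
                os.replace(old_path, new_path)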
Some more thoughts:
<filename>.ckpt, <filename>_2nd.ckpt, …?