pytorch-lightning: ModelCheckpoint filename unable to use metrics that contain a slash

๐Ÿ› Bug

ModelCheckpoint cannot build checkpoint filenames from a metric whose name contains a slash. I use grouped metrics for TensorBoard and would like my checkpoint filenames to include my loss, val/loss. However, ModelCheckpoint uses os.path.split, which splits the filename template at the slash inside the metric name: https://github.com/PyTorchLightning/pytorch-lightning/blob/6ac0958166c66ed599c96737b587232b7a33d89e/pytorch_lightning/callbacks/model_checkpoint.py#L258

If I try to use

ModelCheckpoint("root/dir/{epoch}_{val/loss:.5f}")

The above evaluates to

self.dirpath = "root/dir/{epoch}_{val" 
self.filename = "loss:.5f}"

This inevitably causes failure when attempting to format the output path.
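For reference, the split can be reproduced with the standard library alone, without Lightning. This is only a minimal illustration of what os.path.split does with the template string from above:

```python
import os

template = "root/dir/{epoch}_{val/loss:.5f}"

# os.path.split cuts at the last slash, even though that slash
# sits inside a format field rather than between path components.
dirpath, filename = os.path.split(template)

print(dirpath)   # root/dir/{epoch}_{val
print(filename)  # loss:.5f}
```

Neither half is a valid format string on its own, since each contains an unbalanced brace.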

To Reproduce

As above, log a metric whose name contains a slash, then reference it in the ModelCheckpoint filename.

Code sample

class Module(pl.LightningModule):
    ...
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = self.loss_fn(logits, y)
        self.log('val/loss', loss, on_epoch=True)
        return loss

...
def main():
    trainer = pl.Trainer(checkpoint_callback=ModelCheckpoint("{epoch}_{val/loss:.5f}"))

Expected behavior

Split only along file path boundaries, ignoring slashes inside format fields that have yet to be formatted. Per the previous example, we'd expect:

self.dirpath = "root/dir" 
self.filename = "{epoch}_{val/loss:.5f}"
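One way to get this behavior would be to ignore any slash that falls inside a {...} field when looking for the split point. The helper below is a hypothetical sketch (split_outside_braces is not part of Lightning, and this is not how the library was fixed); it assumes forward slashes as separators and balanced braces:

```python
from typing import Tuple


def split_outside_braces(path: str) -> Tuple[str, str]:
    """Split at the last '/' that is not inside a {...} format field."""
    depth = 0      # current brace nesting level
    last_sep = -1  # index of the last separator seen outside braces
    for i, ch in enumerate(path):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        elif ch == "/" and depth == 0:
            last_sep = i
    if last_sep == -1:
        # No directory part at all: everything is the filename template.
        return "", path
    return path[:last_sep], path[last_sep + 1:]


dirpath, filename = split_outside_braces("root/dir/{epoch}_{val/loss:.5f}")
print(dirpath)   # root/dir
print(filename)  # {epoch}_{val/loss:.5f}
```

This only addresses parsing of the template. The formatted name would still contain a slash unless the metric name is sanitized when it is inserted.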

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.1
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 0.10.0
    • tqdm: 4.50.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.5
    • version: #1 SMP Fri Sep 4 14:19:36 UTC 2020

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 36 (27 by maintainers)

Most upvoted comments

I can confirm this bug. I also use TensorBoard for logging and therefore have a self.log('val/accuracy', val_acc) at the end of my validation_epoch_end. I use these parameters for ModelCheckpoint:

save_top_k: 3
monitor: val/accuracy
dirpath: saved_models/
filename: '{epoch}_{val/accuracy:.4f}'

With these, a directory called epoch=0_val is created, with a checkpoint inside it named accuracy=0.0000.ckpt. I would like the checkpoint to be named epoch=0_val_accuracy=0.0000.ckpt and to be placed inside the specified dirpath. How can I solve this? I am using Lightning 1.0.5.

So what is the solution to this issue at the moment? I am encountering the same problem that @mees described, i.e. it creates a folder if I use syntax like {valid/loss}.

See #6277

I think there are two workable solutions:

  1. Replace slashes with a safe character like an underscore or hyphen, as ozen suggested. Raising a warning is an option, but I'd find it annoying.
  2. Remove or make the metric name insertion optional. Give the developer precise control over exactly how their checkpoint is named.

The latter could be as simple as (at this line):

if auto_insert_metric_name:
    filename = filename.replace(group, name + "={" + name)

@its-dron exactly. Making it configurable could be ugly though. Maybe the slashes can be automatically converted to something like underscores.
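A rough sketch of the underscore conversion, for illustration only. The function name format_checkpoint_name and its parameters are made up here, and the regex is a simplification rather than the callback's actual code. The idea is to sanitize only the inserted "name=" label while keeping the original key for the metric lookup:

```python
import re
from typing import Dict


def format_checkpoint_name(template: str, metrics: Dict[str, float], safe: str = "_") -> str:
    """Prefix each field with 'name=', replacing slashes in the label only."""

    def add_label(match):
        name = match.group(1)
        # The label gets a slash-free name; the field keeps the real metric key.
        return name.replace("/", safe) + "={" + name

    # Field names may contain word characters and slashes.
    labelled = re.sub(r"\{([\w/]+)", add_label, template)
    return labelled.format(**metrics)


name = format_checkpoint_name("{epoch}_{val/loss:.5f}", {"epoch": 0, "val/loss": 0.25})
print(name)  # epoch=0_val_loss=0.25000
```

str.format accepts "val/loss" as a keyword name here because the field name contains no '.', '[', ':' or '!' characters, so the slash only needs to be removed from the visible label.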

#4213 doesn't fix this. Although the callback can now parse the metric names containing slashes, slashes in the resulting file name create directories during saving for obvious reasons. Since file names cannot contain slashes, the way the callback formats the file name must be changed.