pytorch-lightning: ModelCheckpoint filename unable to use metrics that contain a slash
🐛 Bug
ModelCheckpoint cannot save checkpoints whose filename template references a metric with a slash in its name. I use grouped metrics for TensorBoard and would like my checkpoint filenames to include my loss, val/loss. However, ModelCheckpoint uses os.path.split, which splits the filename template at the slash inside the metric name: https://github.com/PyTorchLightning/pytorch-lightning/blob/6ac0958166c66ed599c96737b587232b7a33d89e/pytorch_lightning/callbacks/model_checkpoint.py#L258
If I try to use
ModelCheckpoint("root/dir/{epoch}_{val/loss:.5f}")
The above evaluates to
self.dirpath = "root/dir/{epoch}_{val"
self.filename = "loss:.5f}"
This inevitably causes failure when attempting to format the output path.
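The split can be reproduced with the standard library alone. This is a minimal sketch, not the callback's actual code path: plain `str.format` stands in for the callback's own formatting, and the metric values are made up for illustration.

```python
import os

template = "root/dir/{epoch}_{val/loss:.5f}"

# os.path.split cuts at the last slash, which here sits inside the
# "{val/loss:.5f}" format field rather than at a real path boundary.
dirpath, filename = os.path.split(template)
print(dirpath)   # root/dir/{epoch}_{val
print(filename)  # loss:.5f}

# The leftover filename is no longer a valid format string.
# (str.format is only a stand-in for the callback's formatting.)
format_error = None
try:
    filename.format(**{"epoch": 0, "val/loss": 0.1})
except ValueError as err:
    format_error = err
print(format_error)
```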
To Reproduce
As above, log a metric with a slash in its name, then reference it in the ModelCheckpoint filename.
Code sample
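import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

class Module(pl.LightningModule):
    ...
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = self.loss_fn(logits, y)
        # Grouped metric name; the slash is what trips up ModelCheckpoint
        self.log('val/loss', loss, on_epoch=True)
        return loss
    ...

def main():
    trainer = pl.Trainer(checkpoint_callback=ModelCheckpoint("{epoch}_{val/loss:.5f}"))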
Expected behavior
Split only along file path boundaries, ignoring variable names yet-to-be-formatted. Per the previous example, we'd expect:
self.dirpath = "root/dir"
self.filename = "{epoch}_{val/loss:.5f}"
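One way to get this behavior is to track brace depth and split only at a slash that lies outside any `{...}` field. The sketch below is an illustration, not Lightning's implementation: `split_outside_braces` is a hypothetical helper, and it assumes `/` is the only path separator and that braces are not escaped.

```python
def split_outside_braces(filepath):
    """Split at the last '/' that is not inside a {...} format field.

    Hypothetical helper for illustration; assumes '/' separators only
    and no escaped braces ("{{" / "}}").
    """
    depth = 0
    last_sep = -1
    for i, ch in enumerate(filepath):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        elif ch == "/" and depth == 0:
            # Only slashes outside format fields count as path boundaries.
            last_sep = i
    if last_sep == -1:
        return "", filepath
    return filepath[:last_sep], filepath[last_sep + 1:]


dirpath, filename = split_outside_braces("root/dir/{epoch}_{val/loss:.5f}")
print(dirpath)   # root/dir
print(filename)  # {epoch}_{val/loss:.5f}
```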
Environment
- CUDA:
  - GPU:
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.1
  - pyTorch_debug: False
  - pyTorch_version: 1.6.0
  - pytorch-lightning: 0.10.0
  - tqdm: 4.50.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.5
  - version: #1 SMP Fri Sep 4 14:19:36 UTC 2020
About this issue
- State: closed
- Created 4 years ago
- Comments: 36 (27 by maintainers)
I can confirm this bug. I also use TensorBoard for logging, so I have a self.log('val/accuracy', val_acc) at the end of my validation_epoch_end. With the parameters I pass to ModelCheckpoint, a directory called epoch=0_val is created, with a checkpoint inside named accuracy=0.0000.ckpt. I would like the checkpoint to be named epoch=0_val_accuracy=0.0000.ckpt and to be placed inside the specified dirpath. How can I solve this? I am using Lightning 1.0.5.
See #6277
I think there are two workable solutions:
The latter could be as simple as (at this line):
@its-dron exactly. Making it configurable could be ugly though. Maybe the slashes can be automatically converted to something like underscores.
#4213 doesn't fix this. Although the callback can now parse the metric names containing slashes, slashes in the resulting file name create directories during saving for obvious reasons. Since file names cannot contain slashes, the way the callback formats the file name must be changed.
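The underscore idea suggested earlier in the thread could look roughly like the following. This is a sketch under stated assumptions, not the callback's actual formatting code: `format_checkpoint_name` here is a hypothetical standalone function, the `name=value` output style is copied from the file names reported above, and the `sep` parameter is an assumption.

```python
import re

# Matches "{name}" or "{name:spec}"; the name may contain slashes.
_FIELD = re.compile(r"\{([^{}:]+)(?::([^{}]*))?\}")


def format_checkpoint_name(template, metrics, sep="_"):
    """Render a filename template, replacing '/' in metric names with `sep`.

    Hypothetical helper for illustration only.
    """
    def render(match):
        name = match.group(1)
        spec = match.group(2) or ""
        value = format(metrics[name], spec)
        # Sanitize only the label written to disk; the metrics lookup
        # still uses the original name, slash included.
        return f"{name.replace('/', sep)}={value}"

    return _FIELD.sub(render, template)


name = format_checkpoint_name(
    "{epoch}_{val/accuracy:.4f}", {"epoch": 0, "val/accuracy": 0.0}
)
print(name)  # epoch=0_val_accuracy=0.0000
```

With the slash gone from the rendered name, joining it onto dirpath no longer introduces an extra directory level.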