pytorch-lightning: Bug and question about logging -- missing epoch, validation before train?

šŸ› Bug

First, the clear bug: in TrainerLoggingMixin.log_metrics(), the epoch is added to the metrics variable (line 70), which is never accessed again afterwards. Shouldn’t it be added to scalar_metrics instead?
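To make the suspected bug concrete, here is a standalone sketch (the function bodies below are paraphrased from the issue’s description, not copied from the actual Lightning source):

```python
# Sketch of the suspected bug: the epoch is written into `metrics`,
# but only `scalar_metrics` is forwarded to the logger, so the epoch
# never arrives at the logger's log_metrics().

def log_metrics_buggy(metrics, epoch):
    scalar_metrics = {k: float(v) for k, v in metrics.items()}
    metrics['epoch'] = epoch          # written to the wrong dict ...
    return scalar_metrics             # ... which is never read again

def log_metrics_fixed(metrics, epoch):
    scalar_metrics = {k: float(v) for k, v in metrics.items()}
    scalar_metrics['epoch'] = epoch   # attach the epoch to what is logged
    return scalar_metrics

print('epoch' in log_metrics_buggy({'loss': 0.5}, epoch=3))  # False
print('epoch' in log_metrics_fixed({'loss': 0.5}, epoch=3))  # True
```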

Second, a question: I implemented a very primitive logger (printing to stdout) and logged to it. The training results for an epoch only appear after that epoch’s validation step, and consequently the training metrics from the last epoch never appear at all. See the code and sample output below. Does this make sense?

To Reproduce

Add the following code to a Lightning Module and run a trainer with the following logger:

import torch
from pytorch_lightning import Trainer

use_gpu = int(torch.cuda.is_available())
print_logger = PrintLogger()
trainer = Trainer(gpus=use_gpu, max_epochs=5, logger=print_logger)
trainer.fit(model)

Code sample

Minimal logging in the LightningModule:

    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['acc'] for x in outputs]).mean()
        logs = dict(train_loss=avg_loss, train_acc=avg_acc)
        return dict(log=logs)

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['acc'] for x in outputs]).mean()
        logs = dict(val_loss=avg_loss, val_acc=avg_acc)
        return dict(log=logs)

A minimal logger:

from pytorch_lightning.loggers import LightningLoggerBase, rank_zero_only

class PrintLogger(LightningLoggerBase):
    
    def __init__(self):
        super(PrintLogger, self).__init__()
    
    @property
    def name(self):
        return 'Test'
    
    @property
    def experiment(self):
        # `name` is a property, so it must be accessed, not called
        return self.name
    
    @property
    def version(self):
        return '0.0.1'
    
    @rank_zero_only
    def log_hyperparams(self, params):
        print(f'Hyperparameters:\n{params}')

    @rank_zero_only
    def log_metrics(self, metrics, step):
        if metrics:  # skip None or empty metric dicts
            print(f'{step}: {metrics}')

    def save(self):
        # Optional. Any code necessary to save logger data goes here
        pass

    @rank_zero_only
    def finalize(self, status):
        # Optional. Any code that needs to be run after training
        # finishes goes here
        pass

Expected behavior

I would expect to see the training output for each epoch followed by the validation output for that epoch, for each of the five epochs. Instead, I see four train outputs and five validation ones, with each validation output appearing before the corresponding train output:

Observed behavior:

63: {'val_loss': 0.6922042369842529, 'val_acc': 0.51458740234375}
64: {'train_acc': 0.503265380859375, 'train_loss': 1.0884952545166016}
127: {'val_loss': 0.6919643878936768, 'val_acc': 0.51861572265625}
128: {'train_acc': 0.51318359375, 'train_loss': 0.6927268505096436}
191: {'val_loss': 0.6915570497512817, 'val_acc': 0.526611328125}
192: {'train_acc': 0.5161285400390625, 'train_loss': 0.6924755573272705}
255: {'val_loss': 0.6915992498397827, 'val_acc': 0.52325439453125}
256: {'train_acc': 0.5159149169921875, 'train_loss': 0.6921626329421997}
319: {'val_loss': 0.6915264129638672, 'val_acc': 0.521240234375}

Expected behavior:

Where n is the number of steps/batches per epoch:

n-1: {'train_acc': ..., 'train_loss': ...}
n-1: {'val_loss': ..., 'val_acc': ...}
2n-1: {'train_acc': ..., 'train_loss': ...}
2n-1: {'val_loss': ..., 'val_acc': ...}
3n-1: {'train_acc': ..., 'train_loss': ...}
3n-1: {'val_loss': ..., 'val_acc': ...}
...

Environment

cuda:
	GPU:
	available:           False
	version:             10.0.130
packages:
	numpy:               1.18.2
	pyTorch_debug:       False
	pyTorch_version:     1.3.1
	pytorch-lightning:   0.7.3
	tensorboard:         2.2.0
	tqdm:                4.45.0
system:
	OS:                  Linux
	architecture:        64bit
	processor:           x86_64
	python:              3.7.4
	version:             #1 SMP Tue Feb 4 23:02:59 UTC 2020

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

@awaelchli, thank you for looking into it again! I agree this does look better. It’s been a while since I dug through this code, but I have a hunch about at least one of these issues. Note that in the second issue you point out, we’re missing the last set of train metrics, which should arrive together with the validation metrics.

Reading through LightningLoggerBase, its API entry point seems to be agg_and_log_metrics, which is called from the TrainerLoggingMixin. agg_and_log_metrics calls _aggregate_metrics, which only emits metrics to the logger if the current step differs from the previous step. Since this function is never called again after the last validation epoch, it never sees a new step, and therefore never emits the last output. It looks like a call to any of save, finalize, or close should result in a call to finalize_agg_metrics, which should flush the buffer. So either it’s not getting called, or for some reason it doesn’t do what it should.
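The buffering behaviour described above can be sketched in isolation (this is a simplified stand-in, not the actual Lightning implementation; the class and attribute names are made up for illustration):

```python
# Minimal sketch of step-based metric aggregation: metrics for the
# current step are buffered and only flushed when a *later* step
# arrives. Without a final flush (finalize_agg_metrics), the last
# buffered metrics are never logged.

class BufferingLogger:
    def __init__(self):
        self.logged = []      # what actually reaches log_metrics()
        self._step = None
        self._buffer = {}

    def agg_and_log_metrics(self, metrics, step):
        if self._step is not None and step != self._step:
            # a new step arrived: flush the previous step's metrics
            self.logged.append((self._step, self._buffer))
        if step != self._step:
            self._step, self._buffer = step, {}
        self._buffer.update(metrics)

    def finalize_agg_metrics(self):
        # flush whatever is still buffered at the end of training
        if self._buffer:
            self.logged.append((self._step, self._buffer))
            self._buffer, self._step = {}, None

logger = BufferingLogger()
logger.agg_and_log_metrics({'val_loss': 0.69}, step=63)
logger.agg_and_log_metrics({'train_loss': 1.08}, step=64)
print(logger.logged)          # only step 63 has been emitted so far
logger.finalize_agg_metrics() # the flush that appears to be missing
print(logger.logged)          # now step 64 appears too
```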

The float epoch thing is probably a smaller bug. I’ll try to debug both of these later today if I have time.