pytorch-lightning: Step Inconsistencies

šŸ› Bug

The ā€œstepā€ definition is super unclear and inconsistent:

  • global_step: increments once per optimizer step, so it grows by the number of optimizers per batch. E.g. for a GAN with two optimizers it is actually 2 * number_of_batches.
  • step as used by the learning rate scheduler when interval="step": the number of training batches processed.
  • step as used in logging: _batches_that_stepped, which I honestly don't understand. This causes issues when restoring and logging to e.g. wandb: the metrics are logged from step 0 instead of resuming where they left off. I need to call self.log("step", self.global_step) to fix wandb logging after resume (see the sketch below).
  • step as in the Trainer's max_steps, for which global_step seems to be used.

This is very convoluted to me: why can't "step" always simply be the number of dataset iterations?
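For context, this is the workaround I currently use to fix the wandb x-axis after resuming. It is a minimal sketch, not an official recipe: the model and loss are placeholders, and it relies on the behavior described above where a logged metric literally named "step" is used by Lightning as the logger step.

```python
import pytorch_lightning as pl
import torch


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()  # placeholder loss for the sketch
        self.log("train/loss", loss)
        # Workaround: log a metric named "step" so that after resuming from a
        # checkpoint wandb continues from global_step instead of restarting at 0.
        self.log("step", float(self.global_step))
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```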

Also, when resuming training I get negative values (e.g. negative iteration speed); this is also reproduced in the Colab linked below.

To Reproduce

https://colab.research.google.com/drive/1PkMF3rOZrPU8r2BqQplfb08U8lV17Y45#scrollTo=AlOOcWzT1yAu

Notice the inconsistent steps during the first training run, and then the completely broken steps in the resume_from_checkpoint run: negative iteration speed and an incorrect _batches_that_stepped that is not restored correctly.


Expected behavior

Steps are consistent and restored properly (currently _batches_that_stepped, as used with wandb, is not). Validation steps and multiple optimizers complicate the definition of a step, but whatever definition you settle on should be applied consistently. The negative iteration speed and ETA after resume_from_checkpoint are fixed.

Thanks!

Environment

  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.23.0
    • pyTorch_debug: False
    • pyTorch_version: 1.11.0+cu113
    • pytorch-lightning: 1.6.4
    • tqdm: 4.64.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.10
    • version: #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022

cc @tchaton @justusschock @awaelchli @borda @carmocca @rohitgr7

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 6
  • Comments: 18 (11 by maintainers)

Most upvoted comments

Another complication is how this recent change interacts with max_steps, num_sanity_val_steps, etc.: does "step" refer to an optimization step or a batch step? I would really love for these two definitions to be disentangled for clarity, e.g. by calling the commonly understood batch step a "step" and using something like "total_optimizer_step" for the alternative definition. Currently it is very unclear which one is being used; it is like multiplying the epoch number by the number of optimizers for some reason.

I can see some reasons why the global_step definition was changed, but as a user I know it will be very unclear to anybody encountering it. Suddenly, when switching to a GAN or multiple optimizers, the behavior of global_step changes and people will need to debug why, because it is quite unintuitive.
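To make the multi-optimizer behavior concrete, here is a minimal GAN-style sketch (the networks and losses are placeholders, not a real GAN). With automatic optimization and two optimizers, training_step runs once per optimizer for every batch and each run ends in an optimizer.step(), so trainer.global_step advances by 2 per batch while batch_idx advances by 1.

```python
import pytorch_lightning as pl
import torch
from torch import nn


class LitGAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Placeholder networks; a real GAN would define proper models here.
        self.generator = nn.Linear(16, 16)
        self.discriminator = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx, optimizer_idx):
        # Called twice per batch (optimizer_idx 0, then 1), so global_step
        # counts optimizer steps: roughly 2 * number_of_batches.
        if optimizer_idx == 0:
            return self.discriminator(self.generator(batch)).mean()  # placeholder generator loss
        return self.discriminator(batch).mean()  # placeholder discriminator loss

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return [opt_g, opt_d]
```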

Since all of the design differences have been acknowledged already and changing them would require annoying breaking changes, I’ll go ahead and close this.

@yuchenlichuck I didn’t understand your issue, but note that optimizer.step() needs to be called for the global_step to increase. Otherwise you’ll run into https://github.com/Lightning-AI/lightning/issues/16143
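A minimal sketch of the manual-optimization case being referred to, assuming a trivial placeholder model: with automatic_optimization = False, global_step only advances when opt.step() is actually called.

```python
import pytorch_lightning as pl
import torch


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)
        # Manual optimization: Lightning no longer calls optimizer.step() for us.
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch).mean()  # placeholder loss
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()  # without this call, trainer.global_step never increases

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```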

Thank you, total_batch_idx is what I will use instead of global_step then. Does it account for gradient accumulation as well? What is also quite confusing is the change in semantics of global_step since 1.6; IMO this should not have happened and the "fault tolerance" could have been achieved otherwise, but ok. There is still trainer/global_step on the wandb X axis, for example, which adds to the confusion; it should have been trainer/total_batch_idx.


https://github.com/Lightning-AI/lightning/blob/9f51c07604f52d2e7601471b2550f64dad43aaa4/src/pytorch_lightning/loggers/wandb.py#L365

I guess there is still some cleanup to do after the 1.6 global_step definition change.
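For anyone else landing here, this is roughly how I read the batch counter instead of global_step from a callback. The attribute path trainer.fit_loop.epoch_loop.total_batch_idx is an assumption based on the 1.6-era loop structure and may differ between versions.

```python
import pytorch_lightning as pl


class BatchCounterCallback(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Assumed attribute path for the number of batches processed so far;
        # unlike global_step, it does not scale with the number of optimizers.
        total_batches = trainer.fit_loop.epoch_loop.total_batch_idx
        pl_module.log("total_batch_idx", float(total_batches))
```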

IMO this is quite unintuitive; global_step should not differ from the step number being logged. Is the motivation for this documented anywhere?

If I wanted the total number of optimization steps, shouldn't that be under a total_optimization_steps attribute?