pytorch-lightning: Step Inconsistencies
🐛 Bug
The "step" definition is super unclear and inconsistent:
- `global_step`: it increments by the number of optimizers, e.g. for a GAN it is actually `2 * number_of_batches`.
- `step` as used in the learning rate scheduler when `interval="step"`: the number of training batches used.
- `step` as used in logging: `_batches_that_stepped`, no idea what this is TBH. This causes issues when restoring and logging to e.g. wandb: the metrics are logged from step 0 rather than resumed. I need to call `self.log("step", self.global_step)` to fix wandb logging after resume (see the sketch below).
- `step` as in `max_steps` in the trainer; for this, `global_step` seems to be used.
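For context, a minimal sketch of the workaround mentioned above (the module, data shape, and loss are made up for illustration):

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()
        # Workaround: log "step" explicitly so that wandb's x-axis follows
        # trainer.global_step instead of restarting from 0 after a resume.
        self.log("step", float(self.trainer.global_step))
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```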
This is super convoluted to me; why can't "step" always simply be the number of dataset iterations?
Also, when restoring the training I get negative values; this is also reproduced in Colab:

To Reproduce
https://colab.research.google.com/drive/1PkMF3rOZrPU8r2BqQplfb08U8lV17Y45#scrollTo=AlOOcWzT1yAu
Notice the inconsistent steps during the first training run.
Then the steps are completely messed up in the `resume_from_checkpoint` run: negative iteration speed and an incorrect `_batches_that_stepped` that is not restored correctly.
Expected behavior
Steps are consistent and restored properly (`_batches_that_stepped`, which wandb uses, is not). The validation step and multiple optimizers complicate the definition of a step, but whatever definition of step you come up with should be consistent.
The negative iteration speed and ETA after `resume_from_checkpoint` are fixed.
Thanks!
Environment
- CUDA:
- GPU:
- NVIDIA GeForce RTX 3090
- NVIDIA GeForce RTX 3090
- NVIDIA GeForce RTX 3090
- NVIDIA GeForce RTX 3090
- available: True
- version: 11.3
- Packages:
- numpy: 1.23.0
- pyTorch_debug: False
- pyTorch_version: 1.11.0+cu113
- pytorch-lightning: 1.6.4
- tqdm: 4.64.0
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.10
- version: #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022
cc @tchaton @justusschock @awaelchli @borda @carmocca @rohitgr7
Another complication is how this recent change interacts with `max_steps`, `num_sanity_val_steps`, etc.: is "step" referring to the optimization step or a batch step? I would really love it if we disentangled these two definitions for clarity, e.g. by calling the commonly understood batch step a "step" and using something like "total_optimizer_step" for the alternative definition. Currently it is super unclear which one is being used. It is like multiplying the epoch number by the number of optimizers for some reason…
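To illustrate the two notions, here is a hypothetical two-optimizer module (the names are made up, and the counter comments reflect the behavior described in this thread rather than a verified specification):

```python
import torch
import pytorch_lightning as pl


class TwoOptimizerModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.gen = torch.nn.Linear(8, 8)
        self.disc = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx, optimizer_idx):
        # With two optimizers this hook runs twice per batch, so (per this
        # thread) after N batches trainer.global_step is roughly 2 * N,
        # while a per-batch counter such as total_batch_idx stays at N.
        if optimizer_idx == 0:
            return self.gen(batch).mean()
        return self.disc(batch).mean()

    def configure_optimizers(self):
        return (
            torch.optim.SGD(self.gen.parameters(), lr=0.01),
            torch.optim.SGD(self.disc.parameters(), lr=0.01),
        )
```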
I can see some reasons why the `global_step` definition was changed, but as a user I know it will be super unclear to anybody encountering it. Suddenly, when switching to a GAN or multiple optimizers, the behavior of `global_step` will change and people will need to debug why, because it is super unintuitive.

Since all of the design differences have been acknowledged already and changing them would require annoying breaking changes, I'll go ahead and close this.
@yuchenlichuck I didn't understand your issue, but note that `optimizer.step()` needs to be called for the `global_step` to increase. Otherwise you'll run into https://github.com/Lightning-AI/lightning/issues/16143
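For illustration, a minimal manual-optimization sketch (module and names made up) of the point above: `global_step` only advances when `optimizer.step()` is called.

```python
import torch
import pytorch_lightning as pl


class ManualOptModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # manual optimization
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch).mean()
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()  # without this call, trainer.global_step does not advance
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```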
Thank you, `total_batch_idx` is what I will be using then instead of `global_step`; does it account for gradient accumulation as well? The thing that is also super confusing is the change of semantics of `global_step` since 1.6. IMO this should not have happened and the "fault tolerance" could have been achieved otherwise, but ok. There is still `trainer/global_step`, for example on the wandb x-axis, which adds to the confusion; it should have been `trainer/total_batch_idx`:
https://github.com/Lightning-AI/lightning/blob/9f51c07604f52d2e7601471b2550f64dad43aaa4/src/pytorch_lightning/loggers/wandb.py#L365
I guess there is still some cleanup to do after the 1.6 `global_step` definition change.

IMO this is super unintuitive; `global_step` should not be different from the step number being logged. Can I read somewhere about the motivation for that?
If I wanted to get the total number of optimization steps, shouldn't that be under a `total_optimization_steps` attribute?