pytorch-lightning: Step Inconsistencies

šŸ› Bug

The ā€œstepā€ definition is super unclear and inconsistent:

  • global_step: increments once per optimizer step, so it grows by the number of optimizers per batch. E.g. for a GAN with two optimizers it is actually 2 * number_of_batches.
  • step as used by the learning rate scheduler when interval="step": the number of training batches processed.
  • step as used in logging: _batches_that_stepped, which I honestly don't understand. This causes issues when restoring and logging to e.g. wandb: the metrics are logged from step 0 instead of resuming where they left off. I need to call self.log("step", self.global_step) to fix wandb logging after resume (see the sketch below).
  • step as in the Trainer's max_steps, for which global_step seems to be used.

This is very convoluted to me: why can't "step" always simply be the number of dataset iterations?
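For context, this is the workaround I currently use to fix the wandb x-axis after resuming. It is a minimal sketch, not an official recipe: the model and loss are placeholders, and it relies on the behavior described above where a logged metric literally named "step" is used by Lightning as the logger step.

```python
import pytorch_lightning as pl
import torch


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()  # placeholder loss for the sketch
        self.log("train/loss", loss)
        # Workaround: log a metric named "step" so that after resuming from a
        # checkpoint wandb continues from global_step instead of restarting at 0.
        self.log("step", float(self.global_step))
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```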

Also, when resuming training I get negative values (e.g. negative iteration speed); this is also reproduced in the Colab linked below.

To Reproduce

https://colab.research.google.com/drive/1PkMF3rOZrPU8r2BqQplfb08U8lV17Y45#scrollTo=AlOOcWzT1yAu

Notice the inconsistent steps during the first training run, and then the completely broken steps in the resume_from_checkpoint run: negative iteration speed and an incorrect _batches_that_stepped that is not restored correctly.


Expected behavior

Steps are consistent and restored properly (currently _batches_that_stepped, as used with wandb, is not). Validation steps and multiple optimizers complicate the definition of a step, but whatever definition you settle on should be applied consistently. The negative iteration speed and ETA after resume_from_checkpoint are fixed.

Thanks!

Environment

  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
      • NVIDIA GeForce RTX 3090
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.23.0
    • pyTorch_debug: False
    • pyTorch_version: 1.11.0+cu113
    • pytorch-lightning: 1.6.4
    • tqdm: 4.64.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.10
    • version: #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022

cc @tchaton @justusschock @awaelchli @borda @carmocca @rohitgr7

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 6
  • Comments: 18 (11 by maintainers)

Most upvoted comments

Another complication is how this recent change interacts with max_steps, num_sanity_val_steps, etc.: does "step" refer to an optimization step or a batch step? I would really love for these two definitions to be disentangled for clarity, e.g. by calling the commonly understood batch step a "step" and using something like "total_optimizer_step" for the alternative definition. Currently it is very unclear which one is being used; it is like multiplying the epoch number by the number of optimizers for some reason.

I can see some reasons why the global_step definition was changed, but as a user I know it will be very unclear to anybody encountering it. Suddenly, when switching to a GAN or multiple optimizers, the behavior of global_step changes and people will need to debug why, because it is quite unintuitive.
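To make the multi-optimizer behavior concrete, here is a minimal GAN-style sketch (the networks and losses are placeholders, not a real GAN). With automatic optimization and two optimizers, training_step runs once per optimizer for every batch and each run ends in an optimizer.step(), so trainer.global_step advances by 2 per batch while batch_idx advances by 1.

```python
import pytorch_lightning as pl
import torch
from torch import nn


class LitGAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Placeholder networks; a real GAN would define proper models here.
        self.generator = nn.Linear(16, 16)
        self.discriminator = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx, optimizer_idx):
        # Called twice per batch (optimizer_idx 0, then 1), so global_step
        # counts optimizer steps: roughly 2 * number_of_batches.
        if optimizer_idx == 0:
            return self.discriminator(self.generator(batch)).mean()  # placeholder generator loss
        return self.discriminator(batch).mean()  # placeholder discriminator loss

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return [opt_g, opt_d]
```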

Since all of the design differences have been acknowledged already and changing them would require annoying breaking changes, I’ll go ahead and close this.

@yuchenlichuck I didn’t understand your issue, but note that optimizer.step() needs to be called for the global_step to increase. Otherwise you’ll run into https://github.com/Lightning-AI/lightning/issues/16143
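A minimal sketch of the manual-optimization case being referred to, assuming a trivial placeholder model: with automatic_optimization = False, global_step only advances when opt.step() is actually called.

```python
import pytorch_lightning as pl
import torch


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)
        # Manual optimization: Lightning no longer calls optimizer.step() for us.
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch).mean()  # placeholder loss
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()  # without this call, trainer.global_step never increases

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```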

Thank you, total_batch_idx is what I will use instead of global_step then. Does it account for gradient accumulation as well? What is also quite confusing is the change in semantics of global_step since 1.6; IMO this should not have happened and the "fault tolerance" could have been achieved otherwise, but ok. There is still trainer/global_step on the wandb X axis, for example, which adds to the confusion; it should have been trainer/total_batch_idx.


https://github.com/Lightning-AI/lightning/blob/9f51c07604f52d2e7601471b2550f64dad43aaa4/src/pytorch_lightning/loggers/wandb.py#L365

I guess there is still some cleanup to do after the 1.6 global_step definition change.
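For anyone else landing here, this is roughly how I read the batch counter instead of global_step from a callback. The attribute path trainer.fit_loop.epoch_loop.total_batch_idx is an assumption based on the 1.6-era loop structure and may differ between versions.

```python
import pytorch_lightning as pl


class BatchCounterCallback(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Assumed attribute path for the number of batches processed so far;
        # unlike global_step, it does not scale with the number of optimizers.
        total_batches = trainer.fit_loop.epoch_loop.total_batch_idx
        pl_module.log("total_batch_idx", float(total_batches))
```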

IMO this is quite unintuitive; global_step should not differ from the step number being logged. Is the motivation for this documented anywhere?

If I wanted the total number of optimization steps, shouldn't that be under a total_optimization_steps attribute?