pytorch-lightning: Error when disabling an optimizer with native AMP turned on

๐Ÿ› Bug

When running my Lightning code with:

  • fp16 native AMP
  • Multiple optimizers
  • One of the optimizers disabled (in this case by returning None for it in training_step)

I'm getting the following stack trace:

Traceback (most recent call last):
  File "./train_stage1.py", line 353, in <module>
    trainer.fit(model)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 68, in train_or_test
    results = self.trainer.train()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in train
    self.train_loop.run_training_epoch()
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 544, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 713, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 453, in optimizer_step
    optimizer, batch_idx, opt_idx, train_step_and_backward_closure
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 122, in optimizer_step
    using_lbfgs=is_lbfgs
  File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1209, in optimizer_step
    self.trainer.scaler.step(optimizer)
  File "/home/wj359634/venv/lib64/python3.6/site-packages/torch/cuda/amp/grad_scaler.py", line 318, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1086, in __del__
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1293, in close
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1471, in display
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1089, in __repr__
  File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: 'NoneType' object is not iterable

To Reproduce

Run a Lightning model with (I'm hoping these are all the conditions that have to be met; a minimal sketch follows the list):

  • fp16 native AMP
  • Multiple optimizers
  • One of the optimizers disabled (in this case by returning None for it in training_step)

Expected behavior

The trainer should skip the disabled optimizer's step for that batch instead of raising.

Environment

* CUDA:
        - GPU:
                - Tesla V100-PCIE-32GB
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.18.4
        - pyTorch_debug:     True
        - pyTorch_version:   1.7.0
        - pytorch-lightning: 1.0.4
        - tqdm:              4.46.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.6.8
        - version:           #1 SMP Tue Aug 25 17:23:54 UTC 2020

Additional context

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

@edenlightning It doesn't feel to me like this issue is resolved, so why close it? Is the recommended solution (for now, or in general) to use the manual optimization route when these specific conditions are met? It still isn't clear to me how to train things like GANs with AMP in Lightning now.

@Borda I believe the issue is with

https://github.com/PyTorchLightning/pytorch-lightning/blob/e81707ba0242f12f47d742e86a982f529a7ae65b/pytorch_lightning/core/lightning.py#L1229

being called (and optimizer_step being invoked at all) when training_step returned None: the check that skips a disabled optimizer (https://github.com/PyTorchLightning/pytorch-lightning/blob/e81707ba0242f12f47d742e86a982f529a7ae65b/pytorch_lightning/trainer/training_loop.py#L716) only runs after optimizer_step has already been called.

Please @ me if you find a solution; I'll probably need to hotfix this to resume my research.