pytorch-lightning: Error when disabling an optimizer with native AMP turned on
🐛 Bug
When running my Lightning code with:
- fp16 native AMP
- Multiple optimizers
- One of the optimizers disabled (in this case by returning `None` for it in `training_step`)
I'm getting the following stack trace:
Traceback (most recent call last):
File "./train_stage1.py", line 353, in <module>
trainer.fit(model)
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
results = self.train_or_test()
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 68, in train_or_test
results = self.trainer.train()
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in train
self.train_loop.run_training_epoch()
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 544, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 713, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 453, in optimizer_step
optimizer, batch_idx, opt_idx, train_step_and_backward_closure
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 122, in optimizer_step
using_lbfgs=is_lbfgs
File "/home/wj359634/venv/lib64/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1209, in optimizer_step
self.trainer.scaler.step(optimizer)
File "/home/wj359634/venv/lib64/python3.6/site-packages/torch/cuda/amp/grad_scaler.py", line 318, in step
assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1086, in __del__
File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1293, in close
File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1471, in display
File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1089, in __repr__
File "/home/wj359634/venv/lib64/python3.6/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: 'NoneType' object is not iterable
To Reproduce
(I'm hoping those are all the conditions that need to be met.) Run a Lightning model with:
- fp16 native AMP
- Multiple optimizers
- One of the optimizers disabled (in this case by returning `None` for it in `training_step`); a minimal sketch follows
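For concreteness, here is a minimal sketch that should meet all three conditions (the module, shapes, and hyperparameters are made up for illustration; the API matches the 1.0.x style used in this report):

```python
import pytorch_lightning as pl
import torch

class TwoOptModule(pl.LightningModule):
    """Hypothetical minimal module: two optimizers, the second one disabled."""

    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(4, 4)
        self.b = torch.nn.Linear(4, 4)

    def training_step(self, batch, batch_idx, optimizer_idx):
        if optimizer_idx == 1:
            return None  # disable the second optimizer for this step
        return self.a(batch).sum()

    def configure_optimizers(self):
        return (
            torch.optim.SGD(self.a.parameters(), lr=0.1),
            torch.optim.SGD(self.b.parameters(), lr=0.1),
        )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(torch.randn(32, 4), batch_size=8)

# gpus=1 + precision=16 selects native AMP by default on this version
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
trainer.fit(TwoOptModule())
```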
Expected behavior
The training loop should skip this optimizer instead of crashing.
Environment
* CUDA:
- GPU:
- Tesla V100-PCIE-32GB
- available: True
- version: 10.2
* Packages:
- numpy: 1.18.4
- pyTorch_debug: True
- pyTorch_version: 1.7.0
- pytorch-lightning: 1.0.4
- tqdm: 4.46.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.6.8
- version: #1 SMP Tue Aug 25 17:23:54 UTC 2020
About this issue
- State: closed
- Created 4 years ago
- Comments: 18 (17 by maintainers)
@edenlightning It doesn't feel to me like this issue is resolved; why are you closing it? Is the recommended solution (for now or in general) to use the manual optimization route when those specific conditions are met? It still isn't very clear to me how to train things like GANs with AMP in Lightning now.
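For what it's worth, a rough sketch of what the manual-optimization route looks like for a GAN-style setup. This assumes the newer-style API (`self.automatic_optimization = False`, `self.optimizers()`, `self.manual_backward()`); older releases used `Trainer(automatic_optimization=False)` and passed the optimizer to `manual_backward`, so the exact spelling depends on the version:

```python
import pytorch_lightning as pl
import torch

class ManualGAN(pl.LightningModule):
    """Hypothetical GAN skeleton using manual optimization under AMP."""

    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # take over optimizer stepping
        self.generator = torch.nn.Linear(16, 16)
        self.discriminator = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()

        # Alternate between the two optimizers: an optimizer we skip is
        # simply never stepped, so the scaler never sees it without grads.
        if batch_idx % 2 == 0:
            d_loss = self.discriminator(batch).mean()
            opt_d.zero_grad()
            self.manual_backward(d_loss)  # handles AMP loss scaling
            opt_d.step()
        else:
            g_loss = -self.discriminator(self.generator(batch)).mean()
            opt_g.zero_grad()
            self.manual_backward(g_loss)
            opt_g.step()

    def configure_optimizers(self):
        return (
            torch.optim.Adam(self.generator.parameters(), lr=2e-4),
            torch.optim.Adam(self.discriminator.parameters(), lr=2e-4),
        )
```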
@Borda I believe the issue is with
https://github.com/PyTorchLightning/pytorch-lightning/blob/e81707ba0242f12f47d742e86a982f529a7ae65b/pytorch_lightning/core/lightning.py#L1229
being called (and the optimizer being stepped in general) even when `training_step` returned `None`: the check that skips an optimizer (https://github.com/PyTorchLightning/pytorch-lightning/blob/e81707ba0242f12f47d742e86a982f529a7ae65b/pytorch_lightning/trainer/training_loop.py#L716) only happens after `optimizer_step` has already been called. Please @ me if you find a solution; I probably need to hotfix it to resume my research.
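For reference, the GradScaler side of this failure is easy to reproduce outside Lightning: `scaler.step()` asserts that an inf check was recorded for the optimizer, and that only happens when some of its parameters received gradients under that scaler. A plain-PyTorch sketch (assumes a CUDA device, since GradScaler disables itself without one):

```python
import torch

model = torch.nn.Linear(2, 2).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

# No scaler.scale(loss).backward() ran for this optimizer, so its parameters
# have no gradients and unscaling records no "found_inf_per_device" entries.
scaler.step(opt)  # AssertionError: No inf checks were recorded for this optimizer.
```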