pytorch-lightning: Mixed precision: scheduler and optimizer are called in the wrong order

🐛 Bug

When using mixed-precision training, the scheduler and optimizer are called in the wrong order, and the following warning is generated:

UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1G7pk6E9XUYq-pS41DXKhqM9Srx8sikiP?usp=sharing

There are four tests. Three of them don’t raise the warning:

  1. test_amp_scheduler(precision=16, configure_optimizers=configure_optimizers_1)
  2. test_amp_scheduler(precision=32, configure_optimizers=configure_optimizers_1)
  3. test_amp_scheduler(precision=32, configure_optimizers=configure_optimizers_2)

This test case raises the warning:

  1. test_amp_scheduler(precision=16, configure_optimizers=configure_optimizers_2)

To Reproduce

  1. Create a model with configure_optimizers defined in the following dictionary style:
def configure_optimizers_2(model):
    optimizer = torch.optim.SGD(model.layer.parameters(), lr=0.1)
    scheduler = {
        'scheduler': torch.optim.lr_scheduler.StepLR(optimizer, step_size=1),
        'name': 'learning_rate',
        'interval': 'step',
        'frequency': 1,
    }

    return {"optimizer": optimizer, "lr_scheduler": scheduler}
  2. Enable mixed-precision training by setting precision=16 in the Trainer
  3. Start training (a minimal sketch of these steps follows)
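
A minimal, self-contained sketch of these steps; the model, data, and Trainer arguments below are placeholders standing in for the BoringModel setup from the Colab:

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class MinimalModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        # dictionary-style scheduler with interval="step", as in configure_optimizers_2 above
        return configure_optimizers_2(self)

model = MinimalModel()
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8
)

# precision=16 together with the per-step scheduler triggers the warning on GPU
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
trainer.fit(model, train_loader)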

Note

When the scheduler is defined directly, without the dictionary, the issue does not seem to occur:

def configure_optimizers_1(model):
    optimizer = torch.optim.SGD(model.layer.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
    
    return {"optimizer": optimizer, "lr_scheduler": scheduler}

Expected behavior

No warning

Environment

  • CUDA:
    • GPU:
      • Tesla P100-PCIE-16GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.4
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

cc @tchaton @rohitgr7 @carmocca @justusschock @awaelchli @akihironitta

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 14
  • Comments: 35 (11 by maintainers)

Most upvoted comments

Following and waiting.

Hi @BttMA @aleSuglia, the fix is still WIP in #9923.


This issue only happens when

  • Trainer(precision=16) AND
  • lr_scheduler.step() runs every few steps (not every epoch), i.e.
    def configure_optimizers(self):
        optimizer = ...
        scheduler = {
            "scheduler": ...,
            "interval": "step",
            "frequency": 1,  # other small numbers may also cause this issue.
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler}
    

What’s happening is that scaler.step(optimizer) (which is called when using native AMP) is likely to skip optimizer.step() for the first few iterations, so lr_scheduler.step() ends up being called before any call to optimizer.step().

As a side note, you’ll get the same behaviour in pure PyTorch, too, as reported in “optimizer.step() before lr_scheduler.step() error using GradScaler”.
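
For reference, here is a minimal pure-PyTorch sketch of that behaviour (the model, data, and hyperparameters are placeholders): when GradScaler skips optimizer.step() on an early iteration because of inf/NaN gradients at the initial scale, the unconditional scheduler.step() below produces the same warning.

import torch

model = torch.nn.Linear(32, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
scaler = torch.cuda.amp.GradScaler()
data = [(torch.randn(8, 32), torch.randint(0, 2, (8,))) for _ in range(10)]

for x, y in data:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
    scaler.scale(loss).backward()
    # scaler.step() silently skips optimizer.step() if the gradients contain inf/NaN
    scaler.step(optimizer)
    scaler.update()
    # stepping the scheduler unconditionally is what triggers the warning
    scheduler.step()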

My 2 cents: Users should never get a warning when they aren’t doing anything wrong and/or there is no way for them to do something correctly. Specifically, unless this bug is fixed there is no way to run CyclicLR or OneCycleLR correctly without getting this warning.
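
For context, this is the kind of configuration the comment above refers to; a sketch only, with placeholder max_lr and total_steps values. OneCycleLR has to be stepped once per batch, so interval "step" cannot be avoided here.

def configure_optimizers(self):
    optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
    # OneCycleLR must be stepped every batch, hence interval="step"
    scheduler = {
        "scheduler": torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=0.1, total_steps=1000
        ),
        "interval": "step",
        "frequency": 1,
    }
    return {"optimizer": optimizer, "lr_scheduler": scheduler}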

I’m using pytorch-lightning==1.6.4 but still see the same issue.

Any update?

pytorch==2.1.0

pytorch-lightning==2.1.0

I think we should implement https://github.com/pytorch/pytorch/issues/67590 (PyTorch). Any additions in Lightning would always be workarounds.

I think this issue is with PyTorch rather than PyTorch Lightning.

Same issue.

Same issue here. I saw a workaround in the implementation of sentence-transformers or SBERT.

[...]
# remember the scale before the optimizer step
scale_before_step = scaler.get_scale()
scaler.scale(loss_value).backward()
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_grad_norm)
scaler.step(optimizer)
scaler.update()

# a changed scale indicates the scaler skipped optimizer.step()
skip_scheduler = scaler.get_scale() != scale_before_step

[...]

if not skip_scheduler:
    scheduler.step()
@BttMA I’m sorry for the inconvenience. I’m not sure if there’s a workaround for this issue at the moment… I’ll try to have it resolved ASAP within this week and keep you updated.

Any updates on this?

Same issue with pytorch-lightning==1.4.1.

@javierlorenzod Thanks a lot for your report! Let me look into it.