pytorch-lightning: using the LBFGS optimizer in PyTorch Lightning, the model does not converge, compared to native PyTorch + LBFGS

Common bugs:

Comparing the results of LBFGS + PyTorch Lightning to native PyTorch + LBFGS, PyTorch Lightning is not able to update the weights and the model does not converge. There are some points to note:

  1. Adam + PyTorch Lightning on MNIST works fine; however, LBFGS + PyTorch Lightning does not work as expected.
  2. LBFGS + native PyTorch works very well; however, LBFGS + PyTorch Lightning does not work as expected.

šŸ› Bug

LBFGS + PyTorch Lightning has trouble converging and updating the weights, compared to Adam + PyTorch Lightning.

Code sample

import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision import transforms,datasets
from torch.utils.data import DataLoader,random_split
import pytorch_lightning as pl 
from IPython.display import clear_output

class LightningMNISTClassifier(pl.LightningModule):
  def __init__(self):
    super(LightningMNISTClassifier,self).__init__()
    self.layer_1 = nn.Linear(28 * 28, 128)
    self.layer_2 = nn.Linear(128, 256)
    self.layer_3 = nn.Linear(256, 10)
    
  def forward(self, x):
    batch_size, channels, width, height = x.size()
    x=x.view(batch_size,-1)
    # layer 1
    x = self.layer_1(x)
    x = torch.relu(x)
    # layer 2
    x = self.layer_2(x)
    x = torch.relu(x) 
    # layer 3
    x = self.layer_3(x)
    # probability distribution over labels
    x = torch.log_softmax(x, dim=1)  
    return x 
  def prepare_data(self):
    transform=transforms.Compose([transforms.ToTensor(), 
                                  transforms.Normalize((0.1307,), (0.3081,))])
    # prepare transforms standard to MNIST
    mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
    mnist_test = MNIST(os.getcwd(), train=False, download=True, transform=transform)  
    self.mnist_train, self.mnist_val = random_split(mnist_train, [55000, 5000])

  def train_dataloader(self):
    return DataLoader(self.mnist_train,batch_size=1024)
 
  # def val_dataloader(self):
  #   return DataLoader(self.mnist_val,batch_size=1024)
  # def test_dataloader(self):
  #   return DataLoader(self.mnist_test,batch_size=1024)


  def configure_optimizers(self):
    # optimizer=optim.Adam(self.parameters(),lr=1e-3)
    optimizer = optim.LBFGS(self.parameters(), lr=1e-2)
    return optimizer

  # def backward(self, trainer, loss, optimizer):
  #   loss.backward(retain_graph=True)


  def optimizer_step(self, current_epoch, batch_nb, optimizer, optimizer_idx,
                     second_order_closure, on_tpu=False, using_native_amp=False,
                     using_lbfgs=False):
    # update params; LBFGS needs the closure so it can re-evaluate the loss
    optimizer.step(second_order_closure)

  def cross_entropy_loss(self,logits,labels):
    return F.nll_loss(logits,labels)

  def training_step(self,train_batch,batch_idx):
    x,y=train_batch
    logits=self.forward(x)
    loss=self.cross_entropy_loss(logits,y)
    return  {'loss':loss}

  def training_epoch_end(self,outputs):
    avg_loss=torch.stack([x['loss'] for x in outputs]).mean()
    print('epoch={}, avg_Train_loss={:.2f}'.format(self.current_epoch,avg_loss.item()))
    # return {'avg_train_loss':avg_loss}

  # def validation_step(self,val_batch,batch_idx):
  #   x,y=val_batch
  #   logits=self.forward(x)
  #   loss=self.cross_entropy_loss(logits,y)
  #   return {'val_loss':loss}
  # def validation_epoch_end(self,outputs):
  #   avg_loss=torch.stack([x['val_loss'] for x in outputs]).mean()
  #   print('epoch={}, avg_Test_loss={:.2f}'.format(self.current_epoch,avg_loss.item()))
  #   return {'avg_val_loss':avg_loss}

model=LightningMNISTClassifier()
#from pytorch_lightning.callbacks import EarlyStopping
trainer=pl.Trainer(max_epochs=400,gpus=1,
                  #  check_val_every_n_epoch=2,
                  #  accumulate_grad_batches=5,
#                   early_stop_callback=early_stop,
                  #  limit_train_batches=50,
#                   val_check_interval=0.25,
                   progress_bar_refresh_rate=0,
#                   num_sanity_val_steps=0,
                   weights_summary=None)
clear_output(wait=True)
trainer.fit(model)

Expected behavior

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py

Environment:

  • Colab and PyCharm
  • PyTorch version: 1.6.0 (CPU and GPU)
  • pytorch-lightning==1.0.0rc3

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 25 (14 by maintainers)

Most upvoted comments

@akihironitta @carmocca I am very thankful for your great effort on this bug. I am looking forward to resuming my project as soon as you update the PL package. In my code, I like to switch between the LBFGS and Adam optimizers: I use LBFGS while the loss is large and then switch to Adam. I hope switching between these two optimizers will be smooth in PL (I had difficulties switching between them in native PyTorch). I will keep you posted if there is any problem.
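A minimal sketch of this kind of optimizer switching using Lightning's manual optimization, assuming a Lightning version where self.automatic_optimization = False and self.manual_backward are available; SwitchingClassifier and loss_threshold are illustrative names, not anything from this issue.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pytorch_lightning as pl


class SwitchingClassifier(pl.LightningModule):
    """Toy MNIST model that starts with LBFGS and hands over to Adam."""

    def __init__(self, loss_threshold=0.5):
        super().__init__()
        self.automatic_optimization = False   # we drive the optimizers ourselves
        self.loss_threshold = loss_threshold  # made-up switching criterion
        self.use_lbfgs = True
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)
        )

    def forward(self, x):
        return torch.log_softmax(self.net(x), dim=1)

    def configure_optimizers(self):
        lbfgs = optim.LBFGS(self.parameters(), lr=1e-2, max_iter=20)
        adam = optim.Adam(self.parameters(), lr=1e-3)
        return [lbfgs, adam]

    def training_step(self, batch, batch_idx):
        x, y = batch
        lbfgs, adam = self.optimizers()

        if self.use_lbfgs:
            last = {}  # LightningOptimizer may not return the closure output

            def closure():
                lbfgs.zero_grad()
                loss = F.nll_loss(self(x), y)
                self.manual_backward(loss)
                last["loss"] = loss.detach()
                return loss

            lbfgs.step(closure=closure)  # LBFGS re-evaluates the closure internally
            loss = last["loss"]
        else:
            adam.zero_grad()
            loss = F.nll_loss(self(x), y)
            self.manual_backward(loss)
            adam.step()
            loss = loss.detach()

        # hand over to Adam once the loss has dropped below the threshold
        if loss.item() < self.loss_threshold:
            self.use_lbfgs = False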

As @justusschock added the tests in https://github.com/PyTorchLightning/pytorch-lightning/pull/4190 and I confirmed locally with cProfile, the number of backward passes (the number of times the closure was called) in PL is 20, which is the same as in native PyTorch, so this should be no problem.
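One way to reproduce that kind of count (not necessarily the exact commands used here) is to profile either script below and filter the stats for the closure; the entry point main() and the output filename are taken from / made up around the examples that follow.

import cProfile
import pstats

# Profile the training entry point (main() in the examples below) and look at
# the ncalls column for the closure to see how often LBFGS re-evaluated it.
cProfile.run("main()", filename="lbfgs.prof")
pstats.Stats("lbfgs.prof").sort_stats("cumulative").print_stats("closure")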

PL code example (originally from @peymanpoozesh)
import os
import warnings

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST

import pytorch_lightning as pl

warnings.filterwarnings("ignore")
pl.seed_everything(42)


class LightningMNISTClassifier(pl.LightningModule):
    def __init__(self):
        super(LightningMNISTClassifier, self).__init__()
        self.layer_1 = nn.Linear(28 * 28, 128)
        self.layer_2 = nn.Linear(128, 256)
        self.layer_3 = nn.Linear(256, 10)

    def forward(self, x):
        batch_size, channels, width, height = x.size()
        x = x.view(batch_size, -1)
        x = self.layer_1(x)
        x = torch.relu(x)
        x = self.layer_2(x)
        x = torch.relu(x)
        x = self.layer_3(x)
        x = torch.log_softmax(x, dim=1)
        return x

    def prepare_data(self):
        transform = transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        )
        mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
        self.mnist_train, self.mnist_val = random_split(
            mnist_train, [55000, 5000], generator=torch.Generator().manual_seed(42)
        )

    def train_dataloader(self):
        dl = DataLoader(self.mnist_train, batch_size=1024, num_workers=0)
        return dl

    def configure_optimizers(self):
        # optimizer = optim.Adam(self.parameters(), lr=1e-3)
        optimizer = optim.LBFGS(self.parameters(), lr=0.01, max_iter=20)
        return optimizer

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = F.nll_loss(logits, y)
        return {"loss": loss}

    def training_step_end(self, outputs):
        print("closure_loss:", outputs["loss"].item())
        return outputs


def main():
    model = LightningMNISTClassifier()
    trainer = pl.Trainer(
        max_epochs=30,
        progress_bar_refresh_rate=0,
        weights_summary=None,
        # fast_dev_run=20,
    )
    trainer.fit(model)


if __name__ == "__main__":
    main()
native PyTorch code example (originally from @peymanpoozesh)
import os
import warnings

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST

from pytorch_lightning import seed_everything

warnings.filterwarnings("ignore")
seed_everything(42)


class PytorchMNISTClassifier(nn.Module):
    def __init__(self):
        super(PytorchMNISTClassifier, self).__init__()
        self.layer_1 = nn.Linear(28 * 28, 128)
        self.layer_2 = nn.Linear(128, 256)
        self.layer_3 = nn.Linear(256, 10)

    def forward(self, x):
        batch_size, channels, width, height = x.size()
        x = x.view(batch_size, -1)
        x = self.layer_1(x)
        x = torch.relu(x)
        x = self.layer_2(x)
        x = torch.relu(x)
        x = self.layer_3(x)
        x = torch.log_softmax(x, dim=1)
        return x


def main():
    device = torch.device("cpu")
    model = PytorchMNISTClassifier().to(device)

    # optimizer=optim.Adam(model.parameters(),lr=1e-3)
    optimizer = optim.LBFGS(model.parameters(), lr=0.01, max_iter=20)

    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )

    mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
    mnist_train, mnist_val = random_split(
        mnist_train, [55000, 5000], generator=torch.Generator().manual_seed(42)
    )

    dl = DataLoader(mnist_train, batch_size=1024, num_workers=0)

    for epoch in range(30):
        for i, (x, y) in enumerate(dl):
            x = x.to(device)
            y = y.to(device)

            def closure():
                logits = model(x)
                optimizer.zero_grad()
                loss = F.nll_loss(logits, y)
                loss.backward(retain_graph=True)
                print("closure_loss:", loss.item())
                return loss

            loss_out = optimizer.step(closure=closure)


if __name__ == "__main__":
    main()
my env
$ wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
$ python collect_env_details.py
* CUDA:
	- GPU:
	- available:         False
	- version:           None
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.7.1+cpu
	- pytorch-lightning: 1.1.4
	- tqdm:              4.56.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         
	- python:            3.8.5
	- version:           #1 SMP Debian 4.19.160-2 (2020-11-28)

I have no idea how I could investigate this further. @carmocca @rohitgr7 Could you help here if you have time…?


EDIT (Jan 28, 2021): Not sure how this helps us debug, but I realised that if we change the value of torch.optim.LBFGS(..., max_iter=20) from the default of 20 to 1 or 2, PL and native PyTorch behave exactly the same, which I confirmed with my example code above. (Neither converges, though.)
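Concretely, that change is just the max_iter argument in configure_optimizers of the PL example above (and the matching LBFGS line in the native script):

# lowering max_iter from the default 20 to 1 or 2 makes PL and native PyTorch match
optimizer = optim.LBFGS(self.parameters(), lr=0.01, max_iter=1)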

@justusschock Fixed! (It was just for print debugging from another script because LightningOptimizer doesn’t return the output of closure())
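A small standalone sketch of the point being made here: a raw torch.optim.LBFGS returns whatever the closure returns from step(), which is handy for print debugging. That unwrapping could be done with self.optimizers(use_pl_optimizer=False) inside a LightningModule, but whether that argument is available depends on the PL version, so treat that part as an assumption.

import torch
import torch.nn as nn
import torch.optim as optim

# Toy setup: a raw (unwrapped) LBFGS optimizer hands back the closure's loss.
model = nn.Linear(2, 1)
optimizer = optim.LBFGS(model.parameters(), lr=0.01, max_iter=5)
x, y = torch.randn(8, 2), torch.randn(8, 1)

def closure():
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

loss = optimizer.step(closure)      # raw optimizer returns the closure output
print("closure loss:", loss.item())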

@akihironitta Why doesn’t optimizer.step(closure=closure) work? Why do you have to unwrap it? Because without unwrapping you also get all the precision support from lightning 😃

Apologies for the delay! We try our best to take a look at every issue with the resources that we have. We bumped the priority for this one and will try to prioritize it in the next sprints!

@williamFalcon @Borda @edenlightning Since this thread will be closed automatically within the next 48 hours, I decided to mention you with the hope that the bug gets fixed within a reasonable time frame. I also appreciate @justusschock's efforts to fix the issue. Ignoring a bug will not fix it, and it seriously stalls the research activities of people who trusted Lightning. Please help us with fixing the bug.

ok will check this if I get some time 😃

This is the code, using MNIST and LBFGS, that works fine with native PyTorch:

import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision import transforms,datasets
from torch.utils.data import DataLoader,random_split


class PytorchMNISTClassifier(nn.Module):
  def __init__(self):
    super(PytorchMNISTClassifier,self).__init__()
    self.layer_1 = nn.Linear(28 * 28, 128)
    self.layer_2 = nn.Linear(128, 256)
    self.layer_3 = nn.Linear(256, 10)
  def forward(self, x):
    batch_size, channels, width, height = x.size()
    x=x.view(batch_size,-1)
    # layer 1
    x = self.layer_1(x)
    x = torch.relu(x)
    # layer 2
    x = self.layer_2(x)
    x = torch.relu(x) 
    # layer 3
    x = self.layer_3(x)
    # probability distribution over labels
    x = torch.log_softmax(x, dim=1)  
    return x 

def cross_entropy_loss(logits,labels):
  return F.nll_loss(logits,labels)

if __name__ == '__main__':

  if torch.cuda.is_available():
    device=torch.device('cuda:0')
  else:
    device=torch.device('cpu')

  model=PytorchMNISTClassifier()
  model=model.to(device)
  # optimizer=optim.Adam(model.parameters(),lr=1e-3)
  optimizer = optim.LBFGS(model.parameters(),lr=0.01)

  transform=transforms.Compose([transforms.ToTensor(), 
                                  transforms.Normalize((0.1307,), (0.3081,))])
  # prepare transforms standard to MNIST
  mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
  mnist_test = MNIST(os.getcwd(), train=False, download=True, transform=transform)  
  mnist_train, mnist_val = random_split(mnist_train, [55000, 5000])

  data=DataLoader(mnist_train,batch_size=1024)

  for epoch in range(10):
    loss_total = 0.
    for i, (x, y) in enumerate(data):
      x = x.to(device)
      y = y.to(device)

      def closure():
        logits = model(x)
        optimizer.zero_grad()
        loss = cross_entropy_loss(logits, y)
        loss.backward(retain_graph=True)
        return loss

      # LBFGS gets the closure so it can re-evaluate the loss within one step
      loss_out = optimizer.step(closure)
      loss_total += loss_out.item()
    print('total_loss--->', loss_total)