pytorch-lightning: custom training with 0.8.0.dev0 gives import error

Due to another issue and the advice to upgrade to master, I upgraded to 0.8.0.dev0. Now the same model, with no code changes, gives this error:

Traceback (most recent call last):
  File "/home/luca/project/apps/train_siamnet.py", line 3, in <module>
    from ..models.siamnet import SiameseNet
ImportError: attempted relative import with no known parent package

This did not happen before and TBH makes no sense to me, as there is no invalid import in the code.

After that, the process just hangs during the GPU initialization phase:

initializing ddp: LOCAL_RANK: 0/3 WORLD_SIZE:4

I am trying to train my model on multiple GPUs and the training code is:

model = SiameseNet(hparams)
if torch.cuda.is_available():
    trainer = Trainer(gpus=-1, distributed_backend='ddp')
else:
    trainer = Trainer()

trainer.fit(model)
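For context, hparams is a plain dict, and since ddp re-launches the script per process, this snippet would typically sit behind a __main__ guard; a sketch with placeholder values (my real config differs):

if __name__ == '__main__':
    # placeholder values; the real paths and hyperparameters differ
    hparams = {'train_data_path': 'data/train',
               'batch_size': 64,
               'learning_rate': 0.001}
    # ... the Trainer code above runs here ...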

The model def is:

class SiameseNet(pl.LightningModule):
    """
    Implement a siamese network as a feature extractor with a Lightning module
    """
    def __init__(self, hparams):
        """
        Build the network
        """
        super(SiameseNet, self).__init__()
        self.net = self._build_net()
        self.hparams = hparams
        self.train_data_path = hparams.get('train_data_path', None)
        self.test_data_path = hparams.get('test_data_path', None)
        self.val_data_path = hparams.get('val_data_path', None)
        self.train_dataset = None
        self.val_dataset = None
        self.test_dataset = None

        self.lossfn = TripletLoss(margin=1.0)

    def forward_once(self, x):
        output = self.net(x)
        output = torch.squeeze(output)
        return output

    def forward(self, input1, input2, input3=None):
        output1 = self.forward_once(input1)
        output2 = self.forward_once(input2)

        if input3 is not None:
            output3 = self.forward_once(input3)
            return output1, output2, output3

        return output1, output2

    @staticmethod
    def _build_net():
        net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(32),

            nn.Conv2d(32, 64, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(64),

            nn.Conv2d(64, 128, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(128),

            nn.Conv2d(128, 256, kernel_size=1, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(256),

            nn.Conv2d(256, 256, kernel_size=1, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(256),

            nn.Conv2d(256, 512, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(512),

            nn.Conv2d(512, 1024, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(1024))

        return net

    def prepare_data(self):
        transform = torchvision.transforms.Compose([
            torchvision.transforms.Resize((128, 128)),
            torchvision.transforms.ColorJitter(hue=.05, saturation=.05),
            torchvision.transforms.RandomHorizontalFlip(),
            torchvision.transforms.RandomRotation(20, resample=PIL.Image.BILINEAR),
            torchvision.transforms.ToTensor()
        ])

        if self.train_data_path:
            train_folder_dataset = dset.ImageFolder(root=self.train_data_path)
            self.train_dataset = SiameseTriplet(image_folder_dataset=train_folder_dataset,
                                                transform=transform)
        if self.val_data_path:
            val_folder_dataset = dset.ImageFolder(root=self.val_data_path)
            self.val_dataset = SiameseTriplet(image_folder_dataset=val_folder_dataset)

        if self.test_data_path:
            test_folder_dataset = dset.ImageFolder(root=self.test_data_path)
            self.test_dataset = SiameseTriplet(image_folder_dataset=test_folder_dataset)

    def training_step(self, batch, batch_idx):
        anchor, positive, negative = batch
        anchor_out, positive_out, negative_out = self.forward(anchor, positive, negative)
        loss_val = self.lossfn(anchor_out, positive_out, negative_out)
        return {'loss': loss_val}

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.hparams.get('learning_rate', 0.001))
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
        return [optimizer], [scheduler]

    @pl.data_loader
    def train_dataloader(self):
        if self.train_dataset:
            return DataLoader(self.train_dataset,
                              self.hparams.get('batch_size', 64),
                              num_workers=12)
        return None
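
Side note: prepare_data populates self.val_dataset, but the class as posted defines no validation hook; a matching val_dataloader in the same style would look like this (a sketch, not part of the original issue):

    @pl.data_loader
    def val_dataloader(self):
        # mirrors train_dataloader; prepare_data fills self.val_dataset
        if self.val_dataset:
            return DataLoader(self.val_dataset,
                              self.hparams.get('batch_size', 64),
                              num_workers=12)
        return None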

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 21 (8 by maintainers)

Most upvoted comments

Yeah, check out the docs, which explain how the new faster ddp works: https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel
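
In short, the new ddp re-launches the training script once per process, so the script must be runnable end-to-end from the command line and its parser must accept the flags the launcher passes along. A minimal entry-point sketch under those assumptions (argument names illustrative):

from argparse import ArgumentParser
import pytorch_lightning as pl

def main(args):
    model = SiameseNet(vars(args))  # the model class from the issue above
    trainer = pl.Trainer(gpus=args.gpus, distributed_backend='ddp')
    trainer.fit(model)

if __name__ == '__main__':
    parser = ArgumentParser()
    # declaring --gpus up front lets the re-launched processes parse it cleanly
    parser.add_argument('--gpus', type=int, default=-1)
    main(parser.parse_args())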


@Shreeyak mind opening a new issue? This is getting a bit confusing, as this issue refers to quite an old version…

@mmiakashs Oh nice. So I bit the bullet and converted to absolute paths, but… then I got the following error:

train_siamnet.py: error: unrecognized arguments: --gpus 4

I am guessing this is similar to what you are experiencing? It seems to be having issues with the command-line arguments. I do not even pass this --gpus argument; it is somehow added by PL and passed back to the training script, which barfs because it does not expect it.
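
A workaround that should cover this (a sketch, not verified against 0.8.0.dev0): either declare --gpus in the parser, or use parse_known_args so injected flags are tolerated:

from argparse import ArgumentParser

parser = ArgumentParser()
# ... the script's real arguments go here ...
parser.add_argument('--gpus', type=int, default=-1)  # option 1: declare the flag PL injects
args, unknown = parser.parse_known_args()            # option 2: ignore any unknown extras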

By following @williamFalcon's suggestion, I used ddp_spawn instead of ddp mode. Without changing any of my previous code, it works perfectly with PL 0.8.0rc1 😃 However, I am not sure how to structure the project to use the new faster ddp mode introduced in 0.8.0rc1, because I was having trouble using argparse to send arguments to the training script. With a single GPU, ddp mode works fine, but with multiple GPUs the arguments are not parsed correctly. Since ddp mode launches the training script multiple times, it seems the arguments are parsed properly on the first invocation but not on the subsequent ones.
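
For reference, the change that made it work was a single string; everything else stayed untouched:

# ddp_spawn spawns worker processes from the running interpreter instead of
# re-launching the script, so argparse only runs once
trainer = Trainer(gpus=-1, distributed_backend='ddp_spawn')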

You can simply add an extra path:

import sys, os
sys.path += [os.path.abspath('..')]

or another path you need, relative to the executed file:

sys.path += [os.path.dirname(__file__)]
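
Putting this together at the top of train_siamnet.py (a sketch; the location of the models package is inferred from the original relative import):

import os, sys
# make the project root (one level above apps/) importable
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from models.siamnet import SiameseNet  # absolute import replaces the relative one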