pytorch-lightning: custom training with 0.8.0.dev0 gives import error
Due to another issue and the advice to upgrade to master, I upgraded to 0.8.0.dev0. Now the same model, with no code changes, gives this error:
Traceback (most recent call last):
File "/home/luca/project/apps/train_siamnet.py", line 3, in <module>
from ..models.siamnet import SiameseNet
ImportError: attempted relative import with no known parent package
This did not happen before and, TBH, does not make sense, as there is no such invalid import in my code.
After that, the run just hangs during the GPU/DDP initialising phase:
initializing ddp: LOCAL_RANK: 0/3 WORLD_SIZE:4
I am trying to train my model on multiple GPUs and the training code is:
model = SiameseNet(hparams)

if torch.cuda.is_available():
    trainer = Trainer(gpus=-1, distributed_backend='ddp')
else:
    trainer = Trainer()

trainer.fit(model)
The model def is:
class SiameseNet(pl.LightningModule):
    """
    Implement a Siamese network as a feature extractor with a LightningModule.
    """

    def __init__(self, hparams):
        """
        Build the network.
        """
        super(SiameseNet, self).__init__()
        self.net = self._build_net()
        self.hparams = hparams
        self.train_data_path = hparams.get('train_data_path', None)
        self.test_data_path = hparams.get('test_data_path', None)
        self.val_data_path = hparams.get('val_data_path', None)
        self.train_dataset = None
        self.val_dataset = None
        self.test_dataset = None
        self.lossfn = TripletLoss(margin=1.0)

    def forward_once(self, x):
        output = self.net(x)
        output = torch.squeeze(output)
        return output

    def forward(self, input1, input2, input3=None):
        output1 = self.forward_once(input1)
        output2 = self.forward_once(input2)
        if input3 is not None:
            output3 = self.forward_once(input3)
            return output1, output2, output3
        return output1, output2

    @staticmethod
    def _build_net():
        net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 64, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(64),
            nn.Conv2d(64, 128, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(128),
            nn.Conv2d(128, 256, kernel_size=1, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(256),
            nn.Conv2d(256, 256, kernel_size=1, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(256),
            nn.Conv2d(256, 512, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(512),
            nn.Conv2d(512, 1024, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(1024))
        return net

    def prepare_data(self):
        transform = torchvision.transforms.Compose([
            torchvision.transforms.Resize((128, 128)),
            torchvision.transforms.ColorJitter(hue=.05, saturation=.05),
            torchvision.transforms.RandomHorizontalFlip(),
            torchvision.transforms.RandomRotation(20, resample=PIL.Image.BILINEAR),
            torchvision.transforms.ToTensor()
        ])
        if self.train_data_path:
            train_folder_dataset = dset.ImageFolder(root=self.train_data_path)
            self.train_dataset = SiameseTriplet(image_folder_dataset=train_folder_dataset,
                                                transform=transform)
        if self.val_data_path:
            val_folder_dataset = dset.ImageFolder(root=self.val_data_path)
            self.val_dataset = SiameseTriplet(image_folder_dataset=val_folder_dataset)
        if self.test_data_path:
            test_folder_dataset = dset.ImageFolder(root=self.test_data_path)
            self.test_dataset = SiameseTriplet(image_folder_dataset=test_folder_dataset)

    def training_step(self, batch, batch_idx):
        anchor, positive, negative = batch
        anchor_out, positive_out, negative_out = self.forward(anchor, positive, negative)
        loss_val = self.lossfn(anchor_out, positive_out, negative_out)
        return {'loss': loss_val}

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.hparams.get('learning_rate', 0.001))
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
        return [optimizer], [scheduler]

    @pl.data_loader
    def train_dataloader(self):
        if self.train_dataset:
            return DataLoader(self.train_dataset,
                              self.hparams.get('batch_size', 64),
                              num_workers=12)
        return None
yeah, check out the docs which explain how the new faster ddp works. https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel
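For context, a minimal sketch (not the project's actual file) of how apps/train_siamnet.py could be restructured for the re-launching ddp mode: absolute imports at module level and a __main__ guard, assuming the project root is on PYTHONPATH so that models.siamnet is a valid absolute module path.

# Sketch of a restructured entry-point script; the module path and hparams
# values below are illustrative assumptions, not the project's real ones.
import torch
from pytorch_lightning import Trainer

from models.siamnet import SiameseNet  # absolute import instead of the relative one


def main():
    hparams = {'learning_rate': 0.001, 'batch_size': 64}  # placeholder values
    model = SiameseNet(hparams)
    if torch.cuda.is_available():
        trainer = Trainer(gpus=-1, distributed_backend='ddp')
    else:
        trainer = Trainer()
    trainer.fit(model)


# guard so the script behaves the same when the ddp launcher re-executes it
if __name__ == '__main__':
    main()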
@Shreeyak mind opening a new issue? This is a bit confusing, as this issue refers to quite an old version…
@mmiakashs Oh nice. So I bit the bullet and converted to absolute paths, but… now I get the following error:
I am guessing this is similar to what you are experiencing? It seems to be an issue with the command-line arguments: I do not even pass this --gpus argument. It is somehow added by PL and passed back to the training script, which barfs because it does not expect it.
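One possible workaround, sketched under the assumption that the launcher appends flags the script does not declare (e.g. --gpus), is to parse with argparse's parse_known_args() so unknown flags are tolerated instead of raising "unrecognized arguments":

# sketch: tolerate extra CLI flags appended when the script is re-invoked;
# the specific arguments declared here are just examples
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train_data_path', type=str, default=None)
parser.add_argument('--batch_size', type=int, default=64)

# parse_known_args() returns (known_args, leftover) instead of erroring out
# on flags the parser does not know about
args, unknown = parser.parse_known_args()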
Following @williamFalcon's suggestions, I used ddp_spawn instead of ddp mode. Without changing any of my previous code, it works perfectly with PL 0.8.0rc1 😃 However, I am not sure how to structure the project to use the new, faster ddp mode introduced in 0.8.0rc1, because I was having trouble passing arguments to the training script via argparse. With a single GPU in ddp mode it works fine, but if I switch to multiple GPUs the arguments are not parsed correctly. Since the training script is invoked multiple times in ddp mode, it seems the arguments are parsed properly on the first call but not on the subsequent ones.

You can simply add an extra path (or whatever other path you need, relative to the executed file).
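For example, a minimal sketch (assuming apps/train_siamnet.py sits one level below the project root) that puts the project root on sys.path at the top of the script, so the absolute import resolves even when ddp re-executes the file directly:

# sketch: prepend the project root to sys.path before importing project modules
import os
import sys

PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, PROJECT_ROOT)

from models.siamnet import SiameseNet  # absolute import; module path is an assumption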