pytorch-lightning: DDP is not working with PyTorch Lightning

I am using DDP on a single machine with 2 GPUs. When I run the code, it hangs forever after printing the output below. The code works properly with dp, and also with ddp on a single GPU.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2

To reproduce, I used the BoringModel code below, which produces exactly the output above:

import os
import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    # test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        gpus=2,
        distributed_backend='ddp',  # works with 'dp'; hangs with 'ddp' on 2 GPUs
    )
    trainer.fit(model, train_data, val_data)  # hangs here during DDP initialization
    # trainer.test(model,test_data)

if __name__ == "__main__":
    run()

I ran the code with NCCL_DEBUG=INFO and it produced the log below:

u116642:82796:82796 [0] NCCL INFO Bootstrap : Using enp68s0:35.9.130.234<0>
u116642:82796:82796 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
u116642:82796:82796 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
u116642:82796:82796 [0] NCCL INFO NET/Socket : Using [0]enp68s0:35.9.130.234<0> [1]veth07bbc25:fe80::5897:33ff:fe5e:15ee%veth07bbc25<0> [2]veth2d1e326:fe80::54c0:1cff:fe70:8b39%veth2d1e326<0>
u116642:82796:82796 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
u116642:82845:82845 [1] NCCL INFO Bootstrap : Using enp68s0:35.9.130.234<0>
u116642:82845:82845 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
u116642:82845:82845 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
u116642:82845:82845 [1] NCCL INFO NET/Socket : Using [0]enp68s0:35.9.130.234<0> [1]veth07bbc25:fe80::5897:33ff:fe5e:15ee%veth07bbc25<0> [2]veth2d1e326:fe80::54c0:1cff:fe70:8b39%veth2d1e326<0>
u116642:82845:82845 [1] NCCL INFO Using network Socket
u116642:82796:82908 [0] NCCL INFO Channel 00/04 : 0 1
u116642:82845:82930 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
u116642:82796:82908 [0] NCCL INFO Channel 01/04 : 0 1
u116642:82796:82908 [0] NCCL INFO Channel 02/04 : 0 1
u116642:82796:82908 [0] NCCL INFO Channel 03/04 : 0 1
u116642:82796:82908 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
u116642:82796:82908 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff
u116642:82796:82908 [0] NCCL INFO Channel 00 : 0[1000] -> 1[21000] via P2P/IPC
u116642:82796:82908 [0] NCCL INFO Channel 01 : 0[1000] -> 1[21000] via P2P/IPC
u116642:82796:82908 [0] NCCL INFO Channel 02 : 0[1000] -> 1[21000] via P2P/IPC
u116642:82796:82908 [0] NCCL INFO Channel 03 : 0[1000] -> 1[21000] via P2P/IPC
u116642:82845:82930 [1] NCCL INFO Channel 00 : 1[21000] -> 0[1000] via P2P/IPC
u116642:82845:82930 [1] NCCL INFO Channel 01 : 1[21000] -> 0[1000] via P2P/IPC
u116642:82845:82930 [1] NCCL INFO Channel 02 : 1[21000] -> 0[1000] via P2P/IPC
u116642:82845:82930 [1] NCCL INFO Channel 03 : 1[21000] -> 0[1000] via P2P/IPC
u116642:82796:82908 [0] NCCL INFO Connected all rings
u116642:82796:82908 [0] NCCL INFO Connected all trees
u116642:82796:82908 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
u116642:82796:82908 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
u116642:82845:82930 [1] NCCL INFO Connected all rings
u116642:82845:82930 [1] NCCL INFO Connected all trees
u116642:82845:82930 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
u116642:82845:82930 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
u116642:82796:82908 [0] NCCL INFO comm 0x7f30ac002fb0 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
u116642:82796:82796 [0] NCCL INFO Launch mode Parallel
u116642:82845:82930 [1] NCCL INFO comm 0x7f8578002fb0 rank 1 nranks 2 cudaDev 1 busId 21000 - Init COMPLETE

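For reference, a minimal sketch of enabling the same NCCL logging from inside the repro script rather than from the shell (an illustration only, not necessarily how the run above was launched):

import os

# Must be set before torch/NCCL initializes the process group;
# equivalent to launching the script with NCCL_DEBUG=INFO in the shell.
os.environ["NCCL_DEBUG"] = "INFO"
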
  • PyTorch Version: 1.9
  • OS: Linux
  • PyTorch installed with: pip
  • PyTorch Lightning version: 1.0.0
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration: 2x A6000

cc @tchaton @justusschock

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 20 (9 by maintainers)

Most upvoted comments

Any updates since then? I’m having the same problem with pl==1.5.10 (latest) and 1.4.2, with pytorch==1.7.1 / 1.8.0 and cuda==11.1.

@awaelchli, I mentioned in my comment that I used the latest version of PyTorch Lightning, and the problem still exists.

That does not seem to be the case. The environment collection script still detects 1.0.8. It’s right there in your post.

@H-B-L pytorch-lightning==1.0.0 is not tested against torch==1.9, which I checked in the release tag: https://github.com/PyTorchLightning/pytorch-lightning/tree/1.0.0#continuous-integration.

In addition, 1.0.0 was released more than a year ago, and there have been quite a number of bugfixes and improvements since then, so would you mind updating and giving it another try? You can upgrade by running pip install -U pytorch-lightning.
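
A quick way to confirm which version the interpreter is actually picking up (relevant here, since the environment report and the stated version disagree):

import pytorch_lightning
import torch

# Both print the versions of the packages actually imported by this interpreter.
print(pytorch_lightning.__version__)
print(torch.__version__)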

@tchaton, I haven’t run the code inside a Docker image yet. I will try that.

Going back to the latest PyTorch Lightning and switching the torch distributed backend from ‘nccl’ to ‘gloo’ worked for me, but the ‘gloo’ backend seems to be slower than ‘nccl’. Any other ideas for using ‘nccl’ without hitting this issue? It seems PyTorch Lightning has this problem on some specific GPUs; a bunch of users report the same thing. Check out #4612.
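
For reference, one way to switch the backend is the PL_TORCH_DISTRIBUTED_BACKEND environment variable. A minimal sketch, assuming a Lightning version around 1.5 that still honors this variable (not necessarily how the commenter above did it):

import os
from pytorch_lightning import Trainer

# Ask Lightning to create the torch.distributed process group with the "gloo"
# backend instead of the default "nccl". Must be set before the Trainer is
# instantiated, i.e. before the DDP processes are spawned.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

trainer = Trainer(
    gpus=2,
    strategy="ddp",  # "strategy" replaces the deprecated distributed_backend argument in 1.5+
    max_epochs=1,
)
# trainer.fit(model, train_data, val_data)  # model and dataloaders as in the repro above

Newer releases also expose the process group backend as an argument on the DDP strategy itself, instead of the environment variable.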