pytorch-lightning: DDP initialization breaks on HPC system at larger scale

We are experiencing an issue with DDP at larger scales on our HPC system (Summit at OLCF, which uses the LSF scheduler). The threshold is 14 nodes: at that scale, initialization no longer completes, and the ranks are apparently not being set up correctly across nodes, as shown in the output below.

Each node has 6 GPUs, so the failing case uses 84 GPUs in total. At 78 GPUs (13 nodes) everything works as expected.

Initialization output at 13 nodes (78 GPUs - this works as expected):

initializing ddp: GLOBAL_RANK: 72, MEMBER: 73/78
initializing ddp: GLOBAL_RANK: 36, MEMBER: 37/78
initializing ddp: GLOBAL_RANK: 10, MEMBER: 11/78
initializing ddp: GLOBAL_RANK: 19, MEMBER: 20/78
initializing ddp: GLOBAL_RANK: 32, MEMBER: 33/78
initializing ddp: GLOBAL_RANK: 12, MEMBER: 13/78
initializing ddp: GLOBAL_RANK: 41, MEMBER: 42/78
initializing ddp: GLOBAL_RANK: 74, MEMBER: 75/78
initializing ddp: GLOBAL_RANK: 27, MEMBER: 28/78
initializing ddp: GLOBAL_RANK: 21, MEMBER: 22/78
initializing ddp: GLOBAL_RANK: 13, MEMBER: 14/78
initializing ddp: GLOBAL_RANK: 73, MEMBER: 74/78
initializing ddp: GLOBAL_RANK: 25, MEMBER: 26/78
initializing ddp: GLOBAL_RANK: 24, MEMBER: 25/78
initializing ddp: GLOBAL_RANK: 28, MEMBER: 29/78
initializing ddp: GLOBAL_RANK: 67, MEMBER: 68/78
initializing ddp: GLOBAL_RANK: 69, MEMBER: 70/78
initializing ddp: GLOBAL_RANK: 68, MEMBER: 69/78
initializing ddp: GLOBAL_RANK: 47, MEMBER: 48/78
initializing ddp: GLOBAL_RANK: 56, MEMBER: 57/78
initializing ddp: GLOBAL_RANK: 71, MEMBER: 72/78
initializing ddp: GLOBAL_RANK: 53, MEMBER: 54/78
initializing ddp: GLOBAL_RANK: 66, MEMBER: 67/78
initializing ddp: GLOBAL_RANK: 70, MEMBER: 71/78
initializing ddp: GLOBAL_RANK: 55, MEMBER: 56/78
initializing ddp: GLOBAL_RANK: 49, MEMBER: 50/78
initializing ddp: GLOBAL_RANK: 44, MEMBER: 45/78
initializing ddp: GLOBAL_RANK: 46, MEMBER: 47/78
initializing ddp: GLOBAL_RANK: 57, MEMBER: 58/78
initializing ddp: GLOBAL_RANK: 51, MEMBER: 52/78
initializing ddp: GLOBAL_RANK: 45, MEMBER: 46/78
initializing ddp: GLOBAL_RANK: 54, MEMBER: 55/78
initializing ddp: GLOBAL_RANK: 50, MEMBER: 51/78
initializing ddp: GLOBAL_RANK: 59, MEMBER: 60/78
initializing ddp: GLOBAL_RANK: 52, MEMBER: 53/78
initializing ddp: GLOBAL_RANK: 43, MEMBER: 44/78
initializing ddp: GLOBAL_RANK: 58, MEMBER: 59/78
initializing ddp: GLOBAL_RANK: 42, MEMBER: 43/78
initializing ddp: GLOBAL_RANK: 48, MEMBER: 49/78
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 78 processes

Failed initialization at 14 nodes (84 GPUs - the job hangs at this point):

initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
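
In the 13-node run, every global rank from 0 to 77 appears exactly once. In the failing 14-node run, only global ranks in the range 0-5 ever show up (0, 4 and 5 in the excerpt above), each repeated many times, which looks as if every node thinks it is node 0. A rough sketch of the rank arithmetic we would expect (not the actual Lightning internals, just an illustration):

def expected_global_rank(node_rank: int, local_rank: int, gpus_per_node: int = 6) -> int:
    # Illustrative only: with 6 processes per node, each node should own a
    # contiguous block of 6 global ranks.
    return node_rank * gpus_per_node + local_rank

# 13-node run: node ranks 0..12 cover global ranks 0..77 exactly once.
assert expected_global_rank(node_rank=12, local_rank=5) == 77
# Observed 14-node run: every node appears to use node_rank == 0, so only
# global ranks 0..5 are ever produced, duplicated across nodes.
assert expected_global_rank(node_rank=0, local_rank=5) == 5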

Here is the code:

import os

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 64),
            nn.ReLU(),
            nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28*28)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the training loop.
        # It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)

        # Logging to TensorBoard by default
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# define datasets/dataloaders
dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train_dl = DataLoader(dataset)


# train
model = LitAutoEncoder()
trainer = pl.Trainer(gpus="0,1,2,3,4,5", auto_select_gpus=True, num_nodes=14, max_epochs=3, accelerator='ddp')
trainer.fit(model, train_dl)

Environment:

  • PyTorch Lightning Version: 1.4.9
  • PyTorch Version: 1.9
  • Python version: 3.7
  • OS (e.g., Linux):
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): source
  • If compiling from source, the output of torch.__config__.show():
'PyTorch built with:\n  - GCC 7.3\n  - C++ Version: 201402\n  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n  - CPU capability usage: VSX\n  - CUDA Runtime 11.0\n  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80\n  - CuDNN 8.1.1  (built against CUDA 11.2)\n  - Magma 2.5.4\n  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=11.0, CUDNN_VERSION=8.1.1, CXX_COMPILER=/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_build_env/bin/powerpc64le-conda_cos7-linux-gnu-c++, CXX_FLAGS=-fvisibility-inlines-hidden -fmessage-length=0 -mcpu=power8 -mtune=power8 -mpower8-fusion -mpower8-vector -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -I/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol/include -fdebug-prefix-map=/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/work=/usr/local/src/conda/pytorch-base-1.9.0 -fdebug-prefix-map=/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol=/usr/local/src/conda-prefix -D__STDC_FORMAT_MACROS -I/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol/include -I/sw/peak/cuda/11.0.3/include -I/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_build_env/include -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKLDNN=OFF, USE_MPI=0, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=1,

cc @awaelchli @rohitgr7

Most upvoted comments

Dear @proutrc,

It seems your LSFEnvironment implementation is more reliable at scale. Would you be willing to contribute it and test? I believe it should be easy to mock.

Best, T.C

@tchaton I am happy to help however I can, but I can’t claim the implementation. @ajtritt is the one who provided the LSFEnvironment implementation that worked for me; I just tested it with my small example.

I may have goofed something with @awaelchli’s fix. That method may be valid too, but the one using LSB_DJOB_RANKFILE might be easier than LSB_MCPU_HOSTS. I will retry the one using LSB_MCPU_HOSTS, though, to see whether I made a mistake last time.

I will also check what they think.

Could you help me verify that this fixes the issue by creating this custom cluster env:

import os

from pytorch_lightning.plugins.environments import ClusterEnvironment


class NewLSFEnvironment(LSFEnvironment):

    @staticmethod
    def is_using_lsf() -> bool:
        required_env_vars = ("LSB_JOBID", "LSB_MCPU_HOSTS", "JSM_NAMESPACE_LOCAL_RANK", "JSM_NAMESPACE_SIZE")
        return all(v in os.environ for v in required_env_vars)

    @staticmethod
    def _read_hosts():
        hosts_config = os.environ.get("LSB_MCPU_HOSTS", "")
        if not hosts_config:
            raise ValueError("Could not find hosts in environment variable LSB_MCPU_HOSTS")
        host_config = hosts_config.split()

        if len(host_config) % 2 != 0:
            raise ValueError(
                "Cannot parse hosts from LSB_MCPU_HOSTS environment variable. Expected format:"
                ' "<node0_name> <node0_num_procs> <node1_name> ..."'
            )
        return host_config[::2]

    def _get_master_address(self):
        return self._read_hosts()[0]

and adding it to your trainer like so:

trainer = Trainer(plugins=NewLSFEnvironment())

The main change I made is switching to that environment variable and adjusting the parsing. I simulated it but can’t run it for real. If you confirm it works, I can create a fix directly from this.

If things still break on your side, could you let me know the values of the two environment variables LSB_MCPU_HOSTS and LSB_HOSTS?
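
For reference, this is roughly what the parsing above would do with a made-up LSB_MCPU_HOSTS value (the host names and slot counts are purely illustrative, not from a real job):

import os

# Purely illustrative value: a launch node followed by two compute nodes with 42 slots each.
os.environ["LSB_MCPU_HOSTS"] = "batch1 1 node01 42 node02 42"

host_config = os.environ["LSB_MCPU_HOSTS"].split()
hosts = host_config[::2]   # every other token is a host name
print(hosts)               # ['batch1', 'node01', 'node02']
print(hosts[0])            # 'batch1' -> used as the master address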

When I try this override in my small program (listed above), it complains that LSFEnvironment is not defined. I thought maybe you meant to import LSFEnvironment at the top instead of ClusterEnvironment, but that produces another error. I could well be doing something silly, but could you provide a full example of this override?
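
Putting the pieces together, my guess at the full version (reusing the LitAutoEncoder and train_dl from my example above) is something like the following. The import path for LSFEnvironment is a guess on my part and may well be where I am going wrong, so treat this as a sketch rather than a verified snippet:

import os

import pytorch_lightning as pl
# Assumed import path; depending on the installed Lightning version, LSFEnvironment
# may need to be imported from a different module.
from pytorch_lightning.plugins.environments import LSFEnvironment


class NewLSFEnvironment(LSFEnvironment):

    @staticmethod
    def is_using_lsf() -> bool:
        # Detect an LSF job via environment variables set by LSF/jsrun.
        required_env_vars = ("LSB_JOBID", "LSB_MCPU_HOSTS", "JSM_NAMESPACE_LOCAL_RANK", "JSM_NAMESPACE_SIZE")
        return all(v in os.environ for v in required_env_vars)

    @staticmethod
    def _read_hosts():
        # Parse "<node0_name> <node0_num_procs> <node1_name> ..." into a list of host names.
        hosts_config = os.environ.get("LSB_MCPU_HOSTS", "")
        if not hosts_config:
            raise ValueError("Could not find hosts in environment variable LSB_MCPU_HOSTS")
        host_config = hosts_config.split()
        if len(host_config) % 2 != 0:
            raise ValueError(
                "Cannot parse hosts from LSB_MCPU_HOSTS environment variable. Expected format:"
                ' "<node0_name> <node0_num_procs> <node1_name> ..."'
            )
        return host_config[::2]

    def _get_master_address(self):
        # Use the first host listed in LSB_MCPU_HOSTS as the master address.
        return self._read_hosts()[0]


model = LitAutoEncoder()
trainer = pl.Trainer(gpus="0,1,2,3,4,5", num_nodes=14, max_epochs=3, accelerator="ddp",
                     plugins=NewLSFEnvironment())
trainer.fit(model, train_dl)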