pytorch-lightning: Training stuck running on the SLURM cluster with multiple gpus per node

🐛 Bug

I am trying to train a model across multiple nodes on a SLURM cluster, where each node has two GPUs. Therefore, I use the following flags in the Trainer:

trainer = pl.Trainer(
      gpus=2, num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )

and submit the job with sbatch run_training.sh. However, I end up with the following output and nothing further happens:

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4

Are there any other flags I am missing? Thanks for any help. Below you can find the contents of the files used above.

run_training.sh

#!/bin/bash
#SBATCH -o slurm_outfiles/autoencoder-%j-%A-%a.out
#SBATCH -N 2
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

srun python torch_ddp_toy.py

torch_ddp_toy.py

import pytorch_lightning as pl
import torch
from torch import nn

class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()

    datasets = [torch.rand([5]) for __ in range(100)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=8)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=1)

    trainer = pl.Trainer(
      gpus=2, num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )
    trainer.fit(m, train_loader, val_loader)

Environment

  • PyTorch version 1.7.1
  • PyTorch Lightning version 1.2.0
  • CentOS Linux release 8.1.1911
  • PyTorch installed via conda
  • PyTorch Lightning via pip
  • slurm 20.02.3

UPDATE: added the PyTorch Lightning version.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 24 (10 by maintainers)

Most upvoted comments

Removing the num_nodes argument from the training configuration solved the same problem for me.
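
For reference, a minimal sketch of that change against the toy script above (an assumption of what the commenter's Trainer looked like; without num_nodes it falls back to its default of 1):

import pytorch_lightning as pl

# Same settings as the toy script, but without num_nodes
# (it then defaults to 1).
trainer = pl.Trainer(
    gpus=2,
    accelerator='ddp',
    max_epochs=2,
)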

It requires the environment variables from SLURM to be detected:

SLURM_JOB_ID
SLURM_PROCID
SLURM_LOCALID
SLURM_NODEID
SLURM_NTASKS

SLURM_NTASKS must match num_nodes * num_gpus in the Trainer.
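
As a quick sanity check, a small helper like the one below (hypothetical, not part of Lightning) can be called at the top of the training script to print those variables and verify that SLURM_NTASKS matches num_nodes * num_gpus:

import os

def check_slurm_env(num_nodes, gpus_per_node):
    """Print the SLURM variables Lightning relies on and check the task count."""
    for var in ("SLURM_JOB_ID", "SLURM_PROCID", "SLURM_LOCALID",
                "SLURM_NODEID", "SLURM_NTASKS"):
        print(f"{var}={os.environ.get(var)}")
    ntasks = int(os.environ.get("SLURM_NTASKS", "0"))
    expected = num_nodes * gpus_per_node
    assert ntasks == expected, (
        f"SLURM_NTASKS is {ntasks}, but the Trainer expects {expected} "
        f"({num_nodes} nodes x {gpus_per_node} GPUs per node)"
    )

# For the configuration in this issue:
# check_slurm_env(num_nodes=2, gpus_per_node=2)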

This is what resolved the problem for me. These variables are important for it to work, at least on the SLURM version my institution is using. Here is the change in my allocation script:

#SBATCH --tasks-per-node=4
#SBATCH --mem 185G
#SBATCH --cpus-per-task=8
#SBATCH --job-name=train
#SBATCH -o slurm.%x.%j.out
#SBATCH --gres=gpu:v100l:4
#SBATCH --time=44:00:00

Before it was:

#SBATCH --mem 185G
#SBATCH -c 32
#SBATCH --job-name=train
#SBATCH -o slurm.%x.%j.out
#SBATCH --gres=gpu:v100l:4
#SBATCH --time=44:00:00

Hope this helps 😄
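
For completeness, a Trainer configuration matching that allocation might look like the sketch below (an assumption, since the commenter's Trainer call isn't shown; it presumes all four V100s sit on a single node, so SLURM_NTASKS = 4 = num_nodes * gpus):

import pytorch_lightning as pl

# Matches --tasks-per-node=4 and --gres=gpu:v100l:4 on one node:
# SLURM_NTASKS (4) == num_nodes (1) * gpus (4).
trainer = pl.Trainer(
    gpus=4,
    num_nodes=1,
    accelerator='ddp',
)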

Thanks for your answer. I used the same code as above and just changed the parameters to use two GPUs and two nodes:

trainer = pl.Trainer(
      gpus=2,  num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )

The output is a little different, but it still gets stuck at the same stage:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4

Thanks for reporting. Could you update the issue with the PyTorch Lightning version you used, please?