pytorch-lightning: DDP initialization breaks on HPC system at larger scale
We are experiencing an issue with DDP at larger scales on our HPC system (Summit at OLCF, which uses the LSF scheduler). The threshold is 14 nodes: at that point initialization no longer completes, and ranks are no longer set up correctly across nodes, as shown in the output below.
Each node has 6 GPUs, so the failing case uses 84 GPUs in total. At 78 GPUs (13 nodes) everything works as expected.
Initialization output at 13 nodes (78 GPUs - this works as expected):
initializing ddp: GLOBAL_RANK: 72, MEMBER: 73/78
initializing ddp: GLOBAL_RANK: 36, MEMBER: 37/78
initializing ddp: GLOBAL_RANK: 10, MEMBER: 11/78
initializing ddp: GLOBAL_RANK: 19, MEMBER: 20/78
initializing ddp: GLOBAL_RANK: 32, MEMBER: 33/78
initializing ddp: GLOBAL_RANK: 12, MEMBER: 13/78
initializing ddp: GLOBAL_RANK: 41, MEMBER: 42/78
initializing ddp: GLOBAL_RANK: 74, MEMBER: 75/78
initializing ddp: GLOBAL_RANK: 27, MEMBER: 28/78
initializing ddp: GLOBAL_RANK: 21, MEMBER: 22/78
initializing ddp: GLOBAL_RANK: 13, MEMBER: 14/78
initializing ddp: GLOBAL_RANK: 73, MEMBER: 74/78
initializing ddp: GLOBAL_RANK: 25, MEMBER: 26/78
initializing ddp: GLOBAL_RANK: 24, MEMBER: 25/78
initializing ddp: GLOBAL_RANK: 28, MEMBER: 29/78
initializing ddp: GLOBAL_RANK: 67, MEMBER: 68/78
initializing ddp: GLOBAL_RANK: 69, MEMBER: 70/78
initializing ddp: GLOBAL_RANK: 68, MEMBER: 69/78
initializing ddp: GLOBAL_RANK: 47, MEMBER: 48/78
initializing ddp: GLOBAL_RANK: 56, MEMBER: 57/78
initializing ddp: GLOBAL_RANK: 71, MEMBER: 72/78
initializing ddp: GLOBAL_RANK: 53, MEMBER: 54/78
initializing ddp: GLOBAL_RANK: 66, MEMBER: 67/78
initializing ddp: GLOBAL_RANK: 70, MEMBER: 71/78
initializing ddp: GLOBAL_RANK: 55, MEMBER: 56/78
initializing ddp: GLOBAL_RANK: 49, MEMBER: 50/78
initializing ddp: GLOBAL_RANK: 44, MEMBER: 45/78
initializing ddp: GLOBAL_RANK: 46, MEMBER: 47/78
initializing ddp: GLOBAL_RANK: 57, MEMBER: 58/78
initializing ddp: GLOBAL_RANK: 51, MEMBER: 52/78
initializing ddp: GLOBAL_RANK: 45, MEMBER: 46/78
initializing ddp: GLOBAL_RANK: 54, MEMBER: 55/78
initializing ddp: GLOBAL_RANK: 50, MEMBER: 51/78
initializing ddp: GLOBAL_RANK: 59, MEMBER: 60/78
initializing ddp: GLOBAL_RANK: 52, MEMBER: 53/78
initializing ddp: GLOBAL_RANK: 43, MEMBER: 44/78
initializing ddp: GLOBAL_RANK: 58, MEMBER: 59/78
initializing ddp: GLOBAL_RANK: 42, MEMBER: 43/78
initializing ddp: GLOBAL_RANK: 48, MEMBER: 49/78
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 78 processes
Failed initialization at 14 nodes (84 GPUs - the job hangs here; note the duplicated global ranks):
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/84
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/84
Here is the code:
```python
import os

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 64),
            nn.ReLU(),
            nn.Linear(64, 3),
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28 * 28),
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        # Logging to TensorBoard by default
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# define datasets/dataloaders
dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train_dl = DataLoader(dataset)

# train
model = LitAutoEncoder()
trainer = pl.Trainer(gpus="0,1,2,3,4,5", auto_select_gpus=True, num_nodes=14,
                     max_epochs=3, accelerator='ddp')
trainer.fit(model, train_dl)
```
- PyTorch Lightning Version (e.g., 1.3.0): 1.4.9
- PyTorch Version (e.g., 1.8): 1.9
- Python version: 3.7
- OS (e.g., Linux):
- CUDA/cuDNN version:
- GPU models and configuration:
- How you installed PyTorch (conda, pip, source): source
- If compiling from source, the output of torch.__config__.show():
'PyTorch built with:\n - GCC 7.3\n - C++ Version: 201402\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - CPU capability usage: VSX\n - CUDA Runtime 11.0\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80\n - CuDNN 8.1.1 (built against CUDA 11.2)\n - Magma 2.5.4\n - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=11.0, CUDNN_VERSION=8.1.1, CXX_COMPILER=/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_build_env/bin/powerpc64le-conda_cos7-linux-gnu-c++, CXX_FLAGS=-fvisibility-inlines-hidden -fmessage-length=0 -mcpu=power8 -mtune=power8 -mpower8-fusion -mpower8-vector -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -I/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol/include -fdebug-prefix-map=/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/work=/usr/local/src/conda/pytorch-base-1.9.0 -fdebug-prefix-map=/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol=/usr/local/src/conda-prefix -D__STDC_FORMAT_MACROS -I/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehol/include -I/sw/peak/cuda/11.0.3/include -I/gpfs/alpine/world-shared/stf007/davismj/open-ce-builds/rhel8-oce-1.4.0/python-env/conda-bld/pytorch-base_1633116212289/_build_env/include -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKLDNN=OFF, USE_MPI=0, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=1,
Comments: 18 (9 by maintainers)
@tchaton I am happy to help however I can, but I can’t claim the implementation. @ajtritt is the one who provided the implementation for the LSFEnvironment that worked for me. I just tested it within my little example.
I may have goofed something with @awaelchli's fix. That method may be valid too, but the one using `LSB_DJOB_RANKFILE` might be easier than `LSB_MCPU_HOSTS`. I will retry the one using `LSB_MCPU_HOSTS` again, though, to see if I made a mistake last time. I think I will also see what those guys think.
Could you help me verify that this fixes the issue by creating this custom cluster env:
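The snippet referenced here did not carry over, so the following is only a rough sketch of what such a custom environment could look like, assuming the PyTorch Lightning 1.4-style `ClusterEnvironment` interface, LSF's `LSB_DJOB_RANKFILE`/`LSB_JOBID` variables, and the `JSM_NAMESPACE_*` variables set by `jsrun` on Summit; the class name `MyLSFEnvironment` and the exact parsing are illustrative rather than the actual proposed fix:

```python
# Rough sketch only (not the exact snippet from this thread): a custom LSF
# cluster environment, assuming the PL 1.4-style ClusterEnvironment interface
# and that jsrun exports the JSM_NAMESPACE_* variables.
import os
import socket

from pytorch_lightning.plugins.environments import ClusterEnvironment


def _hosts_from_rankfile():
    # LSB_DJOB_RANKFILE points to a file listing the job's hosts, one per slot;
    # keep unique hosts in order of first appearance.
    with open(os.environ["LSB_DJOB_RANKFILE"]) as f:
        hosts = [line.strip() for line in f if line.strip()]
    return sorted(set(hosts), key=hosts.index)


class MyLSFEnvironment(ClusterEnvironment):
    """Reads rank/host information from LSF/jsrun environment variables."""

    def creates_children(self) -> bool:
        # Processes are launched externally by jsrun, not spawned by Lightning.
        return True

    def master_address(self) -> str:
        # Use the first unique host in the rank file as the master.
        return _hosts_from_rankfile()[0]

    def master_port(self) -> int:
        # Derive a stable port from the job id so every rank agrees on it.
        return 10000 + int(os.environ["LSB_JOBID"]) % 10000

    def world_size(self) -> int:
        return int(os.environ["JSM_NAMESPACE_SIZE"])

    def set_world_size(self, size: int) -> None:
        pass  # fixed by the scheduler

    def global_rank(self) -> int:
        return int(os.environ["JSM_NAMESPACE_RANK"])

    def set_global_rank(self, rank: int) -> None:
        pass  # fixed by the scheduler

    def local_rank(self) -> int:
        return int(os.environ["JSM_NAMESPACE_LOCAL_RANK"])

    def node_rank(self) -> int:
        # Node rank = position of this host in the unique host list.
        return _hosts_from_rankfile().index(socket.gethostname())
```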
and adding it to your trainer like so:
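Again only a sketch, reusing the hypothetical `MyLSFEnvironment` from above; in PL 1.4 a cluster environment can be passed to the Trainer through the `plugins` argument:

```python
# Sketch: hand the custom cluster environment to the Trainer via `plugins`
# (PyTorch Lightning 1.4-style API); the other arguments mirror the script above.
trainer = pl.Trainer(
    gpus="0,1,2,3,4,5",
    auto_select_gpus=True,
    num_nodes=14,
    max_epochs=3,
    accelerator='ddp',
    plugins=[MyLSFEnvironment()],
)
trainer.fit(model, train_dl)
```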
The main change I made is switching to that environment variable and adjusting the parsing. I simulated it but can't run it for real. If you confirm it works, I can create a fix directly from this.
If things break down on your side, could you let me know the values of the two environment variables `LSB_MCPU_HOSTS` and `LSB_HOSTS`?
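For reference, a quick way to dump those values from inside a job (a convenience snippet, not part of the original thread):

```python
# Print the LSF host-list variables so they can be pasted into the issue.
import os

for var in ("LSB_MCPU_HOSTS", "LSB_HOSTS"):
    print(f"{var}={os.environ.get(var)}")
```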
When trying this override in my small program (listed above), it complains about LSFEnvironment not being defined. I thought maybe you meant `import LSFEnvironment` at the top instead of `import ClusterEnvironment`, but that produces another error. At any rate, I could be doing something silly (quite likely), but can you provide a full example of this override?