pytorch-lightning: Code stuck on "initializing ddp" when using more than one GPU
🐛 Bug
I am trying to run a pytorch lightning model on a 4-GPU node. In my trainer, if I specify
pl.Trainer(gpus=[0])
It runs fine. However, once I add another GPU
pl.Trainer(gpus=[0,1,2,3])
I get this output:
```
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
```
And the model just hangs there forever. I have tried this with only 2 GPUs and get the same behavior.
Any idea why this may happen? I have tried with both ddp and ddp_spawn.
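For what it's worth, below is a hypothetical minimal script of the shape that hangs for me. It is my own sketch rather than the exact code I run, and it assumes a 1.x Lightning API where `accelerator="ddp"` selects DDP and `gpus` takes a list of device indices.

```python
# Hypothetical minimal reproduction (my own sketch, not verbatim what I run);
# assumes a 1.x Lightning API where Trainer accepts gpus=[...] and accelerator="ddp".
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    ds = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
    loader = DataLoader(ds, batch_size=8)
    model = ToyModel()

    # Single GPU runs fine:
    # trainer = pl.Trainer(gpus=[0], max_epochs=1)

    # Hangs right after the "initializing ddp" lines:
    trainer = pl.Trainer(gpus=[0, 1, 2, 3], accelerator="ddp", max_epochs=1)
    trainer.fit(model, loader)
```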
- PyTorch version: tried both 1.4 and 1.7
- OS: Linux
- Installed with pip
- Python version: 3.8.5
- CUDA/cuDNN version: 10.1
- GPU models and configuration: NVIDIA K80s
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 42
- Comments: 81 (24 by maintainers)
This issue still exists with the latest versions of Pytorch and PyTorch Lightning. In my case, it happens with 1080Ti.
Also found it to work on V100s but encountered a similar problem with A100 GPUs.
On A100 GPUs with CUDA 11.0 and PyTorch 1.7, vanilla PyTorch DDP (without Lightning) works fine, but Lightning hangs on "initializing ddp: GLOBAL_RANK: 0". Interestingly, in our setup Lightning works with GPUs [0, 1] and [2, 3] but not with [1, 2]. Vanilla PyTorch DDP works in all cases. Any idea what's going on?
EDIT: after digging further, it seems the reason my vanilla PyTorch example was working was the "gloo" backend; switching to NCCL reproduced the hanging behavior, so this is not a Lightning problem.
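If it helps anyone separate the two cases, here is a rough Lightning-free sanity check (my own sketch, not code from this thread): spawn one process per GPU with torch.multiprocessing and run a single all_reduce, first with gloo and then with nccl. If only the nccl run hangs, the problem is below Lightning.

```python
# Rough backend sanity check without Lightning (my own sketch): if this hangs
# with backend="nccl" but completes with backend="gloo", the issue is in NCCL
# or the GPU topology rather than in Lightning.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size, backend):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    # NCCL reduces GPU tensors; gloo can reduce on CPU.
    t = torch.ones(1, device=f"cuda:{rank}") if backend == "nccl" else torch.ones(1)
    dist.all_reduce(t)  # hangs here if peer communication cannot be established
    print(f"[{backend}] rank {rank}: all_reduce ok, value={t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    for backend in ("gloo", "nccl"):
        mp.spawn(worker, args=(world_size, backend), nprocs=world_size, join=True)
```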
This solved it for me.
Hi, I had this problem too while running PL on my university's SLURM cluster. I was trying to use DDP on 4 GPUs.
Following the tutorial, I set tasks per node to 4 (number of GPUs) and it hung on Initializing DDP.
I solved this by setting tasks per node to 8.
For everyone finding this issue or still having problems: make sure you run with NCCL_DEBUG=INFO and include the output in any bug reports (here or anywhere else). DDP/NCCL hanging can be caused by many things, and setting the NCCL_DEBUG=INFO environment variable often tells us why.
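For completeness, NCCL_DEBUG=INFO is just an environment variable: you can export it in your shell or SLURM script (e.g. `NCCL_DEBUG=INFO python train.py`), or set it at the top of the training script before anything touches DDP. A sketch of the latter:

```python
# Enable NCCL's own logging before any process group is created, so the reason
# for the hang shows up in stdout/stderr. This must run before trainer.fit()
# (or any torch.distributed.init_process_group call).
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: even more verbose output
```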
I had the same issue on my K80 GPU cluster, and I solved it by using the Gloo backend instead of NCCL, as you can see in this discussion (with source code): https://github.com/PyTorchLightning/pytorch-lightning/discussions/6509
I ran into the same problem as well. DP works well in PyTorch Lightning, but DDP sometimes gets stuck on "initializing ddp".
Please fix it.
Oh, maybe you needed the iommu=soft setting. See this PyTorch issue from 2017, https://github.com/pytorch/pytorch/issues/1637#issuecomment-338268158; I had the same problem back then. Or maybe NCCL_P2P_DISABLE=1 is worth a try as well for @jwohlwend.

Thanks for the response @edenlightning. This morning (about 12 hours after my last attempt), I ran the simple boring model and it ran. I also added gpus=[0] and it worked. When I added gpus=[0,1] it worked the first time I ran it. I successfully ctrl+c'd out of it, then tried to run it again and it hangs forever. nvidia-smi shows no processes running, and I can still run the single-GPU approach after this. However, any time I use more than one GPU from now on, it does not run.
Thus, it seems like something bad happens when I ctrl+c a ddp process that blocks it from happening again. Any idea what that might be?
Hi guys, I ran into the same problem; I am using my university's SLURM cluster. Just changing to --ntasks-per-node=1 solved my problem.
@shimengfeng Potentially helpful, but certainly not ideal: forcing a switch to Gloo helped in my case, i.e. setting torch_backend to gloo and switching off the if statement here: https://github.com/PyTorchLightning/pytorch-lightning/blob/1.1.6/pytorch_lightning/plugins/ddp_plugin.py
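For anyone landing here later: recent Lightning releases expose the backend choice without patching ddp_plugin.py. A sketch assuming the 1.6+ DDPStrategy API (check what your installed version actually supports):

```python
# Sketch, assuming pytorch-lightning >= 1.6, where DDPStrategy accepts a
# process_group_backend argument; on the 1.1.x versions discussed here you
# would still need the ddp_plugin.py workaround above (or, if I recall
# correctly, the PL_TORCH_DISTRIBUTED_BACKEND=gloo environment variable).
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(process_group_backend="gloo"),  # use gloo instead of nccl
)
```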
I am currently using torch 1.6 and lightning 1.1.5, and I still get stuck at the initialization step. I can successfully run accelerator = "dp" but just not "ddp". Am I supposed to change anything in my code to make it work? I am under the impression that I can directly change the accelerator and it should run. Also, I am running the code on the Databricks platform, with their p3.8xlarge instances. Not sure whether this will matter?
Yes, I can confirm it is also a PyTorch issue; we have seen it before: https://github.com/PyTorchLightning/pytorch-lightning/issues/5264#issuecomment-751323437 The solution is to just avoid PyTorch 1.7 and go with 1.6 or 1.8.
@justusschock mind taking a look?
Decreasing --cpus-per-task did the trick, and I can also see the NCCL_DEBUG information in the logs now.
The print statements came out as:
and the training was able to start. Thank you @awaelchli!
I am using a SLURM cluster and am experiencing the same problem when I try to use 2 GPUs on the same node for trainer.fit(). I tested using the Boring model and a pytorch torchvision model wrapped in a Lightning module, and the process hangs here:
SLURM flags:
I am able to use DP without issues but not DDP or DDP2. I set NCCL_DEBUG=INFO in my SLURM batch script, but I don't see any extra information. I saw in this issue that num_gpus * num_nodes in the trainer should be the same as --ntasks in SLURM. I have set my trainer as follows:
PyTorch version: 1.9.0, PL version: 1.4.9, GPUs: RTX 2080 Ti, 8 GPUs per node
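To make the num_gpus * num_nodes == ntasks rule concrete, here is a hypothetical pairing of Trainer arguments with the matching SLURM allocation for 2 GPUs on one node (my own illustration, not the actual config above):

```python
# Hypothetical illustration, not the actual config above: under SLURM, Lightning
# expects one task per GPU, so Trainer(gpus=G, num_nodes=N) should match an
# allocation with --nodes=N, --ntasks-per-node=G and --gres=gpu:G.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,             # matches  #SBATCH --ntasks-per-node=2  and  --gres=gpu:2
    num_nodes=1,        # matches  #SBATCH --nodes=1
    accelerator="ddp",  # PL 1.4.x spelling; newer releases use strategy="ddp"
)
```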
Hey, thanks for reporting it. I believe the conclusion of your observation is not correct:
The decision here depends solely on the cluster environment. If the cluster has spawned the processes, Lightning will not spawn them. If the cluster has NOT spawned them, Lightning will do so if requested.
In the latter case, local_rank=0 holds true since at this point the only process present must be the main one!
Let's move this discussion to a separate issue, and then we will request more information about how you are launching the DDP job. Thanks for your understanding.
@calebclayreagor that also happens to me from time to time. We have recently identified some problems where processes don't shut down properly. The chances that this happens will be greatly reduced after #6864 is merged.
You can kill all stuck Python processes with `pkill -9 python`.

As a newbie, I did not know to check my GPU processes using `nvidia-smi`, but it turns out my problem was simply ghost processes that needed to be killed (`kill -9 <pid>`) before initializing ddp again. Closed for me.

I had this same issue, and as per @pgmikhael, switching the torch backend to "gloo" gets past the hang, but the execution is slower than single GPU using "nccl".
I'm having the same problem on my university's SLURM cluster. I'm using PyTorch 1.6 (as mentioned above) but still get stuck on initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2.

@shashank2000 I was told that Databricks does not support DDP as of now, so I ended up using Horovod, and it worked for me that way.
@shimengfeng unfortunately no. I'm not able to reproduce based on all reported conditions. What worked for me is skipping PyTorch 1.7.
I have the same problem. Weirdly, it works well with V100 and P100 GPUs. But when I try using Tesla T4 GPUs, the code hangs.
@justusschock nothing, it just stays frozen. I am forced to ctrl+z.
@justusschock Correct. I can't even ctrl+c; I usually have to ctrl+z and then manually kill the process when I use DDP.