pytorch-lightning: Code stuck on "initializing ddp" when using more than one GPU
🐛 Bug
I am trying to run a pytorch lightning model on a 4-GPU node. In my trainer, if I specify
pl.Trainer(gpus=[0])
It runs fine. However, once I add another GPU
pl.Trainer(gpus=[0,1,2,3])
I get this output:
```
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
```
And the model just hangs there forever. I have tried this with only 2 GPUs and get the same behavior.
Any idea why this may happen? I have tried with both ddp and ddp_spawn.
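For what it's worth, below is a hypothetical minimal script of the shape that hangs for me. It is my own sketch rather than the exact code I run, and it assumes a 1.x Lightning API where `accelerator="ddp"` selects DDP and `gpus` takes a list of device indices.

```python
# Hypothetical minimal reproduction (my own sketch, not verbatim what I run);
# assumes a 1.x Lightning API where Trainer accepts gpus=[...] and accelerator="ddp".
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    ds = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
    loader = DataLoader(ds, batch_size=8)
    model = ToyModel()

    # Single GPU runs fine:
    # trainer = pl.Trainer(gpus=[0], max_epochs=1)

    # Hangs right after the "initializing ddp" lines:
    trainer = pl.Trainer(gpus=[0, 1, 2, 3], accelerator="ddp", max_epochs=1)
    trainer.fit(model, loader)
```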
- PyTorch version: tried both 1.4 and 1.7
- OS: Linux
- Installed with pip
- Python version: 3.8.5
- CUDA/cuDNN version: 10.1
- GPU models and configuration: NVIDIA K80s
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 42
- Comments: 81 (24 by maintainers)
This issue still exists with the latest versions of Pytorch and PyTorch Lightning. In my case, it happens with 1080Ti.
Also found it to work on V100s but encountered a similar problem with A100 GPUs.
On A100 GPUs with CUDA 11.0 and PyTorch 1.7, vanilla PyTorch DDP (without Lightning) works fine, but Lightning hangs on "initializing ddp: GLOBAL_RANK: 0". Interestingly, in our setup Lightning works with GPUs [0, 1] and [2, 3] but not with [1, 2]. Vanilla PyTorch DDP works in all cases. Any idea what's going on?
EDIT: after digging further, it seems the reason my vanilla PyTorch example was working was the "gloo" backend; switching to NCCL reproduced the hanging behavior, so this is not a Lightning problem.
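If it helps anyone separate the two cases, here is a rough Lightning-free sanity check (my own sketch, not code from this thread): spawn one process per GPU with torch.multiprocessing and run a single all_reduce, first with gloo and then with nccl. If only the nccl run hangs, the problem is below Lightning.

```python
# Rough backend sanity check without Lightning (my own sketch): if this hangs
# with backend="nccl" but completes with backend="gloo", the issue is in NCCL
# or the GPU topology rather than in Lightning.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size, backend):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    # NCCL reduces GPU tensors; gloo can reduce on CPU.
    t = torch.ones(1, device=f"cuda:{rank}") if backend == "nccl" else torch.ones(1)
    dist.all_reduce(t)  # hangs here if peer communication cannot be established
    print(f"[{backend}] rank {rank}: all_reduce ok, value={t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    for backend in ("gloo", "nccl"):
        mp.spawn(worker, args=(world_size, backend), nprocs=world_size, join=True)
```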
This solved it for me.
Hi, I had this problem too while running PL on my university's SLURM cluster. I was trying to use DDP on 4 GPUs.
Following the tutorial, I set tasks per node to 4 (number of GPUs) and it hung on Initializing DDP.
I solved this by setting tasks per node to 8.
For everyone finding this issue or still having problems: make sure you run with NCCL_DEBUG=INFO and include the output in any bug reports (here or anywhere else). DDP/NCCL hanging can be caused by many things, and setting the NCCL_DEBUG=INFO environment variable often tells us why.
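For completeness, NCCL_DEBUG=INFO is just an environment variable: you can export it in your shell or SLURM script (e.g. `NCCL_DEBUG=INFO python train.py`), or set it at the top of the training script before anything touches DDP. A sketch of the latter:

```python
# Enable NCCL's own logging before any process group is created, so the reason
# for the hang shows up in stdout/stderr. This must run before trainer.fit()
# (or any torch.distributed.init_process_group call).
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: even more verbose output
```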
I had the same issue on my K80 GPU cluster, and I solved it by using the Gloo backend instead of NCCL, as you can see in this discussion (with source code): https://github.com/PyTorchLightning/pytorch-lightning/discussions/6509
I ran into the same problem as well. DP works well in PyTorch Lightning, but DDP sometimes gets stuck on "initializing ddp".
Please fix it.
Oh, maybe you needed the iommu=soft setting. See this PyTorch issue from 2017, https://github.com/pytorch/pytorch/issues/1637#issuecomment-338268158; I had the same problem back then. Or maybe NCCL_P2P_DISABLE=1 is worth a try as well for @jwohlwend.

Thanks for the response @edenlightning. This morning (about 12 hours after my last attempt), I ran the simple boring model and it ran. I also added gpus=[0] and it worked. When I added gpus=[0,1] it worked the first time I ran it. I successfully ctrl+c'd out of it, then tried to run it again and it hangs forever. nvidia-smi shows no processes running, and I can still run the single-GPU approach after this. However, any time I use more than one GPU from now on, it does not run.
Thus, it seems like something bad happens when I ctrl+c a ddp process that blocks it from happening again. Any idea what that might be?
Hi guys, I ran into the same problem; I am using my university's SLURM cluster. Just changing to --ntasks-per-node=1 solved my problem.
@shimengfeng Potentially helpful, but certainly not ideal: forcing a switch to Gloo helped in my case, i.e. setting torch_backend to gloo and switching off the if statement here: https://github.com/PyTorchLightning/pytorch-lightning/blob/1.1.6/pytorch_lightning/plugins/ddp_plugin.py
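For anyone landing here later: recent Lightning releases expose the backend choice without patching ddp_plugin.py. A sketch assuming the 1.6+ DDPStrategy API (check what your installed version actually supports):

```python
# Sketch, assuming pytorch-lightning >= 1.6, where DDPStrategy accepts a
# process_group_backend argument; on the 1.1.x versions discussed here you
# would still need the ddp_plugin.py workaround above (or, if I recall
# correctly, the PL_TORCH_DISTRIBUTED_BACKEND=gloo environment variable).
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(process_group_backend="gloo"),  # use gloo instead of nccl
)
```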
I am currently using torch 1.6 and lightning 1.1.5, and I still get stuck at the initialization step. I can successfully run accelerator = "dp" but just not "ddp". Am I supposed to change anything in my code to make it work? I am under the impression that I can directly change the accelerator and it should run. Also, I am running the code on the Databricks platform, with their p3.8xlarge instances. Not sure whether this will matter?
Yes, I can confirm it is also a PyTorch issue; we have seen it before: https://github.com/PyTorchLightning/pytorch-lightning/issues/5264#issuecomment-751323437 The solution is to just avoid PyTorch 1.7 and go with 1.6 or 1.8.
@justusschock mind taking a look?
Decreasing --cpus-per-task did the trick, and I can also see the NCCL_DEBUG information in the logs now.
The print statements came out as:
and the training was able to start. Thank you @awaelchli!
I am using a SLURM cluster and am experiencing the same problem when I try to use 2 GPUs on the same node for trainer.fit(). I tested using the Boring model and a pytorch torchvision model wrapped in a Lightning module, and the process hangs here:
SLURM flags:
I am able to use DP without issues but not DDP or DDP2. I set NCCL_DEBUG=INFO in my SLURM batch script, but I don't see any extra information. I saw in this issue that num_gpus * num_nodes in the trainer should be the same as --ntasks in SLURM. I have set my trainer as follows:
PyTorch version: 1.9.0, PL version: 1.4.9, GPUs: RTX 2080 Ti, 8 GPUs per node
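To make the num_gpus * num_nodes == ntasks rule concrete, here is a hypothetical pairing of Trainer arguments with the matching SLURM allocation for 2 GPUs on one node (my own illustration, not the actual config above):

```python
# Hypothetical illustration, not the actual config above: under SLURM, Lightning
# expects one task per GPU, so Trainer(gpus=G, num_nodes=N) should match an
# allocation with --nodes=N, --ntasks-per-node=G and --gres=gpu:G.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,             # matches  #SBATCH --ntasks-per-node=2  and  --gres=gpu:2
    num_nodes=1,        # matches  #SBATCH --nodes=1
    accelerator="ddp",  # PL 1.4.x spelling; newer releases use strategy="ddp"
)
```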
Hey, thanks for reporting it. I believe the conclusion of your observation is not correct:
The decision here depends solely on the cluster environment. If the cluster has spawned the processes, Lightning will not spawn them. If the cluster has NOT spawned them, Lightning will do so if requested.
In the latter case, local_rank=0 holds true since at this point the only process present must be the main one!
Let's move this discussion to a separate issue, and then we will request more information about how you are launching the DDP job. Thanks for your understanding.
@calebclayreagor that also happens to me from time to time. We have recently identified some problems where processes don't shut down properly. The chances that this happens will be greatly reduced after #6864 is merged.
You can kill all stuck Python processes with `pkill -9 python`.

As a newbie, I did not know to check my GPU processes using `nvidia-smi`, but it turns out my problem was simply ghost processes that needed to be killed (`kill -9 <pid>`) before initializing ddp again. Closed for me.

I had this same issue, and as per @pgmikhael, switching the torch backend to "gloo" gets past the hang, but the execution is slower than single GPU using "nccl".
I'm having the same problem on my university's SLURM cluster. I'm using PyTorch 1.6 (as mentioned above) but still get stuck on initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2.

@shashank2000 I was told that Databricks does not support DDP as of now, so I ended up using Horovod, and it worked for me that way.
@shimengfeng unfortunately no. I'm not able to reproduce based on all reported conditions. What worked for me is skipping PyTorch 1.7.
I have the same problem. Weirdly, it works well with V100 and P100 GPUs. But when I try using Tesla T4 GPUs, the code hangs.
@justusschock nothing, it just stays frozen. I am forced to ctrl+z.
@justusschock Correct. I can't even ctrl+c; I usually have to ctrl+z and then manually kill the process when I use DDP.