pytorch-lightning: NCCL error using DDP and PyTorch 1.7
🐛 Bug
Getting this error when attempting to use ddp with the "getting started" autoencoder example:
Stack Trace:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
File "01_getting_started_autoencoder.py", line 66, in <module>
modle, trainer = cli_main()
File "01_getting_started_autoencoder.py", line 60, in cli_main
trainer.fit(model, train_dl)
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
Traceback (most recent call last):
File "/home/user/development/_training/ml/pl-playground/01_getting_started_autoencoder.py", line 66, in <module>
results = self.accelerator_backend.train()
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
self.trainer.is_slurm_managing_tasks
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
torch_backend, rank=global_rank, world_size=world_size
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
modle, trainer = cli_main()
File "/home/user/development/_training/ml/pl-playground/01_getting_started_autoencoder.py", line 60, in cli_main
trainer.fit(model, train_dl)
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
self.trainer.is_slurm_managing_tasks
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
torch_backend, rank=global_rank, world_size=world_size
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
To Reproduce
Follow the code in the "getting started" guide with these parameters to the Trainer:
model = LitAutoEncoder()
trainer = pl.Trainer(gpus='1,2', distributed_backend='ddp')
trainer.fit(model, train_dl)
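For completeness, here is a condensed sketch of the pieces the snippet above assumes (LitAutoEncoder and train_dl), adapted from the Lightning getting-started docs; the dataset and batch size are illustrative, not taken from this issue.

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl

class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # tiny encoder/decoder pair from the docs example
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        x, _ = batch
        x = x.view(x.size(0), -1)
        x_hat = self.decoder(self.encoder(x))
        return F.mse_loss(x_hat, x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":  # standard script guard, recommended for multiprocessing-based backends
    train_dl = DataLoader(
        datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor()),
        batch_size=32,
    )
    model = LitAutoEncoder()
    trainer = pl.Trainer(gpus='1,2', distributed_backend='ddp')
    trainer.fit(model, train_dl)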
Expected behavior
For it to train on multiple GPUs.
Environment
- PyTorch Version: 1.7
- OS (e.g., Linux): Ubuntu 18.04
- How you installed PyTorch (conda, pip, source): pip
- Build command you used (if compiling from source): n/a
- Python version: 3.7
- CUDA/cuDNN version: 10.2/7.6.5
- GPU models and configuration: 2 1080Tis
- Any other relevant information: n/a
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 12
- Comments: 56 (14 by maintainers)
Commits related to this issue
- trying fix for not setting up Fix from https://github.com/PyTorchLightning/pytorch-lightning/issues/4420 — committed to m-dml/plankton-classifier by t-schanz 3 years ago
Actually, I found out the reason. It seems that my unit test is trying to start a world_size=3 job on 2 GPUs. The error message is definitely hard to parse. It would be nice if dist.init_process_group just checked the world_size.
FWIW, gloo backend works fine in this case.
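A minimal sketch of the kind of check suggested above, assuming a single node with one process per GPU; the function and variable names are illustrative, not taken from Lightning or this issue.

import torch
import torch.distributed as dist

def init_ddp(rank: int, world_size: int) -> None:
    # On a single node with one process per GPU, asking for more ranks than
    # visible GPUs (e.g. world_size=3 on 2 GPUs) surfaces later as the opaque
    # "NCCL error ... invalid usage" shown in this issue, so fail early instead.
    n_gpus = torch.cuda.device_count()
    if world_size > n_gpus:
        raise RuntimeError(f"world_size={world_size} exceeds the {n_gpus} visible GPUs")
    # env:// expects MASTER_ADDR / MASTER_PORT to be set by the launcher.
    dist.init_process_group("nccl", init_method="env://", rank=rank, world_size=world_size)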
same error with A100 gpus.
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Fixed it using export NCCL_IB_DISABLE=1.

Have the same issue with a single node, 2x RTX 3090, on Ubuntu 18.04 using PyTorch 1.7, Driver Version: 455.45.01, CUDA Version: 11.1, pytorch-lightning 1.0.8.
Realizing this bug is very misleading, as it seems to be the landing point of every NCCL error for PyTorch and DDP. NCCL errors are varied, since NCCL encompasses CUDA, NVLink, networking (sockets and Infiniband/RoCE), and other mechanisms like shared memory, as well as performing topology detection to optimize communication between GPUs. So different users will have very different problems which need to be solved in different ways.

The first thing to do whenever an NCCL error happens, as suggested by the NCCL troubleshooting page, is to run again with NCCL_DEBUG=WARN. That will give a precise error message of why NCCL failed and hopefully help fix the problem. If that message isn't clear enough, feel free to report the issue to the NCCL GitHub project: https://github.com/nvidia/nccl.

Now, rewinding the bug to try to categorize the different issues…
In the first part of the issue (@min-xu-ai and @ohmeow), the error reported by NCCL is ncclInvalidUsage. With NCCL_DEBUG=WARN, there would be a more explicit message, which would hopefully have helped.

Then, @mhpfuchs probably got a ncclSystemError in the Infiniband code (not ncclInvalidUsage) and "fixed" it by disabling IB. It would have been good to understand what that error was and perhaps fix the IB setup to make it functional, as sockets have much lower performance than IB and a much higher CPU usage. In many cases, that's due to ulimit -l not being high enough to allow proper IB operation. Sometimes there is also an IB interface which is active yet not functional because the network fabric is not properly set up, in which case NCCL_IB_DISABLE=1 is the proper workaround.

After that, @brando90 got a ncclUnhandledCudaError (not ncclSystemError, nor ncclInvalidUsage), like @MInner, @v-nhandt21 and @universome as well, I guess. At least in one case the error could be traced to a specific CUDA failure; setting NCCL_P2P_DISABLE=1 is a proper workaround in that case, but not all CUDA errors are this one, and the solution could be very different depending on which CUDA issue was encountered.

Finally, @awaelchli got a ncclSystemError due to too little shared memory being available; setting NCCL_DEBUG=WARN would probably have printed a warning pointing at the shared-memory shortage and helped fix the problem as well.
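A small sketch of enabling this diagnostic from inside a script rather than the shell; the variable must be set before the first NCCL call (e.g. before trainer.fit or init_process_group).

import os

os.environ["NCCL_DEBUG"] = "WARN"        # or "INFO" for more verbose output
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: choose which subsystems log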
I tested the following with our examples:
- ddp, 1080 Ti, PyTorch 1.7: error
- ddp, 1080 Ti, PyTorch 1.6: good
- ddp, 2080 Ti, PyTorch 1.7: good
- ddp, 2080 Ti, PyTorch 1.6: good
So far I was not able to reproduce it with the PyTorch examples; need to dig deeper.
For those with A100s, export NCCL_P2P_DISABLE=1 and export NCCL_IB_DISABLE=1 works like a charm.
pytorch closed their issue because this issue exists and you close this issue because their issue exists…
This does not work for me; see:
perhaps useful: https://stackoverflow.com/questions/61075390/about-pytorch-nccl-error-unhandled-system-error-nccl-version-2-4-8
For me, I am not 100% sure what fixed it, but I did:
as suggested here: https://pytorch.org/docs/stable/distributed.html#common-environment-variables
I set this in my script: CUDA_VISIBLE_DEVICES="0,1,2,3". Also, I set num_gpus=3.

I found that when num_gpus doesn't match the number of visible GPUs, an NCCL error occurs. When I set num_gpus=4, or CUDA_VISIBLE_DEVICES="0,1,2", it works fine.
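A sketch of what that looks like in a script, with the device list and GPU count kept consistent; the specific indices are illustrative, not taken from this thread.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"   # must be set before CUDA is initialized

import pytorch_lightning as pl

# The number of GPUs requested below matches the three devices exposed above;
# a mismatch between the two counts is what triggered the NCCL error for this commenter.
trainer = pl.Trainer(gpus=3, distributed_backend='ddp')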
I also had the RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8 ncclUnhandledCudaError: Call to CUDA function failed error. When I set NCCL_DEBUG=INFO it also showed NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted. Setting export NCCL_P2P_DISABLE=1 helped, though there might be a performance penalty?

I can confirm the same error using the latest Lightning and PyTorch on Tesla V100s. It does not happen on a single node with 2 GPUs, but once I go to multiple nodes the error happens.
If you land here on this thread because you got an NCCL error and it looks like this (not exactly what the OP posted):

It may be because you have too little shared memory. The solution is to increase the shared memory (google it for your operating system) or, if you use Docker, to set --shm-size="1G" or some acceptable number.

General advice for NCCL errors: run your command with the environment variable NCCL_DEBUG=INFO and collect all the messages it prints.
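Related to the shared-memory point above, a quick way to check how much shared memory is actually available (assuming a Linux host or container, where /dev/shm backs NCCL's shared-memory transport):

import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.2f} GiB, free={free / 2**30:.2f} GiB")
# Docker's default is only 64 MB; if the numbers are that small, restart the
# container with --shm-size="1G" (or larger) as suggested above.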
For others who might run into this: in previous PyTorch Lightning versions, the Trainer received an argument distributed_backend. You now need to rename it to accelerator (see the short sketch below).

Have the same issue with 2x 2080 Ti on Ubuntu 20.04 using PyTorch 1.7 and CUDA 11. Downgrading to PyTorch 1.6 and CUDA 10.2 fixes the issue.
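A one-line sketch of the Trainer-argument rename mentioned a couple of comments above; the exact Lightning versions involved are not pinned in this thread.

import pytorch_lightning as pl

# older versions: pl.Trainer(gpus=2, distributed_backend='ddp')
trainer = pl.Trainer(gpus=2, accelerator='ddp')  # newer argument name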
I have the same issue on 1080 Ti; with V100 GPUs everything works fine.
Same bug on V100 32GB torch1.8 cuda10.1
ok, I can confirm this is only happening on pytorch 1.7
export NCCL_SHM_DISABLE=1. This works for me on V100.
The package versions:
pytorch: 1.8.1
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py +270
In my case, this error is caused by the local_rank parameter not being passed in. It's a bug.
Hi, I'm also getting this error when using multiple GPUs with mixed precision. The package versions are:
The GPUs are A100 with NVIDIA driver Version: 450.51.06, CUDA Version: 11.0.
The error message is
I tried the following env vars but it didn't work:
My trainer call is
Meanwhile, it works with native PyTorch using torch.cuda.amp.autocast() for mixed precision and nn.DataParallel for multi-GPU support. Are there any suggestions for fixing this error?
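For reference, a minimal sketch of the native-PyTorch path this commenter says does work; the tiny model and random data are stand-ins, since the actual code is not shown in the thread.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.DataParallel(nn.Linear(32, 4)).cuda()  # toy model as a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(64, 32, device="cuda")
    y = torch.randint(0, 4, (64,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # mixed-precision forward pass
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()          # scaled backward for AMP
    scaler.step(optimizer)
    scaler.update()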
Yah, I'm using 1.0.4.
Here's the full source for my .py file:
This disables some important NCCL features, and you could simply use the gloo backend instead (which works fine by default). Disabling these features can potentially decrease performance (especially if you use NVLink). However, I just tried your suggestion, and my StyleGAN2-ADA training speed on 4x A6000s not only did not decrease, but even slightly improved (by 2%). Note, though, that I do not have NVLink.
This solution works in my case. Thanks!
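For completeness, a sketch of selecting the gloo backend at the torch.distributed level, as suggested a couple of comments above; how to route this choice through a given Lightning version varies, so only the raw API is shown.

import torch.distributed as dist

# Reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT from the environment,
# which the DDP launcher normally sets for each process.
dist.init_process_group(backend="gloo", init_method="env://")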