pytorch-lightning: NCCL error using DDP and PyTorch 1.7

🐛 Bug

Getting this error when attempting to use ddp with the “getting started” autoencoder example:

Stack Trace:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
(Both DDP ranks raise the same error, so the raw output is interleaved; the traceback is shown once here.)

Traceback (most recent call last):
  File "01_getting_started_autoencoder.py", line 66, in <module>
    modle, trainer = cli_main()
  File "01_getting_started_autoencoder.py", line 60, in cli_main
    trainer.fit(model, train_dl)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
    torch_backend, rank=global_rank, world_size=world_size
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

To Reproduce

Follow the code in the getting-started guide, passing these parameters to Trainer:

model = LitAutoEncoder()
trainer = pl.Trainer(gpus='1,2', distributed_backend='ddp')
trainer.fit(model, train_dl)

Expected behavior

For it to train on multiple GPUs 😃

Environment

  • PyTorch Version: 1.7
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source): n/a
  • Python version: 3.7
  • CUDA/cuDNN version: 10.2/7.6.5
  • GPU models and configuration: 2x GTX 1080 Ti
  • Any other relevant information: n/a

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 12
  • Comments: 56 (14 by maintainers)

Most upvoted comments

Actually, I found out the reason. It seems that my unit test was trying to start with world_size=3 on 2 GPUs. The error message is definitely hard to parse. It would be nice if dist.init_process_group just checked the world_size.

FWIW, gloo backend works fine in this case.
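For reference, a minimal sketch (not from the original report) of the kind of check that would surface this mismatch before dist.init_process_group is reached; world_size stands for the number of processes you intend to launch:

import torch
import torch.distributed as dist

world_size = 3  # intended number of processes (placeholder value)
if world_size > torch.cuda.device_count():
    raise RuntimeError(
        f"world_size={world_size} exceeds the {torch.cuda.device_count()} visible GPUs; "
        "with one process per GPU this surfaces as an NCCL 'invalid usage' error"
    )
# dist.init_process_group("nccl", rank=rank, world_size=world_size)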

Same error with A100 GPUs.

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

fixed it using export NCCL_IB_DISABLE=1

Have the same issue on a single node with 2x RTX 3090 on Ubuntu 18.04, using PyTorch 1.7, driver version 455.45.01, CUDA version 11.1, pytorch-lightning 1.0.8.

I'm realizing this issue is very misleading, as it seems to be the landing point for every NCCL error with PyTorch and DDP. NCCL errors are varied, since NCCL encompasses CUDA, NVLink, networking (sockets and InfiniBand/RoCE), and other mechanisms like shared memory, and it also performs topology detection to optimize communication between GPUs. So different users will have very different problems, which need to be solved in different ways.

The first thing to do whenever an NCCL error happens, as suggested by the NCCL troubleshooting page, is to run again with NCCL_DEBUG=WARN. That will give a precise error message explaining why NCCL failed and hopefully help fix the problem. If that message isn’t clear enough, feel free to report the issue on the NCCL GitHub project: https://github.com/nvidia/nccl.
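As a concrete illustration (script name and arguments are placeholders), that just means prefixing the failing command with the variable:

$ NCCL_DEBUG=WARN python train.py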

Now, rewinding the bug to try to categorize the different issues…

In the first part of the issue (@min-xu-ai and @ohmeow) the error reported by NCCL is ncclInvalidUsage. With NCCL_DEBUG=WARN, there would be a message like the one below, which would hopefully have helped.

NCCL WARN Duplicate GPU detected : rank 0 and rank 3 both on CUDA device 0

Then, @mhpfuchs probably got a ncclSystemError in the InfiniBand code (not ncclInvalidUsage) and “fixed” it by disabling IB. It would have been good to understand what that error was and perhaps fix the IB setup to make it functional, since sockets have much lower performance than IB and a much higher CPU usage. In many cases, that’s due to ulimit -l not being high enough to allow proper IB operation. Sometimes, though, an IB interface is active yet not functional because the network fabric is not properly set up, in which case NCCL_IB_DISABLE=1 is the proper workaround.
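As a quick illustration of the ulimit point (a sketch; the value your fabric needs depends on the setup), the locked-memory limit can be inspected and raised for the current shell with:

$ ulimit -l              # current max locked memory; "unlimited" is ideal for IB
$ ulimit -l unlimited    # raise it for this shell (persistent changes go in limits.conf)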

After that, @brando90 got a ncclUnhandledCudaError (not ncclSystemError, nor ncclInvalidUsage), like @MInner, @v-nhandt21 and @universome as well I guess. At least in one case, the error was:

NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted

Setting NCCL_P2P_DISABLE=1 is a proper workaround in that case, but not all CUDA errors are this one and the solution could be very different depending on what CUDA issue we encountered.

Finally, @awaelchli got a ncclSystemError due to too little shared memory being available; setting NCCL_DEBUG=WARN would have probably printed something like:

NCCL WARN Error while creating shared memory segment /dev/shm/nccl-... (size ...)

and helped fix the problem as well.
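Summarizing the workarounds from this comment as they would appear in a job script (a sketch; set only the variable that matches the error category you actually hit, since each one disables a transport or feature):

# always start by getting a precise error message
export NCCL_DEBUG=WARN
# ncclInvalidUsage / duplicate GPU: fix the world_size / device mapping, no variable helps
# ncclSystemError in the IB code: check `ulimit -l` first; only then fall back to sockets
export NCCL_IB_DISABLE=1
# ncclUnhandledCudaError ("peer mapping resources exhausted"): disable P2P
export NCCL_P2P_DISABLE=1
# ncclSystemError on /dev/shm: increase shared memory (e.g. docker --shm-size) instead of disabling SHM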

I tested the following with our examples:

  • ddp, 1080 Ti, PyTorch 1.7: error
  • ddp, 1080 Ti, PyTorch 1.6: good
  • ddp, 2080 Ti, PyTorch 1.7: good
  • ddp, 2080 Ti, PyTorch 1.6: good

So far I was not able to reproduce it with the PyTorch examples 😦 need to dig deeper.

For those with A100s, export NCCL_P2P_DISABLE=1 export NCCL_IB_DISABLE=1 works like a charm.

PyTorch closed their issue because this issue exists, and you closed this issue because their issue exists…

fixed it using export NCCL_IB_DISABLE=1

This does not work for me; see:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1432729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8

(automl-meta-learning) miranda9~/ML4Coq $ echo $NCCL_IB_DISABLE
1

perhaps useful: https://stackoverflow.com/questions/61075390/about-pytorch-nccl-error-unhandled-system-error-nccl-version-2-4-8

For me, I am not 100% sure what fixed it, but I did:

$ export NCCL_SOCKET_IFNAME=eth0
$ export NCCL_IB_DISABLE=1

as suggested here: https://pytorch.org/docs/stable/distributed.html#common-environment-variables

I set this in my script: CUDA_VISIBLE_DEVICES='0,1,2,3'. Also, I set num_gpus=3.

I found that when num_gpus doesn't match the number of visible GPUs, the NCCL error occurs. When I set num_gpus=4, or CUDA_VISIBLE_DEVICES='0,1,2', it works fine.
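Per that observation, a small consistency check would look roughly like this (a sketch; num_gpus is the value handed to the trainer/launcher and is a placeholder here):

import os
import torch

num_gpus = 3  # value passed to the trainer/launcher (placeholder)
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
num_visible = len(visible.split(",")) if visible else torch.cuda.device_count()
assert num_gpus == num_visible, (
    f"num_gpus={num_gpus} but {num_visible} GPUs are visible; "
    "this kind of mismatch shows up as an opaque NCCL error"
)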

I also had RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8 ncclUnhandledCudaError: Call to CUDA function failed error. When I set NCCL_DEBUG=INFO it also showed NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted. Setting export NCCL_P2P_DISABLE=1 helped, though there might be a performance penalty?

I can confirm the same error using the latest Lightning and PyTorch on Tesla V100s. It does not happen on a single node with 2 GPUs, but once I go to multiple nodes the error appears.

If you land here on this thread because you got an NCCL error and it looks like this (not exactly what OP posted):

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8 E ncclSystemError: System call (socket, malloc, munmap, etc) failed

It may be because you have too little shared memory. The solution is to increase the shared memory (google it for your operating system) or if you use docker set --shm-size="1G" or some acceptable number.

General advice for NCCL errors: Run your command with the environment variable NCCL_DEBUG=INFO and collect all the messages it prints.
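For the Docker case specifically, a sketch of the relevant flags (image and script names are placeholders; --ipc=host is an alternative to picking a fixed --shm-size):

$ docker run --gpus all --shm-size=1g my-image python train.py
$ docker run --gpus all --ipc=host my-image python train.py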

For others who might run into this:

In previous PyTorch Lightning versions, the Trainer received an argument distributed_backend. You now need to rename it to accelerator.
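A before/after sketch of that rename for the Lightning versions discussed in this thread (gpus=2 is just an example value):

import pytorch_lightning as pl

# older releases:
# trainer = pl.Trainer(gpus=2, distributed_backend="ddp")

# newer releases:
trainer = pl.Trainer(gpus=2, accelerator="ddp")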

Have the same issue with 2x RTX 2080 Ti on Ubuntu 20.04 using PyTorch 1.7 and CUDA 11. Downgrading to PyTorch 1.6 and CUDA 10.2 fixes the issue.

I have the same issue on 1080 Ti cards; with V100 GPUs everything works fine.

Same bug on a V100 32GB with torch 1.8 and CUDA 10.1.

OK, I can confirm this is only happening on PyTorch 1.7.

export NCCL_SHM_DISABLE=1 works for me on V100.

The package versions:
pytorch: 1.8.1

In /usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py, around line 270, I changed

        if not args.use_env:
            cmd.append("--local_rank={}".format(local_rank))
        cmd.extend(args.training_script_args)

to

        cmd.extend(args.training_script_args)
        if not args.use_env:
            cmd.append("--local_rank={}".format(local_rank))

In my case, this error was caused by the local_rank parameter not being passed in. It’s a bug.
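For context, a sketch of how a training script typically receives that parameter when launched via torch.distributed.launch without --use_env (with --use_env, the script reads the LOCAL_RANK environment variable instead):

import argparse
import os
import torch

# torch.distributed.launch appends --local_rank=<n> to the script's argv
# unless --use_env is given, in which case only LOCAL_RANK is set
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)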

Hi, I’m also getting this error when using multiple GPUs with mixed precision. The package versions are:

python=3.7
pytorch==1.8.0
torchvision==0.9.0
torchaudio==0.8.0
cudatoolkit=11.1

The GPUs are A100 with NVIDIA driver Version: 450.51.06, CUDA Version: 11.0.

The error message is

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378098133/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

I tried the following env vars but it didn’t work

export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1

My trainer call is

trainer = pl.Trainer(
        gpus=args.gpus,
        amp_level="O2",
        precision=16,
        accumulate_grad_batches=args.acc_batch_size // args.batch_size,
        accelerator="ddp",
        plugins=DDPPlugin(find_unused_parameters=True),
        max_epochs=args.epochs,
    )

Meanwhile, it works with native PyTorch using torch.cuda.amp.autocast() for mixed-precision and nn.DataParallel for multi-GPU support.
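For comparison, the native path described above looks roughly like this (a sketch with a tiny stand-in model and dataset, not code from this thread):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# tiny stand-in model and data, only to make the sketch self-contained
model = nn.DataParallel(nn.Linear(32, 1)).cuda()
loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # native mixed precision

for x, y in loader:
    x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = F.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()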

Are there any suggestions for fixing this error?

Yeah, I’m using 1.0.4.

Here’s the full source for my .py file:

import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision import transforms
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from torch.utils.data import random_split


# define pl module
class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 64),
            nn.ReLU(),
            nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28*28)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defined the train loop.
        # It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)

        # Logging to TensorBoard by default
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# define datasets/dataloaders
dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train_dl = DataLoader(dataset)


# train
model = LitAutoEncoder()
trainer = pl.Trainer(gpus='0,1', distributed_backend='ddp')
trainer.fit(model, train_dl)

For those with A100s, export NCCL_P2P_DISABLE=1 export NCCL_IB_DISABLE=1 works like a charm.

This disables some important NCCL features, and you could simply use the gloo backend instead (which works fine by default). Disabling these features can potentially decrease performance (especially if you use NVLink). However, I just tried your suggestion, and my StyleGAN2-ADA training speed on 4x A6000s not only did not decrease, but even improved slightly (by 2%). Note, though, that I do not have NVLink.
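For reference, in plain PyTorch the switch to gloo is just a different backend string passed to init_process_group (a sketch; rank, world size, and rendezvous address normally come from your launcher):

import os
import torch.distributed as dist

# single-process defaults, only so the sketch runs standalone
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)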

fixed it using export NCCL_IB_DISABLE=1

This solution works in my case. Thanks!