mmdetection: Multi-GPU training gets stuck

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

Single-GPU training works fine; single-node multi-GPU training does not. Possibly related: #3823, #2193, #1979, #4535, and maybe #3973. Rolling back intel-openmp does not help, and I am using only the default configs.

That is, running this is OK:

python tools/train.py configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py

Outputs:

...
2021-11-17 17:33:20,582 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,583 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,584 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,586 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,588 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,589 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:33:20,598 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2021-11-17 17:33:20,615 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}                                                                                       
2021-11-17 17:33:20,621 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...
Done (t=0.44s)                                                                                                                                                                                                     
creating index...                                                                                        
index created!                                                                                                                                                                                                     
loading annotations into memory...                
Done (t=0.06s)                                                                                                                                                                                                     
creating index...        
index created!                                                                                                                                                                                                     
2021-11-17 17:33:23,497 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
2021-11-17 17:33:23,597 - mmdet - WARNING - The model and loaded state dict do not match exactly                                                                                                                   
                                                                                                         
size mismatch for roi_head.bbox_head.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([9, 1024]).                                        
size mismatch for roi_head.bbox_head.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([9]).                                                      
size mismatch for roi_head.bbox_head.fc_reg.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([32, 1024]).   
size mismatch for roi_head.bbox_head.fc_reg.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([32]).                                                    
2021-11-17 17:33:23,600 - mmdet - INFO - Start running, host: vince@wombat, work_dir: /home/vince/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_cityscapes
2021-11-17 17:33:23,600 - mmdet - INFO - Hooks will be executed in the following order:                                                                                                                            
before_run:                                                                                              
(VERY_HIGH   ) StepLrUpdaterHook                                                                         
(NORMAL      ) CheckpointHook                     
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                          
...     
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
2021-11-17 17:33:23,600 - mmdet - INFO - workflow: [('train', 1)], max: 8 epochs
2021-11-17 17:33:23,600 - mmdet - INFO - Checkpoints will be saved to /home/vince/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_cityscapes by HardDiskBackend.
2021-11-17 17:33:54,886 - mmdet - INFO - Epoch [1][100/23720]   lr: 1.988e-03, eta: 16:25:12, time: 0.312, data_time: 0.025, memory: 4024, loss_rpn_cls: 0.0439, loss_rpn_bbox: 0.0921, loss_cls: 0.7297, acc: 80.1914, loss_bbox: 0.3776, loss: 1.2433
2021-11-17 17:34:23,521 - mmdet - INFO - Epoch [1][200/23720]   lr: 3.986e-03, eta: 15:44:40, time: 0.286, data_time: 0.004, memory: 4073, loss_rpn_cls: 0.0450, loss_rpn_bbox: 0.1027, loss_cls: 0.3770, acc: 86.8848, loss_bbox: 0.2467, loss: 0.7713

Running this is not OK (it hangs):

./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2

Outputs:

...
2021-11-17 17:31:12,229 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,230 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,231 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,232 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,233 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,233 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,234 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,236 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,237 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,238 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,239 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,241 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,242 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,246 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,250 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,253 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}                                                                         
2021-11-17 17:31:12,270 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}                                                                             
2021-11-17 17:31:12,287 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}                                                                                       
2021-11-17 17:31:12,290 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...                                                                                                                                                                                 
Done (t=0.42s)                                                                                                                                                                                                     
creating index...                                                                                                                                                                                                  
index created!                                                                                                                                                                                                     
loading annotations into memory...                                                                                                                                                                                 
Done (t=0.06s)                                                                                                                                                                                                     
creating index...                                                                                                                                                                                                  
index created!                                                                                                                                                                                                     
2021-11-17 17:31:13,772 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth

and then just waits.
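
To narrow down where the hang happens, a bare NCCL all-reduce can be run on the same two GPUs outside mmdetection. Below is a minimal sketch of my own (the file name nccl_check.py and the use of torchrun are my choices, not part of the repo); if this also hangs, the problem is in NCCL or the GPU interconnect rather than in mmdetection:

# nccl_check.py -- bare-bones NCCL smoke test, independent of mmdetection.
# Launch with: torchrun --nproc_per_node=2 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # After all_reduce, every rank should hold the sum of all ranks (0 + 1 = 1).
    t = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Setting NCCL_DEBUG=INFO in the environment before launching also makes NCCL log which transports (P2P, SHM, NET) it selects, which helps when the hang is interconnect-related.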

Environment

sys.platform: linux
Python: 3.8.12 (default, Nov 17 2021, 08:17:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.11.1+cu113
OpenCV: 4.5.4-dev
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.18.1+c76ab0e

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 26 (4 by maintainers)

Most upvoted comments

My solution is below. Add the following commands to ~/.bashrc:

export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

Then run source ~/.bashrc. It works for me, and there is no need to change the nccl backend to gloo.
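
For context, NCCL_P2P_DISABLE=1 makes NCCL skip direct GPU-to-GPU (peer-to-peer) transfers and go through host memory instead, and NCCL_IB_DISABLE=1 turns off the InfiniBand transport; both are common workarounds when collectives hang on consumer cards. As a small sketch (my own, not part of mmdetection), the same variables can also be set from Python before the process group is created, and CUDA can be asked whether it reports peer access between the two GPUs at all:

import os

import torch

# Must be set in every worker's environment before dist.init_process_group()
# creates the NCCL communicator, otherwise they have no effect.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

# Quick sanity check: does the driver report peer access between GPU 0 and 1?
if torch.cuda.device_count() >= 2:
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
    print("P2P 1 -> 0:", torch.cuda.can_device_access_peer(1, 0))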

@RangiLyu @hhaAndroid I was able to reproduce this issue with Docker; I tried a broad range of settings:

PyTorch   CUDA   cuDNN   Works
1.6.0     10.1   7       ✔️
1.7.0     11.0   8       ❌
1.8.0     11.1   8       ❌
1.9.0     10.2   7       ❌
1.9.0     11.1   8       ❌
1.10.0    11.3   8       ❌
Dockerfile

Based on the official Dockerfile:

ARG PYTORCH="1.6.0"
ARG CUDA="10.1"
ARG CUDNN="7"


FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel

ARG MMCV="1.4.6"

ARG PYTORCH
ARG CUDA

ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"

RUN apt-get update && apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install MMCV
RUN pip install --no-cache-dir --upgrade pip wheel setuptools
RUN ["/bin/bash", "-c", "pip install mmcv-full==${MMCV} -f https://download.openmmlab.com/mmcv/dist/cu${CUDA//./}/torch${PYTORCH}/index.html"]

# Install MMDetection
RUN conda clean --all
RUN git clone https://github.com/open-mmlab/mmdetection.git /mmdetection
WORKDIR /mmdetection
ENV FORCE_CUDA="1"
RUN pip install --no-cache-dir -r requirements/build.txt
RUN pip install --no-cache-dir -e .

Build:

docker build \
    --build-arg PYTORCH=<pytorch> \
    --build-arg CUDA=<cuda> \
    --build-arg CUDNN=<cudnn> \
    -t mmdet:<pytorch>-<cuda>-<cudnn> .

After some digging, I found this discussion, so I tried using the gloo backend instead of nccl, i.e. in configs/_base_/default_runtime.py I changed dist_params = dict(backend='nccl') to dist_params = dict(backend='gloo').
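
For reference, the change is a single line in the runtime config (a sketch; only the dist_params line matters, the other fields in configs/_base_/default_runtime.py stay as they are). Gloo is typically noticeably slower than NCCL for GPU training, so this is a workaround rather than a fix:

# configs/_base_/default_runtime.py (relevant line only)
# Use Gloo instead of NCCL as a workaround for the multi-GPU hang.
dist_params = dict(backend='gloo')  # was: dist_params = dict(backend='nccl')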

That does make it work on all the test scenarios:

PyTorch   CUDA   cuDNN   Works
1.6.0     10.1   7       ✔️
1.7.0     11.0   8       ✔️
1.8.0     11.1   8       ✔️
1.9.0     10.2   7       ✔️
1.9.0     11.1   8       ✔️
1.10.0    11.3   8       ✔️

Please note that the CUDA version you used to reproduce this issue (9.0) is really old; it was released in 2017, and I couldn't even find a PyTorch Docker image for it. It might make sense to test the code with versions that are more commonly used these days (10.2, 11.3). I'm not sure PyTorch even supports anything older than 10.2, see here.

Please verify whether this is an actual bug that you can reproduce.

I experienced a similar problem, and the gloo workaround works for me too. The outputs I was getting are as follows.

2022-04-01 12:15:36,770 - mmseg - INFO - Set random seed to 0, deterministic: False
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/code/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
2022-04-01 12:15:37,557 - mmseg - INFO - initialize ResNetV1c with init_cfg {'type': 'Pretrained', 'checkpoint': 'open-mmlab://resnet101_v1c'}
2022-04-01 12:15:37,558 - mmcv - INFO - load model from: open-mmlab://resnet101_v1c
2022-04-01 12:15:37,558 - mmcv - INFO - load checkpoint from openmmlab path: open-mmlab://resnet101_v1c
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800337 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800337 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800795 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800809 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800795 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801043 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800809 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801043 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2000 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2001 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2002 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2004 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2005 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2006 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 2003) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
tools/train.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2022-04-01_12:45:45
  host      : d7537e2e1710
  rank      : 7 (local_rank: 7)
  exitcode  : -6 (pid: 2007)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2007
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-01_12:45:45
  host      : d7537e2e1710
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 2003)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2003
=====================================================

I also changed dist_params = dict(backend='nccl') to dist_params = dict(backend='gloo').

It works for me. Thanks!