mmdetection: Multi-gpu training gets stuck
Checklist
- I have searched related issues but cannot get the expected help.
- I have read the FAQ documentation but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
Single-GPU training works fine, but single-node multi-GPU training hangs.
Relevant: #3823, #2193, #1979, #4535, maybe #3973
Rolling back intel-openmp doesn’t help, and I am using only the default configs.
I.e. running this is ok:
python tools/train.py configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py
Outputs:
...
2021-11-17 17:33:20,582 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,583 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,584 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,586 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,588 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,589 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:33:20,598 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2021-11-17 17:33:20,615 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}
2021-11-17 17:33:20,621 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
2021-11-17 17:33:23,497 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
2021-11-17 17:33:23,597 - mmdet - WARNING - The model and loaded state dict do not match exactly
size mismatch for roi_head.bbox_head.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([9, 1024]).
size mismatch for roi_head.bbox_head.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([9]).
size mismatch for roi_head.bbox_head.fc_reg.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([32, 1024]).
size mismatch for roi_head.bbox_head.fc_reg.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([32]).
2021-11-17 17:33:23,600 - mmdet - INFO - Start running, host: vince@wombat, work_dir: /home/vince/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_cityscapes
2021-11-17 17:33:23,600 - mmdet - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
...
--------------------
after_val_epoch:
(VERY_LOW ) TextLoggerHook
--------------------
after_run:
(VERY_LOW ) TextLoggerHook
--------------------
2021-11-17 17:33:23,600 - mmdet - INFO - workflow: [('train', 1)], max: 8 epochs
2021-11-17 17:33:23,600 - mmdet - INFO - Checkpoints will be saved to /home/vince/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_cityscapes by HardDiskBackend.
2021-11-17 17:33:54,886 - mmdet - INFO - Epoch [1][100/23720] lr: 1.988e-03, eta: 16:25:12, time: 0.312, data_time: 0.025, memory: 4024, loss_rpn_cls: 0.0439, loss_rpn_bbox: 0.0921, loss_cls: 0.7297, acc: 80.1914, loss_bbox: 0.3776, loss: 1.2433
2021-11-17 17:34:23,521 - mmdet - INFO - Epoch [1][200/23720] lr: 3.986e-03, eta: 15:44:40, time: 0.286, data_time: 0.004, memory: 4073, loss_rpn_cls: 0.0450, loss_rpn_bbox: 0.1027, loss_cls: 0.3770, acc: 86.8848, loss_bbox: 0.2467, loss: 0.7713
Running this is not ok:
./tools/dist_train.sh configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py 2
Outputs:
...
2021-11-17 17:31:12,229 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,230 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,231 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,232 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,233 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,233 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,234 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,236 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,237 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,238 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,239 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,241 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,242 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,246 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,250 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,253 - mmdet - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-11-17 17:31:12,270 - mmdet - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2021-11-17 17:31:12,287 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}
2021-11-17 17:31:12,290 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...
Done (t=0.42s)
creating index...
index created!
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
2021-11-17 17:31:13,772 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
and then just waits.
Environment
sys.platform: linux
Python: 3.8.12 (default, Nov 17 2021, 08:17:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.5.r11.5/compiler.30411180_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.11.1+cu113
OpenCV: 4.5.4-dev
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.18.1+c76ab0e
About this issue
- State: open
- Created 3 years ago
- Comments: 26 (4 by maintainers)
My solution: add the following commands to ~/.bashrc:
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"
Then run source ~/.bashrc. It works for me. No need to switch the backend from nccl to gloo.
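For anyone who would rather not change ~/.bashrc globally, here is a minimal sketch of applying the same two variables to a single launch only. The variable names and values are the ones from the comment above; the helper names (nccl_workaround_env, launch) are made up for illustration, and whether the workaround helps will depend on the machine.

```python
import os
import subprocess

# The two variables and their values come from the comment above.
# Everything else in this sketch (names, structure) is illustrative.
NCCL_WORKAROUND = {
    "NCCL_P2P_DISABLE": "1",  # turn off NCCL's GPU peer-to-peer transport
    "NCCL_IB_DISABLE": "1",   # turn off NCCL's InfiniBand transport
}


def nccl_workaround_env(base_env=None):
    """Return a copy of the environment with the NCCL workaround applied.

    The input mapping (or os.environ) is never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(NCCL_WORKAROUND)
    return env


def launch(cmd):
    """Run cmd (a list of strings) with the workaround set for that process only."""
    return subprocess.run(cmd, env=nccl_workaround_env(), check=False)
```

Called from the mmdetection checkout, launch(["./tools/dist_train.sh", "configs/cityscapes/faster_rcnn_r50_fpn_1x_cityscapes.py", "2"]) would start the same two-GPU run as in the report, with the variables visible only to that run and its child processes.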
@RangiLyu @hhaAndroid I was able to reproduce this issue with Docker. I tried a broad range of settings:
Dockerfile
Based on the official Dockerfile:
Build:
After some digging, I found this discussion, so I tried using the gloo backend instead of nccl, i.e. in configs/_base_/default_runtime.py I changed dist_params = dict(backend='nccl') to dist_params = dict(backend='gloo'). That does make it work on all the test scenarios:
Please note that the CUDA version you used to reproduce this issue (9.0) is really old. It was released in 2017, and I couldn’t even find a PyTorch Docker image for it. It might make sense to test the code with versions that are in more common use now (10.2, 11.3). I’m not sure PyTorch even supports anything older than 10.2, see here.
Please verify whether this is an actual bug that you can reproduce.
I experienced a similar problem. It works for me. The outputs are as follows.
I also changed dist_params = dict(backend='nccl') to dist_params = dict(backend='gloo'). It works for me. Thanks!