mmdetection: deadlock using Wandb

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

Hello mmdet developers,

We found that the training loop can deadlock in some places if we use multi-GPU training and enable wandb tracking. Single-GPU training works perfectly fine. I only tested with YOLOX. Please see the command below.

Reproduction

  1. What command or script did you run?
./tools/dist_train.sh ./configs/yolox/yolox_s_8x8_300e_coco.py 2
  2. Did you make any modifications on the code or config? Did you understand what you have modified? No
  3. What dataset did you use? MSCOCO

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0,1: Quadro GV100
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  • GCC 7.3
  • C++ Version: 201402
  • Intel® oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel® 64 architecture applications
  • Intel® MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.0
OpenCV: 4.5.5
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.25.0+ca11860

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

We used the provided Docker image.

Error traceback

If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 19 (3 by maintainers)

Most upvoted comments

Finally managed to solve this by setting reset_flag=True in TextLoggerHook. (Although it’s kind of an ugly fix…)

i.e., use a config like the following:

log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', reset_flag=True),
        dict(type='MMDetWandbHook')  # set MMDetWandbHook properly here
    ])
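For reference (this is only an illustration, not part of the original comment), a filled-in MMDetWandbHook entry could look roughly like the sketch below, using the arguments documented for MMDetWandbHook in mmdet 2.25 (init_kwargs, interval, log_checkpoint, num_eval_images); double-check the names against your version, and note that the project name is hypothetical:

log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', reset_flag=True),
        dict(type='MMDetWandbHook',
             init_kwargs=dict(project='my-project'),  # hypothetical W&B project name
             interval=50,
             log_checkpoint=False,
             num_eval_images=100)
    ])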

As far as I have investigated, the deadlock is caused by the dist.all_reduce() call here: https://github.com/open-mmlab/mmdetection/blob/v2.25.0/mmdet/models/detectors/base.py#L204
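For context, the linked line sits inside BaseDetector._parse_losses, where each logged loss value is averaged across ranks. The snippet below is a rough, from-memory paraphrase of that step (see the linked source for the exact code), wrapped in a standalone helper so it can be read in isolation; the point is that every rank must enter the all_reduce together:

# Rough paraphrase of the loss-reduction step in BaseDetector._parse_losses;
# not the verbatim library code. If some ranks are busy issuing a different
# collective (the dist.reduce discussed below), this all_reduce never completes.
import torch
import torch.distributed as dist

def average_loss_across_ranks(loss_value: torch.Tensor) -> float:
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    return loss_value.item()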

The reason seems to be:

  • reset_flag of the logger hook with the lowest priority should be set to True so that runner.log_buffer is cleared once logging is done.
  • If you put MMDetWandbHook after TextLoggerHook in your config file, MMDetWandbHook has the lowest priority, so its reset_flag is set to True in the processes of all GPUs.
  • Setting reset_flag=True on MMDetWandbHook is fine on GPU 0, but on the GPUs other than 0 the hook that actually runs last, TextLoggerHook, never gets reset_flag=True set properly
    • because MMDetWandbHook is basically master_only.
  • Consequently, on GPUs other than 0, TextLoggerHook never clears runner.log_buffer and executes dist.reduce(mem_mb, 0, op=dist.ReduceOp.MAX) to collect the memory size at every iteration, ignoring the logging interval. That dist.reduce seems to block the dist.all_reduce() mentioned at the beginning, but I'm not quite sure; see the sketch of such a collective mismatch right after this list.
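To make the suspected mechanism concrete, here is a minimal, self-contained sketch (not mmdetection code; gloo backend and two CPU processes assumed) in which rank 0 issues an all_reduce while rank 1 issues a reduce. Collectives have to be issued in the same order on every rank, so this is expected to hang much like the training loop does (mismatched collectives are undefined behaviour, so it may also crash or return garbage):

# Illustration only: mismatched collectives across ranks.
# Rank 0 plays the role of _parse_losses (all_reduce); rank 1 plays the role
# of TextLoggerHook collecting GPU memory (reduce). Expect this to block.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    t = torch.ones(1)
    if rank == 0:
        dist.all_reduce(t)                           # waits for everyone's all_reduce
    else:
        dist.reduce(t, dst=0, op=dist.ReduceOp.MAX)  # waits for everyone's reduce
    print(f'rank {rank} finished')                   # normally never reached

if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)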

P.S. I used the following environment:

  • mmdet==2.25.0
    • ./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py
  • mmcv==1.5.0
  • wandb==0.12.17
  • multi-GPU machine

I experience the same phenomenon (a deadlock of over 30 minutes) with dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py, but only in the distributed (more than 1 GPU) setting. 1-GPU training is okay.