mmaction2: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

The following error appears upon finishing epoch 240 (last one): Default process group has not been initialized, please make sure to call init_process_group.

Additionally, the classifier does not seem to learn at all: accuracy after 240 epochs is the same as it was in the first steps.

Reproduction

  1. What command or script did you run?

python train.py configs/skeleton/posec3d/my_config.py --work-dir work_dirs/my_workdir --validate --test-best --gpus 1 --seed 0 --deterministic

  2. Did you make any modifications on the code or config? Did you understand what you have modified?

Only the configuration changes mentioned in: https://github.com/open-mmlab/mmaction2/blob/master/configs/skeleton/posec3d/custom_dataset_training.md

  3. What dataset did you use?

A private dataset of real people performing several pre-defined types of repetitive actions. This dataset contains approximately 5000 samples with 13 different classes.

Environment

'tail' is not recognized as an internal or external command,
operable program or batch file.
'gcc' is not recognized as an internal or external command,
operable program or batch file.
sys.platform: win32
Python: 3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 3080
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3
NVCC: Not Available
GCC: n/a
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.5.4
MMCV: 1.3.8
MMCV Compiler: MSVC 192829912
MMCV CUDA Compiler: 11.3
MMAction2: 0.20.0+61d7eb8

Error traceback


2021-12-23 22:57:06,396 - mmaction - INFO - 
top1_acc	0.2789
top5_acc	0.7224
2021-12-23 22:57:06,396 - mmaction - INFO - Evaluating mean_class_accuracy ...
2021-12-23 22:57:06,398 - mmaction - INFO - 
mean_acc	0.0769
2021-12-23 22:57:06,398 - mmaction - INFO - Epoch(val) [240][98]	top1_acc: 0.2789, top5_acc: 0.7224, mean_class_accuracy: 0.0769
2021-12-23 22:57:08,563 - mmaction - INFO - 972 videos remain after valid thresholding
2021-12-23 22:57:08,564 - mmaction - INFO - load checkpoint from E:\mmaction2\work_dirs\autism_center\best_top1_acc_epoch_20.pth
2021-12-23 22:57:08,564 - mmaction - INFO - Use load_from_local loader
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 972/972, 1.1 task/s, elapsed: 907s, ETA:     0s
Traceback (most recent call last):
  File "E:/mmaction2/tools/train.py", line 201, in <module>
    main()
  File "E:/mmaction2/tools/train.py", line 197, in main
    meta=meta)
  File "E:\mmaction2\mmaction\apis\train.py", line 254, in train_model
    gpu_collect)
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\mmcv\engine\test.py", line 86, in multi_gpu_test
    results = collect_results_cpu(results, len(dataset), tmpdir)
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\mmcv\engine\test.py", line 129, in collect_results_cpu
    dist.barrier()
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 2708, in barrier
    default_pg = _get_default_group()
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 411, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Process finished with exit code 1
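
The traceback above ends in `dist.barrier()` inside `collect_results_cpu`, reached through `multi_gpu_test`, although the job was started with plain `python train.py` and `--gpus 1`, so no default process group existed. A minimal sketch of a defensive guard follows. It is an illustration only, not the fix that was merged into mmaction2. The helper name `barrier_if_distributed` is hypothetical; `is_available`, `is_initialized` and `barrier` are functions of `torch.distributed`. A stand-in object is used so the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def barrier_if_distributed(dist):
    """Call dist.barrier() only when a default process group exists.

    `dist` is expected to be the torch.distributed module, or any object
    exposing is_available(), is_initialized() and barrier().
    Returns True if the barrier was executed, False if it was skipped.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        return True
    return False


# Stand-ins for torch.distributed (hypothetical, for demonstration only).
barrier_calls = []
not_initialized = SimpleNamespace(
    is_available=lambda: True,
    is_initialized=lambda: False,  # the situation in the traceback above
    barrier=lambda: barrier_calls.append("barrier"),
)
initialized = SimpleNamespace(
    is_available=lambda: True,
    is_initialized=lambda: True,  # a process group has been created
    barrier=lambda: barrier_calls.append("barrier"),
)

skipped = barrier_if_distributed(not_initialized)  # barrier is not called
executed = barrier_if_distributed(initialized)     # barrier is called once
```

With PyTorch installed, the same helper would be called as `barrier_if_distributed(torch.distributed)`.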

I couldn’t identify the cause of this error, nor the reason for the low accuracy. Other skeleton-based algorithms managed to learn on this dataset.
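
One observation from the log above: the reported `mean_acc` of 0.0769 equals 1/13 to four decimals, which is the score a classifier gets with 13 classes when it predicts the same single class for every sample. The sketch below reproduces that arithmetic on made-up data. It is a plain-Python reimplementation written for illustration, not mmaction2's own metric code, and the toy labels are hypothetical. It restates the numbers only and does not explain why training collapsed.

```python
def mean_class_accuracy(labels, preds, num_classes):
    """Average of per-class recall, skipping classes that have no samples."""
    recalls = []
    for c in range(num_classes):
        total = sum(1 for y in labels if y == c)
        if total == 0:
            continue
        hits = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical toy data: 13 classes, class 0 is the largest.
num_classes = 13
labels = [0] * 30 + [c for c in range(1, num_classes) for _ in range(6)]

# A collapsed model that outputs class 0 for every sample.
preds = [0] * len(labels)

# top-1 accuracy equals the share of the predicted class in the data.
top1 = sum(y == p for y, p in zip(labels, preds)) / len(labels)

# Recall is 1.0 for class 0 and 0.0 for the other 12 classes, so the mean is 1/13.
mca = mean_class_accuracy(labels, preds, num_classes)
```

Here `top1` is simply the fraction of samples belonging to the one predicted class, while `mca` is 1/13 regardless of how the classes are balanced.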

Thanks in advance!

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

The problems have been fixed now. BTW, we strongly recommend using distributed training and testing (it works even if you have only 1 GPU). The command for distributed training looks like this: bash tools/dist_train.sh {config} {num_gpus} {other_args …}
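
To illustrate why distributed mode avoids the error even on one GPU: once a default process group exists, even one containing a single process, calls such as `dist.barrier()` are valid. The sketch below creates such a group by hand. It is not a copy of what `tools/dist_train.sh` does; the helper names and the port 29500 are assumptions chosen for the example. The `gloo` backend is used because the environment report above shows `USE_NCCL=OFF`. The four environment variables are the ones PyTorch's `env://` initialization reads.

```python
import os


def single_process_dist_env(master_addr="127.0.0.1", master_port=29500):
    """Environment variables used by env:// init for a one-process job."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),  # assumed to be a free port
        "RANK": "0",
        "WORLD_SIZE": "1",
    }


def init_single_process_group(backend="gloo"):
    """Create a default process group that contains only this process."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    # Keep any values a launcher has already exported.
    for key, value in single_process_dist_env().items():
        os.environ.setdefault(key, value)
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_world_size()


env = single_process_dist_env()
```

Calling `init_single_process_group()` before any collective operation should make `dist.is_initialized()` return `True` for the rest of the process.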

Hi, I still get the "Default process group has not been initialized" error when I run the mmaction2_tutorial.ipynb notebook with the latest version of the code. The accuracy is normal, but the error appears during training in the 10th epoch.

Has this problem been fixed by PR #1459?

Not sure; everything seems OK. BTW, have you set img_shape and original_shape to the real video shape (height, width) for each video?

Thank you very much for your work and answers. Where in the code should I set img_shape and original_shape? I ask because I hit the same error when running a SlowFast model on my own dataset.

Sorry, that answer does not seem to be related to your problem: setting the video shape is only required for PoseC3D models.

@JXIONG008

I have the same problem as you. Have you solved it?