mmaction2: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
The following error appears upon finishing epoch 240 (last one):
Default process group has not been initialized, please make sure to call init_process_group.
Additionally, it seems as if the classifier does not learn at all - accuracy remains the same as it was in the first steps even after 240 epochs.
Reproduction
- What command or script did you run?
python train.py configs/skeleton/posec3d/my_config.py --work-dir work_dirs/my_workdir --validate --test-best --gpus 1 --seed 0 --deterministic
- Did you make any modifications on the code or config? Did you understand what you have modified?
Only the configuration changes mentioned in: https://github.com/open-mmlab/mmaction2/blob/master/configs/skeleton/posec3d/custom_dataset_training.md
- What dataset did you use?
A private dataset of real people performing several pre-defined types of repetitive actions. This dataset contains approximately 5000 samples with 13 different classes.
Environment
'tail' is not recognized as an internal or external command,
operable program or batch file.
'gcc' is not recognized as an internal or external command,
operable program or batch file.
sys.platform: win32
Python: 3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 3080
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3
NVCC: Not Available
GCC: n/a
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
- C++ Version: 199711
- MSVC 192829337
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 2019
- LAPACK is enabled (usually provided by MKL)
- CPU capability usage: AVX512
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.2
- Magma 2.5.4
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,
TorchVision: 0.11.1
OpenCV: 4.5.4
MMCV: 1.3.8
MMCV Compiler: MSVC 192829912
MMCV CUDA Compiler: 11.3
MMAction2: 0.20.0+61d7eb8
Error traceback
2021-12-23 22:57:06,396 - mmaction - INFO -
top1_acc 0.2789
top5_acc 0.7224
2021-12-23 22:57:06,396 - mmaction - INFO - Evaluating mean_class_accuracy ...
2021-12-23 22:57:06,398 - mmaction - INFO -
mean_acc 0.0769
2021-12-23 22:57:06,398 - mmaction - INFO - Epoch(val) [240][98] top1_acc: 0.2789, top5_acc: 0.7224, mean_class_accuracy: 0.0769
2021-12-23 22:57:08,563 - mmaction - INFO - 972 videos remain after valid thresholding
2021-12-23 22:57:08,564 - mmaction - INFO - load checkpoint from E:\mmaction2\work_dirs\autism_center\best_top1_acc_epoch_20.pth
2021-12-23 22:57:08,564 - mmaction - INFO - Use load_from_local loader
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 972/972, 1.1 task/s, elapsed: 907s, ETA: 0sTraceback (most recent call last):
File "E:/mmaction2/tools/train.py", line 201, in <module>
main()
File "E:/mmaction2/tools/train.py", line 197, in main
meta=meta)
File "E:\mmaction2\mmaction\apis\train.py", line 254, in train_model
gpu_collect)
File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\mmcv\engine\test.py", line 86, in multi_gpu_test
results = collect_results_cpu(results, len(dataset), tmpdir)
File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\mmcv\engine\test.py", line 129, in collect_results_cpu
dist.barrier()
File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 2708, in barrier
default_pg = _get_default_group()
File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 411, in _get_default_group
"Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Process finished with exit code 1
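The traceback shows `collect_results_cpu` calling `dist.barrier()`, while the command in the Reproduction section starts a single process without a distributed launcher, so no default process group exists at that point. A minimal sketch of a guard, assuming it is acceptable to skip synchronization when no group has been initialized. The helper names `process_group_ready` and `barrier_if_distributed` are hypothetical and are not part of MMCV or MMAction2; only `is_available`, `is_initialized` and `barrier` are real `torch.distributed` calls.

```python
# Hypothetical helpers, not MMCV/MMAction2 API.
# The import is wrapped so the sketch also runs where PyTorch is not installed.
try:
    import torch.distributed as dist
except ImportError:  # assumption: treat a missing PyTorch as "not distributed"
    dist = None


def process_group_ready():
    """Return True only if a default process group has been initialized."""
    return (
        dist is not None
        and dist.is_available()
        and dist.is_initialized()
    )


def barrier_if_distributed():
    """Synchronize ranks when a process group exists; otherwise do nothing.

    Returns True if a barrier was executed, False if it was skipped.
    """
    if process_group_ready():
        dist.barrier()
        return True
    return False


if __name__ == "__main__":
    # In a plain `python train.py ...` run, no group is initialized,
    # so the barrier is skipped instead of raising RuntimeError.
    print("barrier executed:", barrier_if_distributed())
```

In a run like the one reported, `process_group_ready()` would be False, which suggests the multi-GPU test path was entered from a non-distributed run.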
I couldn’t identify the cause of this error, nor the reason for the low accuracy. Other skeleton-based algorithms managed to learn on this dataset.
Thanks in advance!
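One detail in the validation log may help narrow down the low accuracy: the reported mean_class_accuracy of 0.0769 equals 1/13 rounded to four places, and the dataset has 13 classes. That value is what a classifier produces when it assigns every sample to the same class (recall 1 for that class, 0 for the other twelve). The sketch below checks this arithmetic on made-up labels. The function is a simplified per-class recall average written for illustration, not MMAction2's implementation, and the toy labels are assumptions.

```python
NUM_CLASSES = 13  # from the report: 13 different classes


def mean_class_accuracy(labels, preds, num_classes):
    """Average of per-class recall over classes present in `labels`."""
    hits = [0] * num_classes
    counts = [0] * num_classes
    for y, p in zip(labels, preds):
        counts[y] += 1
        hits[y] += int(y == p)
    recalls = [h / c for h, c in zip(hits, counts) if c > 0]
    return sum(recalls) / len(recalls)


def top1_accuracy(labels, preds):
    """Fraction of samples whose prediction matches the label."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


# Toy data (assumption): class 0 is over-represented, all 13 classes appear.
toy_labels = [0] * 4 + list(range(1, NUM_CLASSES))
# A collapsed model that predicts class 0 for every sample.
toy_preds = [0] * len(toy_labels)

collapsed_mca = mean_class_accuracy(toy_labels, toy_preds, NUM_CLASSES)
collapsed_top1 = top1_accuracy(toy_labels, toy_preds)

if __name__ == "__main__":
    print(round(collapsed_mca, 4))   # 0.0769, i.e. 1/13
    print(round(collapsed_top1, 4))  # 0.25, the share of class 0 in the toy data
```

This is consistent with, but does not prove, the model predicting a single class throughout training; inspecting the distribution of predicted labels on the validation set would confirm or rule it out.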
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (9 by maintainers)
Has this problem been fixed by PR #1459?
Sorry, the answer does not seem to be related to your problem: setting the video shape is only required for PoseC3D models.
@JXIONG008
I have the same problem as you. Have you solved it?