YOLOX: multi GPU training gets stuck after first validation
Does the evaluation keep some RAM allocated so that training starves?
Environment
- 8x NVIDIA A100-SXM4-40GB
- 960 GB RAM allocated to the Docker container (`docker run --shm-size=950g ..`)
- `data_num_workers = 0`

Training command (e.g. with 8 GPUs, but the hang also happens with 2 GPUs):

`python -m yolox.tools.train -f exps/default/yolox_m.py -d 8 -b 64 -o --cache --fp16`

End of log:
...
2022-05-13 18:29:45 | INFO | yolox.core.trainer:352 - Save weights to ./YOLOX_outputs/yolox_m
2022-05-13 18:30:38 | INFO | yolox.evaluators.coco_evaluator:235 - Evaluate in main process...
2022-05-13 18:30:49 | INFO | yolox.evaluators.coco_evaluator:268 - Loading and preparing results...
2022-05-13 18:30:55 | INFO | yolox.evaluators.coco_evaluator:268 - DONE (t=5.86s)
2022-05-13 18:30:55 | INFO | pycocotools.coco:366 - creating index...
2022-05-13 18:30:55 | INFO | pycocotools.coco:366 - index created!
[1/2] c++ -MMD -MF cocoeval.o.d -DTORCH_EXTENSION_NAME=fast_cocoeval -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /YOLOX/yolox/layers/cocoeval/cocoeval.cpp -o cocoeval.o
[2/2] c++ cocoeval.o -shared -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o fast_cocoeval.so
2022-05-13 18:31:10 | INFO | yolox.layers.jit_ops:111 - Load fast_cocoeval op in 14.323s.
2022-05-13 18:31:29 | INFO | yolox.core.trainer:342 -
Average forward time: 17.18 ms, Average NMS time: 3.09 ms, Average inference time: 20.28 ms
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.197
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.340
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.203
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.100
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.221
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.241
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.209
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.353
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.382
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.199
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.423
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.471
About this issue
- State: open
- Created 2 years ago
- Comments: 16 (5 by maintainers)
Hey guys. I found a workaround in my case.
(Please correct me if I'm mistaken 😉)
only allow IP socket communication
Try setting these variables, either in your launch scripts or in Python via `os.environ[XXX] = ...`:
The NCCL_P2P_DISABLE variable disables the peer-to-peer (P2P) transport, which uses CUDA direct access between GPUs over NVLink or PCI.
The NCCL_IB_DISABLE variable disables the IB/RoCE transport that would otherwise be used by NCCL; NCCL then falls back to using IP sockets.
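For example, here is a minimal sketch of setting them from Python before training starts (where exactly to put this, e.g. at the top of the launch script, depends on your setup):

```python
# Sketch: force NCCL to fall back to IP sockets by disabling the P2P and
# IB/RoCE transports. Must be set before any NCCL communicator is created.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"  # disable CUDA peer-to-peer (NVLink/PCI) transport
os.environ["NCCL_IB_DISABLE"] = "1"   # disable InfiniBand/RoCE, fall back to IP sockets
```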
explanation
For distributed training, the subprocesses always initialize `dist.init_process_group` over an IP address, so IP communication seems reasonable.
My colleague told me that other machines may not need this kind of disabling and falling back to IP. I don't know why either.
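For context, this is roughly how a PyTorch process group is brought up over an IP socket (a generic sketch, not the exact YOLOX call; the address, world_size, and rank values are placeholders):

```python
# Sketch: DDP workers rendezvous over TCP/IP even when NCCL later moves
# tensors over NVLink/IB, so plain IP-socket communication is always available.
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                        # GPU collectives via NCCL
    init_method="tcp://127.0.0.1:29500",   # rendezvous over an IP socket (placeholder address)
    world_size=8,                          # placeholder: total number of processes/GPUs
    rank=0,                                # placeholder: this process's rank
)
```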
caution
NCCL_LL_THRESHOLD is often set to zero. I don't know why.
export NCCL_LL_THRESHOLD=0
Caution ❗ These variables may influence model performance.
https://github.com/NVIDIA/nccl/issues/369#issue-678319427
change of start method
In this commit and before, multi-GPU subprocesses are started by `launch_by_subprocess`, which calls `subprocess.Popen`.
I use YOLOX in ByteTrack, which uses this older way of starting multiple processes.
In subsequent commits, the start method is switched to `mp.start_processes`, and there may also be other related but hidden changes. I haven't checked whether switching to the later versions would directly fix the problem. ❤️
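For illustration only, a rough sketch of the two launch styles (the worker body, command-line arguments, and `num_gpus` are placeholders, not the real YOLOX/ByteTrack code):

```python
# Sketch comparing the two launch styles (pick one, not both).
import subprocess
import torch.multiprocessing as mp

num_gpus = 8  # placeholder

# (a) Older style, roughly what launch_by_subprocess does: one OS process per
#     GPU via subprocess.Popen (command-line args here are placeholders).
procs = [subprocess.Popen(["python", "-m", "yolox.tools.train", "--placeholder-args"])
         for _ in range(num_gpus)]

# (b) Newer style: spawn workers through torch.multiprocessing.
def worker(local_rank):
    ...  # set the CUDA device, init the process group, run the training loop

mp.start_processes(worker, nprocs=num_gpus, start_method="spawn")
```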
I’m not familiar with CUDA or NCCL.
However, I think this workaround makes sense, in that it targets the communication between GPUs, and the bug seems to lie in the SYNCHRONIZATION. 😃