YOLOX: multi GPU training gets stuck after first validation

Does the evaluation keep some RAM allocated so that training starves?

Environment

  • 8x NVIDIA A100-SXM4-40GB
  • 960GB RAM allocated via Docker
  • docker run --shm-size=950g ..
  • data_num_workers = 0

Example with 8 GPUs (the hang also happens with 2 GPUs):

python -m yolox.tools.train -f exps/default/yolox_m.py -d 8 -b 64 -o --cache --fp16

end of log:

...
2022-05-13 18:29:45 | INFO     | yolox.core.trainer:352 - Save weights to ./YOLOX_outputs/yolox_m
2022-05-13 18:30:38 | INFO     | yolox.evaluators.coco_evaluator:235 - Evaluate in main process...
2022-05-13 18:30:49 | INFO     | yolox.evaluators.coco_evaluator:268 - Loading and preparing results...
2022-05-13 18:30:55 | INFO     | yolox.evaluators.coco_evaluator:268 - DONE (t=5.86s)
2022-05-13 18:30:55 | INFO     | pycocotools.coco:366 - creating index...
2022-05-13 18:30:55 | INFO     | pycocotools.coco:366 - index created!
[1/2] c++ -MMD -MF cocoeval.o.d -DTORCH_EXTENSION_NAME=fast_cocoeval -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /YOLOX/yolox/layers/cocoeval/cocoeval.cpp -o cocoeval.o 
[2/2] c++ cocoeval.o -shared -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o fast_cocoeval.so
2022-05-13 18:31:10 | INFO     | yolox.layers.jit_ops:111 - Load fast_cocoeval op in 14.323s.
2022-05-13 18:31:29 | INFO     | yolox.core.trainer:342 - 
Average forward time: 17.18 ms, Average NMS time: 3.09 ms, Average inference time: 20.28 ms
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.197
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.340
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.203
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.100
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.221
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.241
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.209
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.353
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.199
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.423
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.471

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Hey guys. I found a workaround in my case.

(please correct me if I’m mistaken😉)

only allows IP socket communication

Try setting these variables, either in your launch scripts or in Python with os.environ["XXX"] = "...":

export NCCL_LL_THRESHOLD=0
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
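
Equivalently, the same variables can be set from Python (a minimal sketch; as far as I understand, they must be set before the process group / NCCL communicators are created):

import os

# Set before torch.distributed.init_process_group / the first collective,
# otherwise NCCL may already have chosen its transports.
os.environ["NCCL_LL_THRESHOLD"] = "0"  # size threshold for the low-latency protocol; 0 effectively disables it
os.environ["NCCL_P2P_DISABLE"] = "1"   # no CUDA peer-to-peer (NVLink/PCI) between GPUs
os.environ["NCCL_IB_DISABLE"] = "1"    # no InfiniBand/RoCE; NCCL falls back to IP sockets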

The NCCL_P2P_DISABLE variable disables the peer-to-peer (P2P) transport, which uses CUDA direct access between GPUs over NVLink or PCI.

The NCCL_IB_DISABLE variable disables the InfiniBand/RoCE transport that NCCL would otherwise use. Instead, NCCL falls back to IP sockets.

explanation

For distributed training, the worker subprocesses are always initialized via dist.init_process_group with a TCP/IP dist_url.

So restricting NCCL to plain IP socket communication seems reasonable.
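
For reference, this is roughly what that initialization looks like (a simplified sketch, not YOLOX’s exact code; the address, port, and env-based rank handling are placeholders):

import os
import torch.distributed as dist

# Each worker joins the group through a TCP endpoint, so the rendezvous
# between ranks already goes over plain IP sockets.
dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:29500",             # placeholder dist_url
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
    rank=int(os.environ.get("RANK", 0)),
)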

My colleague told me that on other machines this kind of disabling and falling back to IP may not be needed. I don’t know why, either.

caution

NCCL_LL_THRESHOLD is often set to zero. I don’t know why.

export NCCL_LL_THRESHOLD=0

Caution ❗ These settings may influence model performance.

https://github.com/NVIDIA/nccl/issues/369#issue-678319427

change of start method

In this commit and before, multi-GPU subprocesses are started by launch_by_subprocess, which calls subprocess.Popen.

I use YOLOX through ByteTrack, which still uses this older way of starting multiple processes:

launch_by_subprocess(
    sys.argv,
    world_size,
    num_machines,
    machine_rank,
    num_gpus_per_machine,
    dist_url,
    args,
)

In subsequent commits, the start method was switched to mp.start_processes, and there may also be other related but less obvious changes.

I haven’t checked whether switching to the later versions directly fixes the problem.❤️
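
For comparison, a simplified sketch of the two launch patterns (not the exact YOLOX/ByteTrack code; the LOCAL_RANK handling and main_func are placeholders):

import os
import subprocess
import sys

import torch.multiprocessing as mp

# Older pattern (launch_by_subprocess): re-run the training command once per
# GPU as a separate OS process, then wait for all of them.
def launch_old(num_gpus):
    procs = []
    for rank in range(num_gpus):
        env = {**os.environ, "LOCAL_RANK": str(rank)}  # placeholder rank handling
        procs.append(subprocess.Popen([sys.executable] + sys.argv, env=env))
    for p in procs:
        p.wait()

# Newer pattern: spawn the per-GPU worker function in-process.
def launch_new(main_func, num_gpus):
    # main_func(rank) is called in each spawned process.
    mp.start_processes(main_func, nprocs=num_gpus, start_method="spawn", join=True)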

I’m not familiar with CUDA or NCCL.

However, I think this workaround makes sense, in that it targets the communication between GPUs, and the bug seems to lie in the synchronization.😃