accelerate: accelerator.prepare(model) hangs during multi-GPU training on a single machine
My simple script hangs when I call accelerator.prepare(model).
System Info
- One machine
- 4 GPUs (A40)
accelerator config:
- `Accelerate` version: 0.20.3
- Platform: Linux-5.15.0-72-generic-x86_64-with-glibc2.35
- Python version: 3.10.9
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.0.1 (True)
- PyTorch XPU available: False
- System RAM: 251.54 GB
- GPU type: NVIDIA A40
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
```python
import torch
import torch.nn as nn
from accelerate import Accelerator

if __name__ == "__main__":
    accelerator = Accelerator()
    model = nn.Conv2d(10, 20, 3, 1, 1)
    accelerator.print("1")
    model = accelerator.prepare(model)
    accelerator.print("2")
```
Expected behavior
Running command: `accelerate launch --gpu_ids 0,1,2,3 --main_process_port 29501 ./acc_test.py`
Output: only "1" is printed; `accelerator.prepare(model)` then hangs.
Running command: `NCCL_DEBUG=INFO accelerate launch --gpu_ids 0,1,2,3 --main_process_port 29501 ./acc_test.py`
and the debug info is:
```
1
rk019192:2421685:2421685 [0] NCCL INFO Bootstrap : Using eno1:192.168.58.3<0>
rk019192:2421685:2421685 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
rk019192:2421685:2421685 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.14.3+cuda11.8
rk019192:2421686:2421686 [1] NCCL INFO cudaDriverVersion 12010
rk019192:2421687:2421687 [2] NCCL INFO cudaDriverVersion 12010
rk019192:2421688:2421688 [3] NCCL INFO cudaDriverVersion 12010
rk019192:2421687:2421687 [2] NCCL INFO Bootstrap : Using eno1:192.168.58.3<0>
rk019192:2421686:2421686 [1] NCCL INFO Bootstrap : Using eno1:192.168.58.3<0>
rk019192:2421686:2421686 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
rk019192:2421687:2421687 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
rk019192:2421688:2421688 [3] NCCL INFO Bootstrap : Using eno1:192.168.58.3<0>
rk019192:2421688:2421688 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
rk019192:2421685:2421710 [0] NCCL INFO Failed to open libibverbs.so[.1]
rk019192:2421685:2421710 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.58.3<0>
rk019192:2421685:2421710 [0] NCCL INFO Using network Socket
rk019192:2421687:2421711 [2] NCCL INFO Failed to open libibverbs.so[.1]
rk019192:2421688:2421713 [3] NCCL INFO Failed to open libibverbs.so[.1]
rk019192:2421686:2421712 [1] NCCL INFO Failed to open libibverbs.so[.1]
rk019192:2421686:2421712 [1] NCCL INFO NET/Socket : Using [0]eno1:192.168.58.3<0>
rk019192:2421686:2421712 [1] NCCL INFO Using network Socket
rk019192:2421687:2421711 [2] NCCL INFO NET/Socket : Using [0]eno1:192.168.58.3<0>
rk019192:2421688:2421713 [3] NCCL INFO NET/Socket : Using [0]eno1:192.168.58.3<0>
rk019192:2421687:2421711 [2] NCCL INFO Using network Socket
rk019192:2421688:2421713 [3] NCCL INFO Using network Socket
rk019192:2421688:2421713 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
rk019192:2421685:2421710 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
rk019192:2421686:2421712 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
rk019192:2421687:2421711 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
rk019192:2421686:2421712 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
rk019192:2421685:2421710 [0] NCCL INFO Channel 00/02 : 0 1 2 3
rk019192:2421685:2421710 [0] NCCL INFO Channel 01/02 : 0 1 2 3
rk019192:2421687:2421711 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
rk019192:2421685:2421710 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
rk019192:2421688:2421713 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
rk019192:2421685:2421710 [0] NCCL INFO Channel 00/0 : 0[4f000] -> 1[57000] via P2P/IPC
rk019192:2421685:2421710 [0] NCCL INFO Channel 01/0 : 0[4f000] -> 1[57000] via P2P/IPC
rk019192:2421688:2421713 [3] NCCL INFO Channel 00 : 3[d5000] -> 0[4f000] via SHM/direct/direct
rk019192:2421688:2421713 [3] NCCL INFO Channel 01 : 3[d5000] -> 0[4f000] via SHM/direct/direct
rk019192:2421686:2421712 [1] NCCL INFO Channel 00 : 1[57000] -> 2[d1000] via SHM/direct/direct
rk019192:2421686:2421712 [1] NCCL INFO Channel 01 : 1[57000] -> 2[d1000] via SHM/direct/direct
rk019192:2421687:2421711 [2] NCCL INFO Channel 00/0 : 2[d1000] -> 3[d5000] via P2P/IPC
rk019192:2421687:2421711 [2] NCCL INFO Channel 01/0 : 2[d1000] -> 3[d5000] via P2P/IPC
rk019192:2421687:2421711 [2] NCCL INFO Connected all rings
rk019192:2421688:2421713 [3] NCCL INFO Connected all rings
rk019192:2421688:2421713 [3] NCCL INFO Channel 00/0 : 3[d5000] -> 2[d1000] via P2P/IPC
rk019192:2421688:2421713 [3] NCCL INFO Channel 01/0 : 3[d5000] -> 2[d1000] via P2P/IPC
rk019192:2421687:2421711 [2] NCCL INFO Channel 00 : 2[d1000] -> 1[57000] via SHM/direct/direct
rk019192:2421687:2421711 [2] NCCL INFO Channel 01 : 2[d1000] -> 1[57000] via SHM/direct/direct
rk019192:2421685:2421710 [0] NCCL INFO Connected all rings
rk019192:2421688:2421713 [3] NCCL INFO Connected all trees
rk019192:2421688:2421713 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
rk019192:2421688:2421713 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
rk019192:2421686:2421712 [1] NCCL INFO Connected all rings
rk019192:2421686:2421712 [1] NCCL INFO Channel 00/0 : 1[57000] -> 0[4f000] via P2P/IPC
rk019192:2421686:2421712 [1] NCCL INFO Channel 01/0 : 1[57000] -> 0[4f000] via P2P/IPC
rk019192:2421685:2421710 [0] NCCL INFO Connected all trees
rk019192:2421685:2421710 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
rk019192:2421685:2421710 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
rk019192:2421686:2421712 [1] NCCL INFO Connected all trees
rk019192:2421686:2421712 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
rk019192:2421686:2421712 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
rk019192:2421687:2421711 [2] NCCL INFO Connected all trees
rk019192:2421687:2421711 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
rk019192:2421687:2421711 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
rk019192:2421687:2421711 [2] NCCL INFO comm 0x593a4a50 rank 2 nranks 4 cudaDev 2 busId d1000 - Init COMPLETE
rk019192:2421688:2421713 [3] NCCL INFO comm 0x58d4a600 rank 3 nranks 4 cudaDev 3 busId d5000 - Init COMPLETE
rk019192:2421685:2421710 [0] NCCL INFO comm 0x593e7310 rank 0 nranks 4 cudaDev 0 busId 4f000 - Init COMPLETE
rk019192:2421686:2421712 [1] NCCL INFO comm 0x56d02ca0 rank 1 nranks 4 cudaDev 1 busId 57000 - Init COMPLETE
```
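Since the log shows all four ranks reaching `Init COMPLETE`, it may help to check whether a bare `torch.distributed` collective also hangs outside of Accelerate. The sketch below is only a suggested debugging step; the file name `nccl_check.py` is made up and not from this thread.

```python
# nccl_check.py -- minimal NCCL sanity check (illustrative, not from the thread).
# Run with: torchrun --nproc_per_node=4 nccl_check.py
# If this also hangs, the problem is in NCCL/torch.distributed rather than in Accelerate.
import os
import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # a basic collective, similar to what DDP does when wrapping the model
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {x.item()}")
    dist.destroy_process_group()
```

If all four ranks print, the NCCL setup itself is likely fine and the hang is more likely on the Accelerate/DDP side.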
So any help would be appreciated.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 21
My problem is the same as the above. During multi-machine training, a similar hang occurs at accelerator.prepare(model), pointing to an NCCL error. With others' help I solved the problem and training ran successfully. Before training, execute the following on the command line:
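(The exact commands did not come through in the text above. Commonly-suggested NCCL settings for this kind of hang look like the sketch below; whether they match the commenter's actual fix is an assumption. The interface name `eno1` is taken from the NCCL log earlier in this issue.)

```python
# Sketch only: the commenter's actual command-line fix was not preserved.
# These are commonly-suggested NCCL environment variables; they can equivalently
# be exported in the shell before running `accelerate launch`.
import os

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eno1")  # pin NCCL to the NIC seen in the log (assumption)
os.environ.setdefault("NCCL_IB_DISABLE", "1")        # skip InfiniBand; libibverbs was not found in the log

from accelerate import Accelerator

accelerator = Accelerator()  # NCCL reads the variables above when it initializes
```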
Thank you for your attention, but I hit the same problem: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
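errno 98 usually means another (possibly stale) process is still bound to port 29500, so passing a port that is actually free to `--main_process_port` is a common workaround. A small illustrative helper for finding one (not from the thread):

```python
# find_free_port.py -- print an unused TCP port to pass to
#   accelerate launch --main_process_port <port> ...
# Illustrative helper, not part of the original thread.
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any currently unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

if __name__ == "__main__":
    print(find_free_port())
```

Alternatively, `ss -ltnp | grep 29500` shows which process is still holding the default port.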