accelerate: Freeze on InfiniBand with Slurm
System Info
- `Accelerate` version: 0.19.0
- Platform: Linux-4.18.0-425.13.1.el8_7.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.4
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.12.0 (False)
- System RAM: 503.49 GB
- `Accelerate` default config:
Not found
Using the command-line parameters:
srun bash -c 'accelerate launch \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--multi_gpu \
--mixed_precision=no \
--num_processes=$(($NNODES * 4)) \
--dynamo_backend=no \
--num_machines=$NNODES \
--machine_rank=$SLURM_NODEID \
--rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT,rdzv_backend=c10d" \
distrib.py'
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Fails with multi-node: the job freezes until it times out.
This machine has its InfiniBand interfaces suffixed with `i`, so a compute node responds to `hostname` with something like `juwels07`, but the right interface is `juwels07i`. There is some script magic for that in the launch script below.
The Slurm launch script is:
#!/bin/bash -x
#SBATCH --account=training2306
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4
# srun does not inherit cpus-per-task from sbatch
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"
export MASTER_PORT=7010
export GPUS_PER_NODE=4
export NNODES=$SLURM_JOB_NUM_NODES
# do not remove; without this workaround the training will hang and nodes will be lost
export CUDA_LAUNCH_BLOCKING=1
# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json
# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# handle timeouts
export NCCL_IB_TIMEOUT=20
# Make sure we are in the right directory
cd $HOME/2023-may-intro-to-supercompting-jsc/src
# This loads modules and python packages
source sc_venv_template/activate.sh
export LOGLEVEL=INFO
# Run the demo
time srun bash -c 'accelerate launch \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--multi_gpu \
--mixed_precision=no \
--num_processes=$(($NNODES * 4)) \
--dynamo_backend=no \
--num_machines=$NNODES \
--machine_rank=$SLURM_PROCID \
--rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT rdzv_backend=c10d" \
distrib.py'
This is the error output, and it stays like this until the job times out (the identical job but with only one node works):
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : distrib.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 4
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.13.23.78:7010
rdzv_configs : {'rdzv_endpoint': '10.13.23.78:7010', 'rdzv_backend': 'c10d', 'rank': 1, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : distrib.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 4
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.13.23.78:7010
rdzv_configs : {'rdzv_endpoint': '10.13.23.78:7010', 'rdzv_backend': 'c10d', 'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic__7nlgv08/none_jscf2i4f
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=10.13.23.78
master_port=7010
group_rank=1
group_world_size=2
local_ranks=[0, 1, 2, 3]
role_ranks=[4, 5, 6, 7]
global_ranks=[4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/3/error.json
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_j88hm8vm/none_tgket3up
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=10.13.23.78
master_port=7010
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2, 3]
role_ranks=[0, 1, 2, 3]
global_ranks=[0, 1, 2, 3]
role_world_sizes=[8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/3/error.json
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[... the client socket warning above repeats until the job times out]
This is the tail of the NCCL_DEBUG output:
jwb0093:11954:12012 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
jwb0093:11954:12012 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
jwb0093:11953:12015 [1] NCCL INFO Connected all trees
jwb0093:11953:12015 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
jwb0093:11953:12015 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
jwb0092:26778:26839 [2] NCCL INFO comm 0x3792dff0 rank 2 nranks 8 cudaDev 2 busId 84000 - Init COMPLETE
jwb0092:26776:26834 [0] NCCL INFO comm 0x37fb73a0 rank 0 nranks 8 cudaDev 0 busId 3000 - Init COMPLETE
jwb0092:26777:26840 [1] NCCL INFO comm 0x37e42e20 rank 1 nranks 8 cudaDev 1 busId 44000 - Init COMPLETE
jwb0093:11955:12013 [3] NCCL INFO comm 0x36e7c930 rank 7 nranks 8 cudaDev 3 busId c4000 - Init COMPLETE
jwb0092:26779:26841 [3] NCCL INFO comm 0x38ee4b30 rank 3 nranks 8 cudaDev 3 busId c4000 - Init COMPLETE
jwb0093:11953:12015 [1] NCCL INFO comm 0x379b7bc0 rank 5 nranks 8 cudaDev 1 busId 44000 - Init COMPLETE
jwb0093:11952:12014 [0] NCCL INFO comm 0x36f49380 rank 4 nranks 8 cudaDev 0 busId 3000 - Init COMPLETE
jwb0093:11954:12012 [2] NCCL INFO comm 0x36be8130 rank 6 nranks 8 cudaDev 2 busId 84000 - Init COMPLETE
The full error and output are here https://gist.github.com/surak/5f3f236616e5db48f19d31df457b4350
Expected behavior
A similar script works on an Ethernet cluster. I would like to see what is actually frozen, but there is no output other than the above.
About this issue
- State: closed
- Created a year ago
- Comments: 17
Got it, I’m with you now.
@surak we’ll have a new release soon on our usual release cycle
Works for me indeed! Thanks a lot!
In summary, the command @surak posted should be the following (just removing the `--same_network` line):
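The exact command block from that comment is not preserved in this copy of the thread; a sketch reconstructed from the reproduction above, assuming the `--rdzv_backend` flag added in https://github.com/huggingface/accelerate/pull/1490, would look roughly like this:

```bash
# Sketch only: the srun command from the reproduction, with the rendezvous
# backend forced to c10d instead of the default "static".
srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES \
    --machine_rank=$SLURM_PROCID \
    --rdzv_backend=c10d \
    distrib.py'
```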
The issue I linked to is patched with the PyTorch module we supply on the machine, but it still requires usage of `--rdzv_backend c10d` and `--rdzv_endpoint [...]`, IIRC. That's why `--rdzv_backend static` can't work on our machine. It's complex and nasty, but that's how we manage to work around the issue. :p

@muellerzr I'm pretty sure it's correct that `rdzv_backend` was the issue. Our machine has had trouble since forever with `torch.distributed.run` (which `accelerate` uses): https://github.com/pytorch/pytorch/issues/73656 That's the reason why I forced `c10d` usage via the YAML and it started working.

Np @janEbert! It was on my list to get to; with how our CLI works, as long as our argparser knows the param to use, that is usually all it takes to get it up and going (you can see I literally added a single line in the PR here: https://github.com/huggingface/accelerate/pull/1490)
(I also want to mildly change the internals so it's not as static as it is right now, but that's a when-I-have-time thing.)
Thanks @muellerzr for the quick response, this was really lacking from `accelerate`. 😃 There were other issues regarding `--rdzv_backend` that were closed (e.g. https://github.com/huggingface/accelerate/pull/1337), but it makes total sense to have this parameter.

The issue is that the command-line settings don't properly set the `rdzv_backend` (which defaults to `static`), while the YAML file does.

Update:
Turns out that there’s something wrong with processing of the command line.
If I create a YAML file with the same parameters per node, I get it to run just fine:
So, this works:
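The YAML file itself is not preserved here; a minimal sketch of such a per-node config, assuming the standard `accelerate config` field names and the values from the launch command above:

```yaml
# Sketch only: field names assumed from a standard `accelerate config` file,
# values taken from the launch command above.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_process_ip: 10.13.23.78   # $MASTER_ADDR resolved to the InfiniBand IP
main_process_port: 7010
machine_rank: 0                # set to 1 on the second node
num_machines: 2
num_processes: 8
mixed_precision: 'no'
rdzv_backend: c10d             # the key difference from the failing run ("static")
same_network: false
use_cpu: false
```

Each node then launches with something like `srun accelerate launch --config_file <config>.yaml distrib.py` (config file name hypothetical).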