accelerate: Freeze on InfiniBand with Slurm
System Info
- `Accelerate` version: 0.19.0
- Platform: Linux-4.18.0-425.13.1.el8_7.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.4
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.12.0 (False)
- System RAM: 503.49 GB
- `Accelerate` default config:
Not found
Using the command-line parameters:
srun bash -c 'accelerate launch \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--multi_gpu \
--mixed_precision=no \
--num_processes=$(($NNODES * 4)) \
--dynamo_backend=no \
--num_machines=$NNODES \
--machine_rank=$SLURM_NODEID \
--rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT,rdzv_backend=c10d" \
distrib.py'
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Fails with multi-node: the job freezes until it times out.
This machine has its InfiniBand interfaces suffixed with `i`, so a compute node responds to `hostname` with something like `juwels07`, but the right interface is `juwels07i`. There is some script magic for that in the launch script below.
The Slurm launch script is:
#!/bin/bash -x
#SBATCH --account=training2306
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4
# srun does not inherit cpus-per-task from sbatch
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"
export MASTER_PORT=7010
export GPUS_PER_NODE=4
export NNODES=$SLURM_JOB_NUM_NODES
# do not remove; without this workaround the training will hang and nodes will be lost
export CUDA_LAUNCH_BLOCKING=1
# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json
# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# handle timeouts
export NCCL_IB_TIMEOUT=20
# Make sure we are in the right directory
cd $HOME/2023-may-intro-to-supercompting-jsc/src
# This loads modules and python packages
source sc_venv_template/activate.sh
export LOGLEVEL=INFO
# Run the demo
time srun bash -c 'accelerate launch \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--multi_gpu \
--mixed_precision=no \
--num_processes=$(($NNODES * 4)) \
--dynamo_backend=no \
--num_machines=$NNODES \
--machine_rank=$SLURM_PROCID \
--rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT rdzv_backend=c10d" \
distrib.py'
This is the error output, and it stays like this until the job times out (the identical job but with only one node works):
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : distrib.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 4
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.13.23.78:7010
rdzv_configs : {'rdzv_endpoint': '10.13.23.78:7010', 'rdzv_backend': 'c10d', 'rank': 1, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : distrib.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 4
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.13.23.78:7010
rdzv_configs : {'rdzv_endpoint': '10.13.23.78:7010', 'rdzv_backend': 'c10d', 'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic__7nlgv08/none_jscf2i4f
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=10.13.23.78
master_port=7010
group_rank=1
group_world_size=2
local_ranks=[0, 1, 2, 3]
role_ranks=[4, 5, 6, 7]
global_ranks=[4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/3/error.json
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_j88hm8vm/none_tgket3up
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=10.13.23.78
master_port=7010
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2, 3]
role_ranks=[0, 1, 2, 3]
global_ranks=[0, 1, 2, 3]
role_world_sizes=[8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/3/error.json
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[... the client socket warning above repeats until the job times out]
This is the tail of the NCCL_DEBUG output:
jwb0093:11954:12012 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
jwb0093:11954:12012 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
jwb0093:11953:12015 [1] NCCL INFO Connected all trees
jwb0093:11953:12015 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
jwb0093:11953:12015 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
jwb0092:26778:26839 [2] NCCL INFO comm 0x3792dff0 rank 2 nranks 8 cudaDev 2 busId 84000 - Init COMPLETE
jwb0092:26776:26834 [0] NCCL INFO comm 0x37fb73a0 rank 0 nranks 8 cudaDev 0 busId 3000 - Init COMPLETE
jwb0092:26777:26840 [1] NCCL INFO comm 0x37e42e20 rank 1 nranks 8 cudaDev 1 busId 44000 - Init COMPLETE
jwb0093:11955:12013 [3] NCCL INFO comm 0x36e7c930 rank 7 nranks 8 cudaDev 3 busId c4000 - Init COMPLETE
jwb0092:26779:26841 [3] NCCL INFO comm 0x38ee4b30 rank 3 nranks 8 cudaDev 3 busId c4000 - Init COMPLETE
jwb0093:11953:12015 [1] NCCL INFO comm 0x379b7bc0 rank 5 nranks 8 cudaDev 1 busId 44000 - Init COMPLETE
jwb0093:11952:12014 [0] NCCL INFO comm 0x36f49380 rank 4 nranks 8 cudaDev 0 busId 3000 - Init COMPLETE
jwb0093:11954:12012 [2] NCCL INFO comm 0x36be8130 rank 6 nranks 8 cudaDev 2 busId 84000 - Init COMPLETE
The full error and output are here https://gist.github.com/surak/5f3f236616e5db48f19d31df457b4350
Expected behavior
A similar script works on an Ethernet cluster. I would like to see what is actually frozen, but there is no output other than the above.
About this issue
- State: closed
- Created a year ago
- Comments: 17
Got it, I’m with you now.
@surak we’ll have a new release soon on our usual release cycle
Works for me indeed! Thanks a lot!
In summary, the command @surak posted should be the following (just removing the `--same_network` line):
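The exact command block from that comment is not preserved in this copy of the thread; a sketch reconstructed from the reproduction above, assuming the `--rdzv_backend` flag added in https://github.com/huggingface/accelerate/pull/1490, would look roughly like this:

```bash
# Sketch only: the srun command from the reproduction, with the rendezvous
# backend forced to c10d instead of the default "static".
srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES \
    --machine_rank=$SLURM_PROCID \
    --rdzv_backend=c10d \
    distrib.py'
```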
The issue I linked to is patched with the PyTorch module we supply on the machine, but it still requires usage of `--rdzv_backend c10d` and `--rdzv_endpoint [...]`, IIRC. That's why `--rdzv_backend static` can't work on our machine. It's complex and nasty, but that's how we manage to work around the issue. :p

@muellerzr I'm pretty sure it's correct that `rdzv_backend` was the issue. Our machine has had trouble since forever with `torch.distributed.run` (which `accelerate` uses): https://github.com/pytorch/pytorch/issues/73656 That's the reason why I forced `c10d` usage via the YAML and it started working.

Np @janEbert! It was on my list to get to; with how our CLI works, as long as our argparser knows the param to use, that is usually all it takes to get it up and going (you can see I literally added a single line in the PR here: https://github.com/huggingface/accelerate/pull/1490)
(I also want to mildly change the internals so it's not as static as it is right now, but that's a when-I-have-time thing.)
Thanks @muellerzr for the quick response, this was really lacking from `accelerate`. 😃 There were other issues regarding `--rdzv_backend` that were closed (e.g. https://github.com/huggingface/accelerate/pull/1337), but it makes total sense to have this parameter.

The issue is that the command-line settings don't properly set the `rdzv_backend` (which defaults to `static`), while the YAML file does.

Update:
Turns out that there’s something wrong with processing of the command line.
If I create a YAML file with the same parameters per node, I get it to run just fine:
So, this works:
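The YAML file itself is not preserved here; a minimal sketch of such a per-node config, assuming the standard `accelerate config` field names and the values from the launch command above:

```yaml
# Sketch only: field names assumed from a standard `accelerate config` file,
# values taken from the launch command above.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_process_ip: 10.13.23.78   # $MASTER_ADDR resolved to the InfiniBand IP
main_process_port: 7010
machine_rank: 0                # set to 1 on the second node
num_machines: 2
num_processes: 8
mixed_precision: 'no'
rdzv_backend: c10d             # the key difference from the failing run ("static")
same_network: false
use_cpu: false
```

Each node then launches with something like `srun accelerate launch --config_file <config>.yaml distrib.py` (config file name hypothetical).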