accelerate: Multi-node setup, host can't connect to its own provided IP address

Hi 🤗 I have 2 nodes, each with 8xA100s, for a total of 16 GPUs. I’m using SLURM to launch the jobs; SLURM scripts for the curious: https://rentry.co/9geu8n

Here, the main script uses the allotted 2 nodes and runs srun over them, i.e. each node is given the PY file to execute once.
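Roughly, the per-node plumbing looks like this (an illustrative sketch based on standard SLURM environment variables, not the exact code from the linked scripts):

import os
import subprocess

# Illustrative sketch: derive each node's rank and the master address from SLURM.
node_rank = int(os.environ["SLURM_NODEID"])      # 0 on the first node, 1 on the second
node_count = int(os.environ["SLURM_NNODES"])     # 2 in this setup
# The first hostname in the allocation acts as the rendezvous master.
master_addr = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    text=True,
).split()[0]
master_port = 16543                              # any free port agreed on by all nodes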

Env

  • Accelerate version: 0.13.0.dev0
  • Platform: Linux-5.10.126-117.518.amzn2.x86_64-x86_64-with-glibc2.10
  • Python version: 3.8.13
  • Numpy version: 1.22.4
  • PyTorch version (GPU?): 1.13.0a0+08820cb (True)
  • Accelerate default config: Not found

Now, I noticed a peculiar behavior. When I’m on a single node (no SLURM, no multi-node, only multi-GPU) and run this:

accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
--mixed_precision fp16 --machine_rank 0 --main_process_ip 172.... --main_process_port 69420 \
\
scripts/...

The script won’t run - the command returns immediately, and I’m back at the command prompt again with no stdout or stderr.

But with

accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
--mixed_precision fp16  \
\
scripts/torch_...

It works fine. The script runs on the 8 GPUs on its own, and I can monitor the WandB logs.

This is a little quirk that puzzled me, and I can make neither head nor tail of it. I suspect it might mean something to someone here…
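In case it helps with triaging, a throwaway connectivity check like the one below (placeholder IP, port, and file name - call it check_port.py) can verify whether a node actually reaches the IP/port it is handed:

# check_port.py (hypothetical helper): run `python check_port.py server` on the main
# node, then `python check_port.py client` on the other node - and on the main node
# itself, to test the "can't connect to its own IP" case.
import socket
import sys

MASTER_ADDR = "172.31.x.x"   # placeholder for the --main_process_ip value
MASTER_PORT = 16543          # placeholder for the --main_process_port value

if len(sys.argv) > 1 and sys.argv[1] == "server":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((MASTER_ADDR, MASTER_PORT))
    s.listen(1)
    print("listening; waiting for a peer...")
    conn, peer = s.accept()
    print(f"accepted connection from {peer}")
else:
    c = socket.create_connection((MASTER_ADDR, MASTER_PORT), timeout=10)
    print("connected OK")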


Multi-node training

For multi-node training, this is the PY script being executed: https://rentry.co/tz465

  • This script works correctly for multi-GPU cases, but NOT for multi-node

Most of it is standard snippets, but it may have some glaring flaw.
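For context, the core of it follows the usual Accelerate pattern, roughly like this (a condensed, illustrative sketch with a placeholder model and data, not the actual script):

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up rank/world size from the launcher's environment

model = torch.nn.Linear(10, 10)                                               # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 10), batch_size=8)  # placeholder data

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()   # dummy loss
    accelerator.backward(loss)          # replaces loss.backward()
    optimizer.step()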

Output:

This is the output of the main sbatch script (the one that tells SLURM to deploy the job):

Number of Nodes: 2
Name of all Hosts: gpu-st-p4d-24xlarge-60 gpu-st-p4d-24xlarge-61 # two nodes here, each 8xA100s
Master IP: 172.3.... # IP address of the main node
MASTER_PORT= 16543
ID: 0 # Each node reporting its RANK
ID: 1
NODE_COUNT=2 #number of nodes deployed

[18:14:34] WARNING  The following values were not passed to        launch.py:838
                    `accelerate launch` and had defaults used                   
                    instead:                                                    
                            `--num_cpu_threads_per_process` was                 
                    set to `48` to improve out-of-box performance               
                    To avoid this warning pass in values for each               
                    of the problematic parameters or run                        
                    `accelerate config`.                                        
[18:14:35] WARNING  The following values were not passed to        launch.py:838
                    `accelerate launch` and had defaults used                   
                    instead:                                                    
                            `--num_cpu_threads_per_process` was                 
                    set to `48` to improve out-of-box performance               
                    To avoid this warning pass in values for each               
                    of the problematic parameters or run                        
                    `accelerate config`.  
{Waiting about 15 mins}

[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).
[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).

Trying other random ports makes no difference.

I think it might be connected to the single-node quirk described above. Does anyone have any ideas?

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 21

Most upvoted comments

@neel04 should have a solution here soon for you to try. Thanks for the clear bug report!

An interesting bag of results. Using the new torch.distributed.launch commands, the first one half works - it complains about local_rank, but it waits at the ***** Setting part unless I run the same command on the second machine - which implies there is at least some inter-node communication.

I feel the error could be resolved with some effort; I’ll post an update on that later 😃

The second command seems to work quite well 👌 I wasn’t able to train for more than a couple of steps (pre-emption), but the synchronized initial loss leads me to believe that the parameters were at least synced initially - and since training worked, inter-node comms are working.

So it appears there is some problem with accelerate’s networking in the multi-node setup. While torchrun works, I think I might need to add AMP to my setup for fp16 (rough sketch below). I’d still love to get to the core of this issue so that future users have no problems - and as such, I’m up for further debugging and testing on my side 🤗 Let me know if there’s anything else you’d like me to try that would help triage the bug! 🚀
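For the fp16 piece, if I end up sticking with torchrun, the fallback would presumably be plain torch.cuda.amp rather than accelerate’s --mixed_precision flag - something like this untested sketch (placeholder model and data):

import torch
from torch.cuda.amp import GradScaler, autocast

# Untested sketch: manual fp16 with torch.cuda.amp when not using accelerate's flag.
device = "cuda"
model = torch.nn.Linear(10, 10).to(device)                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = torch.randn(64, 10, device=device)                   # placeholder data

scaler = GradScaler()
for batch in data.split(8):
    optimizer.zero_grad()
    with autocast():                      # forward pass in mixed precision
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales grads; skips the step on inf/nan
    scaler.update()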

I’ve put the error traceback for the first command just in case, though I’m pretty sure I can get it to work.

Error @ command - 1 [Main Host]

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
                         [--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
                         [--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=0
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
                         [--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
                         [--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 5445) of binary: /home/awesome/awesome/anaconda3/envs/SUMO_dist/bin/python
Traceback (most recent call last):
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/torch_convnext.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-22_19:41:49
  host      : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 5446)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-22_19:41:49
  host      : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 5445)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
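The unrecognized --local_rank errors should be easy to fix by following the deprecation warning above, i.e. either accepting the flag or reading it from the environment - roughly this change to the argparse setup in torch_convnext.py (untested sketch):

import argparse
import os

parser = argparse.ArgumentParser()
# ... the existing arguments (--model_name, --batch_size, ...) stay as they are ...
# Option 1: accept the flag that torch.distributed.launch appends by default.
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# Option 2 (what torchrun / --use_env expects): read it from the environment.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))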

@neel04 actually it’s now in main 😅 so just do pip install git+https://github.com/huggingface/accelerate!

@neel04, w.r.t. single-node multi-GPU, I am unable to reproduce the error; the command below works fine:

accelerate launch --num_processes 2 --num_machines 1 --multi_gpu --mixed_precision "fp16" --machine_rank 0 \
--main_process_ip "192.xxx.x.xx" --main_process_port 8888 accelerate/examples/nlp_example.py 

W.r.t. multi-node multi-GPU, do you observe issues when launching with torchrun or torch.distributed.launch? Could you try the ways of launching below and check? Using torch.distributed.launch:

NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 --use_env accelerate/examples/nlp_example.py

Using torchrun:

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 accelerate/examples/nlp_example.py