accelerate: Multi-node setup, host can't connect to its own provided IP address

Hi 🤗 I have 2 nodes, each with 8xA100s, for a total of 16 GPUs. I’m using SLURM to launch the jobs; SLURM scripts for the curious: https://rentry.co/9geu8n

Here, the main script uses the allotted 2 nodes and runs srun over them, i.e. each node is given the PY file to execute once.
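Roughly, the per-node plumbing looks like this (an illustrative sketch based on standard SLURM environment variables, not the exact code from the linked scripts):

import os
import subprocess

# Illustrative sketch: derive each node's rank and the master address from SLURM.
node_rank = int(os.environ["SLURM_NODEID"])      # 0 on the first node, 1 on the second
node_count = int(os.environ["SLURM_NNODES"])     # 2 in this setup
# The first hostname in the allocation acts as the rendezvous master.
master_addr = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    text=True,
).split()[0]
master_port = 16543                              # any free port agreed on by all nodes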

Env

  • Accelerate version: 0.13.0.dev0
  • Platform: Linux-5.10.126-117.518.amzn2.x86_64-x86_64-with-glibc2.10
  • Python version: 3.8.13
  • Numpy version: 1.22.4
  • PyTorch version (GPU?): 1.13.0a0+08820cb (True)
  • Accelerate default config: Not found

Now, I noticed a peculiar behavior. When I’m on a single node (no SLURM, no multi-node, only multi-GPU) and run this:

accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
--mixed_precision fp16 --machine_rank 0 --main_process_ip 172.... --main_process_port 69420 \
\
scripts/...

The script won’t run - the command returns immediately, and I’m back at the command prompt again with no stdout or stderr.

But with

accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
--mixed_precision fp16  \
\
scripts/torch_...

It works fine. The script runs on the 8 GPUs on its own, and I can monitor the WandB logs.

This is a little quirk that puzzled me, and I can make neither head nor tail of it. I suspect it might mean something to someone here…
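In case it helps with triaging, a throwaway connectivity check like the one below (placeholder IP, port, and file name - call it check_port.py) can verify whether a node actually reaches the IP/port it is handed:

# check_port.py (hypothetical helper): run `python check_port.py server` on the main
# node, then `python check_port.py client` on the other node - and on the main node
# itself, to test the "can't connect to its own IP" case.
import socket
import sys

MASTER_ADDR = "172.31.x.x"   # placeholder for the --main_process_ip value
MASTER_PORT = 16543          # placeholder for the --main_process_port value

if len(sys.argv) > 1 and sys.argv[1] == "server":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((MASTER_ADDR, MASTER_PORT))
    s.listen(1)
    print("listening; waiting for a peer...")
    conn, peer = s.accept()
    print(f"accepted connection from {peer}")
else:
    c = socket.create_connection((MASTER_ADDR, MASTER_PORT), timeout=10)
    print("connected OK")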


Multi-node training

For multi-node training, this is the PY script being executed: https://rentry.co/tz465

  • This script works correctly for multi-GPU cases, but NOT for multi-node

Most of it is standard snippets, but it may have some glaring flaw.
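For context, the core of it follows the usual Accelerate pattern, roughly like this (a condensed, illustrative sketch with a placeholder model and data, not the actual script):

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up rank/world size from the launcher's environment

model = torch.nn.Linear(10, 10)                                               # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 10), batch_size=8)  # placeholder data

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()   # dummy loss
    accelerator.backward(loss)          # replaces loss.backward()
    optimizer.step()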

Output:

This is the output of the main sbatch script (the one that tells SLURM to deploy the job):

Number of Nodes: 2
Name of all Hosts: gpu-st-p4d-24xlarge-60 gpu-st-p4d-24xlarge-61 # two nodes here, each 8xA100s
Master IP: 172.3.... # IP address of the main node
MASTER_PORT= 16543
ID: 0 # Each node reporting its RANK
ID: 1
NODE_COUNT=2 #number of nodes deployed

[18:14:34] WARNING  The following values were not passed to        launch.py:838
                    `accelerate launch` and had defaults used                   
                    instead:                                                    
                            `--num_cpu_threads_per_process` was                 
                    set to `48` to improve out-of-box performance               
                    To avoid this warning pass in values for each               
                    of the problematic parameters or run                        
                    `accelerate config`.                                        
[18:14:35] WARNING  The following values were not passed to        launch.py:838
                    `accelerate launch` and had defaults used                   
                    instead:                                                    
                            `--num_cpu_threads_per_process` was                 
                    set to `48` to improve out-of-box performance               
                    To avoid this warning pass in values for each               
                    of the problematic parameters or run                        
                    `accelerate config`.  
{Waiting about 15 mins}

[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).
[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).

Trying other random ports makes no difference.

I think it might be connected to the single-node quirk described above. Does anyone have any ideas?

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 21

Most upvoted comments

@neel04 should have a solution here soon for you to try. Thanks for the clear bug report!

An interesting bag of results. Using the new torch.distributed.launch commands, the first one half works - it complains about local_rank, but it waits at the ***** Setting part unless I run the same command on the second machine - which implies there is at least some inter-node communication.

I feel the error could be resolved with some effort; I’ll post an update on that later 😃

The second command seems to work quite well 👌 I wasn’t able to train for more than a couple of steps (pre-emption), but the synchronized initial loss leads me to believe that the parameters were at least synced initially - and since training worked, inter-node comms are working.

So it appears there is some problem with accelerate’s networking in the multi-node setup. While torchrun works, I think I might need to add AMP to my setup for fp16 (rough sketch below). I’d still love to get to the core of this issue so that future users have no problems - and as such, I’m up for further debugging and testing on my side 🤗 Let me know if there’s anything else you’d like me to try that would help triage the bug! 🚀
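For the fp16 piece, if I end up sticking with torchrun, the fallback would presumably be plain torch.cuda.amp rather than accelerate’s --mixed_precision flag - something like this untested sketch (placeholder model and data):

import torch
from torch.cuda.amp import GradScaler, autocast

# Untested sketch: manual fp16 with torch.cuda.amp when not using accelerate's flag.
device = "cuda"
model = torch.nn.Linear(10, 10).to(device)                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = torch.randn(64, 10, device=device)                   # placeholder data

scaler = GradScaler()
for batch in data.split(8):
    optimizer.zero_grad()
    with autocast():                      # forward pass in mixed precision
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales grads; skips the step on inf/nan
    scaler.update()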

I’ve put the error traceback for the first command just in case, though I’m pretty sure I can get it to work.

Error @ command - 1 [Main Host]

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
                         [--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
                         [--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=0
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
                         [--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
                         [--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 5445) of binary: /home/awesome/awesome/anaconda3/envs/SUMO_dist/bin/python
Traceback (most recent call last):
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/torch_convnext.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-22_19:41:49
  host      : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 5446)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-22_19:41:49
  host      : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 5445)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
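The unrecognized --local_rank errors should be easy to fix by following the deprecation warning above, i.e. either accepting the flag or reading it from the environment - roughly this change to the argparse setup in torch_convnext.py (untested sketch):

import argparse
import os

parser = argparse.ArgumentParser()
# ... the existing arguments (--model_name, --batch_size, ...) stay as they are ...
# Option 1: accept the flag that torch.distributed.launch appends by default.
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# Option 2 (what torchrun / --use_env expects): read it from the environment.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))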

@neel04 actually it’s now in main 😅 so just do pip install git+https://github.com/huggingface/accelerate!

@neel04, w.r.t. single-node multi-GPU, I am unable to reproduce the error; the command below works fine:

accelerate launch --num_processes 2 --num_machines 1 --multi_gpu --mixed_precision "fp16" --machine_rank 0 \
--main_process_ip "192.xxx.x.xx" --main_process_port 8888 accelerate/examples/nlp_example.py 

W.r.t. multi-node multi-GPU, do you observe issues when launching with torchrun or torch.distributed.launch? Could you try the ways of launching below and check? Using torch.distributed.launch:

NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 --use_env accelerate/examples/nlp_example.py

Using torchrun:

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 accelerate/examples/nlp_example.py