DeepSpeed: [RuntimeError: Connection reset by peer] When scaling up training jobs

I am facing a similar problem as the one posted by @g-karthik in https://github.com/microsoft/DeepSpeed/issues/570#issuecomment-750744107.

When I use 40 nodes with 10 GPUs on each node (400 processes), the training works well. But when I scale the training up to more nodes, deepspeed.init_distributed() fails with:

Traceback (most recent call last):
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 947, in <module>
    main()
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 769, in main
    initialize_distributed(args)
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 703, in initialize_distributed
    deepspeed.init_distributed(distributed_port=29501)
  File "/home/hanwentao/.local/lib/python3.8/site-packages/deepspeed-0.3.11+4f1d827-py3.8.egg/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: Connection reset by peer

I am using the DeepSpeed version from the master branch. I ran my script with mpirun, as described in https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility.

Any ideas on what’s going on?
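For reference, this is roughly what my initialization code does (a simplified sketch, not the exact Megatron-LM code; the helper name and the final barrier are just illustrative):

import deepspeed
import torch

def initialize_distributed():
    # With an mpirun launch, DeepSpeed picks up rank/world size from the MPI
    # environment and then calls torch.distributed.init_process_group() itself.
    deepspeed.init_distributed(dist_backend="nccl", distributed_port=29501)

    # Illustrative sanity check: every rank must be able to reach the others here.
    # The "Connection reset by peer" in the traceback is raised from the barrier
    # that init_process_group() runs internally once all ranks have connected.
    torch.distributed.barrier()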

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Hey @jeffra! It looks like NVIDIA published a new Docker image for the latest PyTorch, with NCCL 2.9.6:

https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-04.html#rel_21-04

Seems like the LD_PRELOAD hack won't be needed anymore? I see your PyTorch PRs haven't been merged, but I am assuming they're not needed.

Does DeepSpeed support this base image?
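A quick way to sanity-check this inside the container (illustrative snippet; the exact output format depends on the torch build):

import torch

# Print the NCCL version PyTorch links against; newer builds return a
# (major, minor, patch) tuple, older ones a single integer such as 2708.
print("torch:", torch.__version__)
print("NCCL:", torch.cuda.nccl.version())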

Hi @g-karthik, it should probably work? But I have not tried it myself with that torch version. Sorry for a less-than-confident answer there, haha 😃