DeepSpeed: [RuntimeError: Connection reset by peer] When scaling up training jobs
I am facing a problem similar to the one @g-karthik posted in https://github.com/microsoft/DeepSpeed/issues/570#issuecomment-750744107.
When I use 40 nodes with 10 GPUs on each node (400 processes), training works well. But when I scale the training up to more nodes, deepspeed.init_distributed() fails with:
```
Traceback (most recent call last):
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 947, in <module>
    main()
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 769, in main
    initialize_distributed(args)
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 703, in initialize_distributed
    deepspeed.init_distributed(distributed_port=29501)
  File "/home/hanwentao/.local/lib/python3.8/site-packages/deepspeed-0.3.11+4f1d827-py3.8.egg/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: Connection reset by peer
```
I used the DeepSpeed version from the master branch. I ran my script with mpirun, just as described in https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility.
Any ideas on what’s going on?
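For reference, the failing call boils down to roughly the following rendezvous. This is a sketch, not my actual script; it assumes the RANK, LOCAL_RANK, WORLD_SIZE, and MASTER_ADDR environment variables that DeepSpeed's MPI discovery exports from the mpirun environment:

```python
# Minimal sketch of the rendezvous that deepspeed.init_distributed() performs,
# stripped of the training script, for isolating the failure at scale.
# Assumes RANK, LOCAL_RANK, WORLD_SIZE, and MASTER_ADDR are set in the
# environment (DeepSpeed's MPI discovery derives them from mpirun).
import datetime
import os

import torch
import torch.distributed as dist


def try_rendezvous(port=29501):
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # Pin each rank to its GPU before creating NCCL communicators.
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://{}:{}".format(os.environ["MASTER_ADDR"], port),
        rank=rank,
        world_size=world_size,
        # Generous timeout (the default is 30 minutes). For the NCCL backend
        # the timeout is only enforced when NCCL_BLOCKING_WAIT=1 is set.
        timeout=datetime.timedelta(minutes=60),
    )
    dist.barrier()  # the call that raises "Connection reset by peer" above
    if rank == 0:
        print("rendezvous OK across {} ranks".format(world_size))


if __name__ == "__main__":
    try_rendezvous()
```

If something this small already fails at this rank count, the problem is in the TCP rendezvous itself rather than in DeepSpeed or Megatron-LM.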
Hey @jeffra! It looks like NVIDIA published a new Docker image for the latest PyTorch, with NCCL 2.9.6:
https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-04.html#rel_21-04
Seems like the LD_PRELOAD hack won't be needed anymore? I see your PyTorch PRs haven't been merged, but I'm assuming they're not needed.
Does DeepSpeed support this base image?
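For anyone verifying their own container, a quick way to confirm which NCCL version a given PyTorch build links against is `torch.cuda.nccl.version()`; the format of the return value varies across PyTorch releases, so this sketch just prints whatever comes back:

```python
# Check the NCCL version bundled with the PyTorch build. The return format of
# torch.cuda.nccl.version() differs across releases (an encoded int in older
# PyTorch, a (major, minor, patch) tuple in newer ones).
import torch

print("PyTorch:", torch.__version__)
print("NCCL:", torch.cuda.nccl.version())
```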
Hi @g-karthik, it should probably work? But I have not tried it myself with torch 1.9. Sorry for the less-than-confident answer there, haha 😃