pytorch-lightning: global process count incorrect with elastic, fault tolerant training

πŸ› Bug

Problem

The total number of processes (world size) is set incorrectly.

Context

I am trying to run elastic training with torchelastic. I have tried both the gloo and nccl backends.

Error message

Error coming from gloo backend:

Traceback (most recent call last):
  File "train_hydra.py", line 20, in hydra_main
    train(cfg)
  File "/bdata/bdata1/sribkain/learnseis/learnseis/training.py", line 39, in train
    t.fit(module, data_module)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 525, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 243, in pre_dispatch
    self.init_ddp_connection(self.global_rank, self.world_size)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 226, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 432, in init_process_group
    timeout=timeout)
  File "/ldata/Code/salt-identification/SRIBKAIN_ENVS/pl_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 503, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/context.cc:27] rank < size. 13 vs 8

The NCCL backend gives this error: https://github.com/pytorch/pytorch/issues/20313
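For context, the gloo enforce failure above means this process was handed global rank 13 while the process group was initialized with a world size of 8; gloo requires rank < size when it builds its communication context. The following single-process sketch (hypothetical, using a file:// store so no other workers are needed for rendezvous) triggers the same failure:

import os
import tempfile

import torch.distributed as dist

# A file:// store only needs a path this single process can create.
init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")

# Assumption: the launcher handed this process rank 13 after a re-rendezvous,
# while the world size passed in is still the stale value of 8.  gloo's
# context constructor enforces rank < size and fails like the traceback above.
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{init_file}",
    rank=13,
    world_size=8,
)
# RuntimeError: [enforce fail at .../gloo/context.cc:27] rank < size. 13 vs 8

In the elastic run, rank 13 presumably comes from the launcher after more workers joined, while the world size of 8 is a value cached before the membership changed.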

Please reproduce using the BoringModel

I am running the ImageNet example from PL using torchvision.models.resnet34. Happy to reproduce with the BoringModel if needed.
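For reference, a minimal stand-in sketch of what the training script reduces to (hypothetical module and dataset names; the real script uses hydra configs and torchvision.models.resnet34, but the model itself is irrelevant to the rank issue):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class MinimalModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def test_step(self, batch, batch_idx):
        self.log("test_loss", self.layer(batch).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=8,              # GPUs per node
        num_nodes=2,         # assumption: two nodes joined the rendezvous
        accelerator="ddp",   # the DDP plugin used with PL 1.2.x
        max_epochs=1,
    )
    trainer.fit(MinimalModel(), DataLoader(RandomDataset(), batch_size=8))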

Before launching, I exported the GLOO_SOCKET_IFNAME variable and set it to the appropriate network interface name.

On node 0:

PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml

On node 1:

PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml
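As a quick diagnostic, each worker can print the distributed variables it received from the launcher (the exact set of variables is an assumption based on torchelastic 0.2.x behaviour; anything missing is shown as unset):

import os

# Print the per-worker distributed environment handed down by the launcher.
# After a scale-up, RANK and WORLD_SIZE here should stay consistent with each
# other, which is what the failing init_process_group call disagrees with.
for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "GROUP_RANK",
            "MASTER_ADDR", "MASTER_PORT", "GLOO_SOCKET_IFNAME"):
    print(f"{key}={os.environ.get(key, '<unset>')}")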

To Reproduce

Use the following BoringModel and post it here

Expected behavior

To be able to run distributed, fault-tolerant training 😃

Environment


Output of collect_env_details.py:

* CUDA:
        - GPU:
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.2
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 1.2.6
        - tqdm:              4.48.2
        - torchelastic:      0.2.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                -
        - processor:         x86_64
        - python:            3.7.7
        - version:           #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 23 (14 by maintainers)

Most upvoted comments

@srib Would you like to try out my branch #6941? I believe I have the major stuff sorted out. Just not 100% sure about all the multi-node stuff.

The next release is planned for tomorrow. This is the branch that we will release as 1.2.8: #6983

@awaelchli Sorry for the delay in responding. I had to make a few tweaks to my code as it was using the hydra configs from the latest master.

I can confirm that it works fine: it now assigns the global ranks correctly. Please note that I only tested it with the gloo backend. Let me also run a test with the nccl backend and I will confirm that it fixes my problem.

True. It would be good to know why this is there in the first place … 😦 Maybe the least serious bug here, because it happens in post_dispatch, so basically at the end of training. Good observation.

the issue is if someone runs trainer.fit() and then trainer.test() right afterward
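For concreteness, the sequence referred to looks like this (re-using MinimalModel and RandomDataset from the sketch above):

# Second entry into the trainer right after fit(), which is where state torn
# down in post_dispatch at the end of the first run gets re-used.
model = MinimalModel()
loader = DataLoader(RandomDataset(), batch_size=8)

trainer = pl.Trainer(gpus=8, num_nodes=2, accelerator="ddp", max_epochs=1)
trainer.fit(model, loader)
trainer.test(model, loader)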

@srib OK, I think I now understand the bug you are experiencing. The world size changes in the middle of training, but we keep the old value saved. So we should always read the world size from the cluster environment (which will read it from the environment variables).
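A minimal sketch of that idea (illustration only, not the actual Lightning ClusterEnvironment implementation): a helper that reads WORLD_SIZE and RANK from the environment on every call instead of caching them at setup time.

import os


class ElasticEnvironmentSketch:
    # Hypothetical helper, not the real Lightning API.

    def world_size(self) -> int:
        # torchelastic exports WORLD_SIZE to every worker it launches
        return int(os.environ["WORLD_SIZE"])

    def global_rank(self) -> int:
        return int(os.environ["RANK"])


# A DDP plugin would call env.world_size() right before
# torch.distributed.init_process_group(...) instead of re-using a value
# captured before the cluster membership changed.
env = ElasticEnvironmentSketch()
print(env.global_rank(), env.world_size())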

Sounds good, yes. Let’s try it. This will affect all major plugins, so we need to be careful 😃